PLOS Genetics. 2023 Feb 13;19(2):e1010410. doi: 10.1371/journal.pgen.1010410

Bayesian inference of admixture graphs on Native American and Arctic populations

Svend V Nielsen 1,#, Andrew H Vaughn 2,*,#, Kalle Leppälä 1,3, Michael J Landis 4, Thomas Mailund 1, Rasmus Nielsen 5,6
Editor: Sharon R Browning7
PMCID: PMC9956672  PMID: 36780565

Abstract

Admixture graphs are mathematical structures that describe the ancestry of populations in terms of divergence and merging (admixing) of ancestral populations as a graph. An admixture graph consists of a graph topology, branch lengths, and admixture proportions. The branch lengths and admixture proportions can be estimated using numerous numerical optimization methods, but inferring the topology involves a combinatorial search for which no polynomial algorithm is known. In this paper, we present a reversible jump MCMC algorithm for sampling high-probability admixture graphs and show that this approach works well both as a heuristic search for a single best-fitting graph and for summarizing shared features extracted from posterior samples of graphs. We apply the method to 11 Native American and Siberian populations and exploit the shared structure of high-probability graphs to characterize the relationship between Saqqaq, Inuit, Koryaks, and Athabascans. Our analyses show that the Saqqaq is not a good proxy for the previously identified gene flow from Arctic people into the Na-Dene speaking Athabascans.

Author summary

One way of summarizing historical relationships between genetic samples is by constructing an admixture graph. An admixture graph describes the demographic history of a set of populations as a directed acyclic graph representing population splits and mergers. The greedy search algorithms that are typically used to infer admixture graphs may fail to find the globally optimal graph. We here improve on these approaches by developing a novel MCMC sampling method, AdmixtureBayes, that can sample from the posterior distribution of admixture graphs. This enables an effective search of the entire state space as well as the ability to report a level of confidence in the sampled graphs. We apply AdmixtureBayes to a set of Native American and Arctic genomes to reconstruct the demographic history of these populations and report posterior probabilities of specific admixture events. While some previous studies have identified the ancient Saqqaq culture as a source of introgression into Athabascans, we instead find that it is the Siberian Koryak population, not the Saqqaq, that serves as the best proxy for gene flow into Athabascans.

Introduction

Admixture graphs [1] provide a concise description of the historical demographic relationships between genetic samples of populations, assuming their relationships are the product of discrete, instantaneous splits and admixture events. The assumption of discrete, instantaneous events is clearly an oversimplification for most real data, but it facilitates interpretation and makes admixture graphs a popular first step in analyses. Each graph topology is associated with parameters capturing genetic drift and admixture proportions, and once these are fitted to genetic data, the goodness of fit can be measured to determine how accurately the graph captures the historical relationship between samples. Inferring graph topologies, however, involves a combinatorial search, and since the space of graphs grows super-exponentially in the number of populations and the number of admixture events, an exhaustive search is typically not possible. Instead, the search for well-fitting topologies is often done manually or through greedy algorithms.

The most popular methods for estimating admixture graphs are TreeMix by Pickrell and Pritchard [2], qpGraph by Patterson et al. [1], and OrientAGraph by Molloy et al. [3], all of which take a greedy approach to searching the state space of graph topologies. qpGraph allows users to sequentially identify the best phylogenetic position of a possibly admixed population in a previously established admixture graph and evaluate the improved fit in terms of simple allele-sharing statistics. The program MixMapper by Lipson et al. [4] employs a similar strategy and has options for fitting up to two admixture events simultaneously. TreeMix estimates an admixture graph de novo by automatically estimating the best tree without admixture events followed by automatic, sequential insertion of the admixture branches. In contrast to MixMapper and qpGraph, TreeMix searches through potential admixture graphs without user input by way of an efficient greedy heuristic. OrientAGraph is based on the TreeMix approach, but adds a technique called maximum likelihood network orientation, which helps avoid getting stuck in incorrect local optima during the optimization process. The method miqograph employs mixed-integer quadratic optimization to search through the space of admixture graphs but restricts admixture events to the leaf nodes of the graph [5].

To penalize deviations between the expected and observed allele-sharing statistics, all five methods use a Gaussian model for the distribution of allele frequencies among populations. The implicit assumption in the Gaussian model is that changes in allele frequency due to genetic drift can be approximated as a Brownian motion process. This assumption dates back to the early work by Edwards and Cavalli-Sforza [6] and has recently re-emerged as a computationally attractive alternative to the full Wright-Fisher process. It has previously been used in several other methods aimed at modeling the joint distribution of allele frequencies among populations [7] [8] [9].

There are also phylogenetic network methods that infer admixture graphs using sets of locus-specific gene trees as nuisance parameters which are either pre-estimated [10] or integrated out using MCMC [11] [12]. These approaches must evaluate the likelihood of each gene tree separately, making them more computationally expensive and therefore limited to fewer populations than the Gaussian drift models. To handle larger datasets, some methods summarize all the gene trees into a few statistics that are evaluated with a pseudolikelihood [13] [14] for a small reduction in accuracy [14]. In terms of speed, these pseudolikelihood methods are similar to the Gaussian drift methods. However, the Gaussian approach offers a way to instead approximate a true likelihood (rather than a pseudolikelihood), which we use in this paper.

The greedy search algorithms used by current methods do not guarantee that the inferred graph is optimal. In practice, the optimal graph found by a greedy search can potentially be very different from better-fitting, but never-discovered, graphs [3] [15]. Regardless of whether a search finds the optimal graph or not, if a single graph is inferred and used for all downstream analysis, that point estimate would not intrinsically report confidence in various estimated features, such as the topology of relationships among populations, the presence or absence of admixture events, and the intensity of those events. There might be many graphs that fit the data equally well, and we should have more confidence in features shared among many of them than we should in features that are only found in some of them; shared features are most likely signals in the data while those that rarely occur are most likely spurious. Analyses based on a single graph do not distinguish between features that are estimated with high confidence and those estimated with low confidence. While it is possible to generate a distribution of TreeMix graphs across independent analyses of bootstrap replicates, it is rarely done in practice.

Here, we provide an alternative to greedy searches. Based on a model similar to TreeMix and qpGraph, we develop a Bayesian approach to sample over the graph-space using a reversible jump Markov Chain Monte Carlo (MCMC) method. The method can identify the graph with the highest likelihood encountered by the MCMC sampler, thereby effectively working as a heuristic maximum-likelihood optimizer, or it can report several summaries of the posterior distribution of admixture graphs. For example, it can estimate the graph topology with the highest marginal posterior when integrated over admixture and divergence times as measured by occupancy in the MCMC sampler. A marginal posterior is computed in admixturegraph [16] as well, but the exhaustive search algorithm of admixturegraph finds the graph with the highest posterior—not the graph shape with the highest marginal posterior. A particular strength of our new method is that it circumvents the need to report a single best graph by allowing calculations of posterior probabilities of particular marginal relationships between populations. We consider three approaches for this: one based on simplifying admixture graphs into simpler structures, one based on summarizing shared topologies into a consensus graph, and one based on subgraph analysis. If the number of leaves in the considered subgraph is kept small, we will observe few distinct subgraphs with these leaves, and we can estimate a complete posterior distribution over these graphs. Sampling subgraphs from the space of full graphs allows us to incorporate information from other populations when exploring the relationship between a subset of the populations.

We illustrate the utility of our method using simulations and reanalyze a previously published genomic dataset of Siberians and Native Americans [17]. We use the method to revisit two important and controversial questions in the history of the peopling of the Americas. First, we analyze the origin of the Inuit and show that they are modeled best as an admixture between a population represented by the Saqqaq genome, and a population represented by Athabascans. Secondly, we show that Athabascans are best represented as admixed between a Native American population and a Siberian population most closely related to the Koryak, but not the Saqqaq.

Results

Method overview

We here present AdmixtureBayes, an MCMC algorithm for sampling admixture graphs from their posterior distribution, given a set of genetic data from multiple populations.

We begin by presenting our formal definition of an admixture graph. An admixture graph consists of a topology and a set of continuous parameters. The space of topologies for a given number of leaves, L, consists of all uniquely labeled directed acyclic graphs that fulfill the following conditions:

  1. There exists one and only one root, that is, a node with one parent (the outgroup) and exactly two children.

  2. The number of nodes of degree 1 is L + 1. L of these nodes have only one parent and are called leaves. 1 of these nodes is called the outgroup and has exactly one child, the root.

  3. If a node is not the root, a leaf, or the outgroup, it has either
    1. 1 parent and 2 children in which case we call it a divergence node.
    2. 2 parents and 1 child in which case we call it an admixture node.
  4. There are no eyes, i.e. the parent nodes of an admixture node are distinct (and the child nodes of a divergence node are distinct).

The labeling consists of

  1. All leaves and the outgroup are given a unique label.

  2. Parent edges of an admixture node can be either a ‘main’ branch or an ‘admixture’ branch. All admixture nodes have one parent edge of each type.

We do not label branches and nodes in general, meaning that even though the leaves are given a unique label, the leaves themselves are not unique. For example, switching the labels of two leaves that form a cherry in the graph would not change the graph topology. For a more formal definition, see the definition of topology in S1 Text. All branches have a length in the interval (0, ∞) and all admixture nodes are given an admixture proportion in the interval (0, 1).
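As an illustration of the structural rules above, the following minimal Python sketch classifies the nodes of a graph given as a list of (parent, child) edges and rejects graphs that break a rule. It is not taken from the AdmixtureBayes code base: the function name `classify_nodes`, the edge-list representation, and the example graph are assumptions made here for illustration. The sketch does not check acyclicity and does not model the ‘main’/‘admixture’ labels on the parent edges of admixture nodes.

```python
def classify_nodes(edges, outgroup):
    """Assign a role to every node, or raise ValueError if a rule is violated.

    edges:    list of (parent, child) pairs (hypothetical representation)
    outgroup: label of the outgroup node
    Note: acyclicity is NOT checked in this sketch.
    """
    # Rule 4 (no eyes): two parallel edges between the same pair of nodes
    # would give an admixture node identical parents (or a divergence node
    # identical children).
    if len(set(edges)) != len(edges):
        raise ValueError("eye: parallel edges between the same pair of nodes")

    parents, children = {}, {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        parents.setdefault(child, []).append(parent)

    roles = {}
    for node in set(parents) | set(children):
        shape = (len(parents.get(node, [])), len(children.get(node, [])))
        if node == outgroup:
            if shape != (0, 1):
                raise ValueError("outgroup must have no parent and one child")
            roles[node] = "outgroup"
        elif shape == (1, 0):
            roles[node] = "leaf"
        elif shape == (1, 2):
            # The root is the divergence-shaped node whose parent is the outgroup.
            roles[node] = "root" if parents[node] == [outgroup] else "divergence"
        elif shape == (2, 1):
            roles[node] = "admixture"
        else:
            raise ValueError(f"node {node!r} has invalid (parents, children) = {shape}")

    if list(roles.values()).count("root") != 1:
        raise ValueError("there must be exactly one root")
    return roles


# Hypothetical example: three leaves, one admixture node "a" above leaf P2.
edges = [
    ("out", "root"),
    ("root", "n1"), ("root", "n2"),
    ("n1", "P1"), ("n1", "a"),
    ("n2", "P3"), ("n2", "a"),
    ("a", "P2"),
]
roles = classify_nodes(edges, "out")
```

In this example the L = 3 leaves plus the outgroup give the L + 1 nodes of degree 1 required by rule 2.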

The Methods section describes our implementation of a Markov Chain Monte Carlo (MCMC) algorithm, AdmixtureBayes, which samples admixture graphs from their posterior distribution. We summarize genetic data from multiple populations as a matrix that captures how allele frequencies in the data covary between populations. AdmixtureBayes samples graphs that explain this covariance matrix. The topology of any sampled graph captures the relationships between samples as a mixture of the graphically structured covariance matrices. Branch lengths capture the amount of genetic divergence between populations, measured by drift, and admixture events explain shared allelic covariance between otherwise independently evolving populations. As a property of the MCMC algorithm, each graph is sampled at a frequency corresponding to its posterior probability. AdmixtureBayes is available to use at https://github.com/avaughn271/AdmixtureBayes.
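To make the idea of a covariance summary concrete, the sketch below computes a simple empirical second-moment matrix of allele frequencies after subtracting the outgroup frequency at each SNP. This is a simplified illustration only: the function name `allele_covariance` and the input layout are assumptions, and any scaling or bias corrections that AdmixtureBayes applies are not reproduced here. Under a pure drift model the outgroup-centered differences have mean zero, so the sketch averages products without subtracting a sample mean.

```python
def allele_covariance(freqs, outgroup):
    """Empirical covariance of outgroup-centered allele frequencies.

    freqs:    dict mapping population name -> list of allele frequencies,
              one entry per SNP (hypothetical input layout)
    outgroup: name of the outgroup population
    Returns (population_order, matrix) with the matrix as nested lists.
    """
    pops = [p for p in freqs if p != outgroup]
    n_snps = len(freqs[outgroup])

    # Center every population on the outgroup frequency, SNP by SNP.
    diffs = {
        p: [freqs[p][s] - freqs[outgroup][s] for s in range(n_snps)]
        for p in pops
    }

    # Average the products over SNPs (assumes zero-mean differences).
    cov = [
        [sum(a * b for a, b in zip(diffs[p], diffs[q])) / n_snps for q in pops]
        for p in pops
    ]
    return pops, cov


# Toy data: two SNPs, two populations plus an outgroup.
toy = {"out": [0.5, 0.5], "A": [0.6, 0.4], "B": [0.7, 0.3]}
pops, cov = allele_covariance(toy, "out")
```

Each entry of the resulting matrix measures how strongly two populations share allele frequency changes relative to the outgroup, which is the quantity an admixture graph is asked to explain.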

Comparisons with TreeMix and OrientAGraph

We compared the accuracy of AdmixtureBayes to TreeMix and OrientAGraph on 4 distinct admixture graphs, shown in Fig 1. We simulated datasets from each of these admixture graphs in msprime [18] by using the add_population_split and add_admixture options and adjusting event times and population sizes until the allele frequency drift terms matched those of the admixture graph.

Fig 1. The graphs G1, G2, G3, and G4 used for the comparisons between methods.

G1 and G2 are not based on any real dataset, but the branch lengths are chosen to have human-like values. Out was used as the outgroup for both graphs. G3 is based on M1 from Molloy et al. (2021), the graph that motivated the development of the MLNO approach of OrientAGraph. We have changed some of the branch lengths. popE was used as the outgroup. G4 is based on Model M7 from Fig 3 of Molloy et al. (2021), which is in turn based on Fig 7a from Wu (2020) [19]. The populations ITU, JPT, and ASW have been removed. The YRI population was used as the outgroup. For all graphs, as in Molloy et al. (2021), branch lengths are not shown to scale and are shown multiplied by 1000. Divergence nodes are shown as circles. Admixture nodes are shown as rectangles. The fractions inside the admixture nodes denote the contribution from the population represented by the dashed line.

We then analyzed all simulated datasets with AdmixtureBayes, TreeMix, and OrientAGraph (see the section “Running AdmixtureBayes, TreeMix, and OrientAGraph” for details). Comparing their accuracy is not straightforward because TreeMix and OrientAGraph produce one graph whereas AdmixtureBayes produces posterior samples of graphs. In addition, TreeMix and OrientAGraph assume a fixed number of admixture events, whereas AdmixtureBayes samples graphs with different numbers of admixture events. We ran TreeMix and OrientAGraph conditioned on the true number of admixture events, while we considered all graphs produced by AdmixtureBayes, even those with the wrong number of admixture events. We note that this could increase the error of AdmixtureBayes. Furthermore, both TreeMix and OrientAGraph allow admixture involving the branch to the outgroup, which AdmixtureBayes does not. The extent to which this was a problem varied between simulation models, so we handled this on a case-by-case basis. We used three metrics to compare the graphs inferred by these methods to the true underlying admixture graph. The Topology Equality is a simple metric that is 1 if the inferred graph has the same topology as the true graph and 0 otherwise. The next metric we considered is the Covariance Distance, defined as the Frobenius distance between the allelic covariance matrix of the true graph and the allelic covariance matrix of the inferred graph (see Methods). Finally, we measured the Set Distance, which we defined as a topological distance measure similar to the Robinson-Foulds metric (S9 Fig; Methods section).
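The Covariance Distance can be illustrated with a few lines of Python. The Frobenius distance between two matrices is the square root of the sum of squared entrywise differences; the function name `covariance_distance` and the nested-list matrix representation are assumptions made for this sketch, and both matrices are assumed to list the populations in the same order.

```python
import math


def covariance_distance(cov_true, cov_inferred):
    """Frobenius distance between two equally sized matrices (nested lists)."""
    return math.sqrt(
        sum(
            (a - b) ** 2
            for row_true, row_inf in zip(cov_true, cov_inferred)
            for a, b in zip(row_true, row_inf)
        )
    )


# Toy matrices differing by 3 and 4 in the off-diagonal entries.
distance = covariance_distance([[1, 0], [0, 1]], [[1, 3], [4, 1]])
```

Here the two off-diagonal differences are 3 and 4, so the distance is sqrt(9 + 16) = 5.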

For each of the 4 admixture graphs we analyzed (see Fig 1), we performed the following analysis: 20 independent datasets were simulated using msprime and all three methods were run on each dataset. Then, each of the three metrics was calculated for the results of each method. For AdmixtureBayes, we measured both the accuracy of the sampled graph with the highest posterior (we call this the AdmixtureBayes Mode) and the mean accuracy of a graph sampled from the posterior (we call this the AdmixtureBayes Mean). We plot the values of these metrics across the 20 datasets as boxplots in Fig 2. We also highlight that an excellent comparison of TreeMix, OrientAGraph, and miqograph was done in Molloy et al. [3], which both illustrated OrientAGraph’s ability to infer topologies TreeMix could not and demonstrated that miqograph was unable to infer topologies with deep admixture events.
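The distinction between the AdmixtureBayes Mode and the AdmixtureBayes Mean can be sketched as follows, assuming each posterior sample has already been scored with one of the metrics. The function name `mode_and_mean` and the (log posterior, metric value) pair layout are hypothetical and not part of the AdmixtureBayes output format.

```python
def mode_and_mean(samples):
    """Summarize a metric over posterior samples.

    samples: list of (log_posterior, metric_value) pairs (hypothetical layout)
    Returns (metric of the highest-posterior sample, mean metric over samples).
    """
    mode_metric = max(samples, key=lambda s: s[0])[1]
    mean_metric = sum(metric for _, metric in samples) / len(samples)
    return mode_metric, mean_metric


# Toy example: the highest-posterior sample has a Set Distance of 0.
toy_samples = [(-10.0, 0.0), (-12.0, 2.0), (-11.0, 4.0)]
mode_metric, mean_metric = mode_and_mean(toy_samples)
```

The Mode value reflects a single graph, whereas the Mean value also counts lower-posterior graphs and can therefore be worse even when the Mode is exactly right.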

Fig 2. We here plot the results of our method comparison with TreeMix and OrientAGraph.

For each of the graphs in Fig 1, we simulated 20 datasets and ran each method on each dataset. We compared the accuracy of each method with the 3 statistics discussed in the section Comparisons with TreeMix and OrientAGraph. For AdmixtureBayes, we examined both the Mode graph (the sampled graph with the highest posterior) and the mean value of the statistics when 100 graphs are sampled from the posterior (we refer to this as the AdmixtureBayes Mean). TreeMix and OrientAGraph allow admixture involving the outgroup, an error which AdmixtureBayes is not allowed to make. For fairness, we only plot the results for the graphs not involving admixture with the outgroup. We have listed the number of datasets that resulted in such graphs in parentheses next to the method name on the x-axes. The Topology Equality statistic for TreeMix, OrientAGraph, and the AdmixtureBayes Mode can only be 0 or 1, so we plot a horizontal line at the mean value over the datasets, rather than a true boxplot.

On graph G1, which contains 1 admixture event, all methods perform similarly well. The correct topology was inferred by all methods on all datasets (giving a Set Distance value of 0), and the accuracy of the covariance matrix implied by each of the inferred graphs (as measured by the Covariance Distance) is quite similar.

On graph G2, which contains no admixture events, TreeMix and OrientAGraph are able to infer the correct topology for all 20 datasets. The Mode estimate of AdmixtureBayes also infers the correct topology in all cases. For all datasets, the AdmixtureBayes Mean topologies are highly concentrated on the true topology, though there is some variation. This is to be expected given the inherent noise in the data. It is also worth noting that the incorrectly inferred topologies sampled by AdmixtureBayes may include graphs with an admixture event, an error which we do not allow TreeMix and OrientAGraph to make as we run them with the correct number of admixture events (zero). We note that the AdmixtureBayes Covariance Distance is slightly larger than the TreeMix and OrientAGraph distances. This is to be expected as both of those methods explicitly perform optimization on branch lengths and admixture proportions, which will likely result in a better model fit than the graph AdmixtureBayes samples that happens to have the highest posterior.

On graph G3, which has one admixture event, TreeMix does quite poorly. This is by design, as G3 is based on Model M1 from Molloy et al. [3], which motivated the development of OrientAGraph. In particular, TreeMix incorrectly infers an admixture event involving the outgroup in 17 out of the 20 datasets. Of the 3 remaining datasets, TreeMix was only able to infer the correct topology for 2 of them. We only plot the accuracy statistics for the 3 graphs that do not involve admixture with the outgroup as these are the only graphs that exist in the same state space as AdmixtureBayes. However, we highlight that the boxplots in Fig 2 do not necessarily represent all simulated datasets.

In contrast to TreeMix, OrientAGraph never infers admixture involving the outgroup and infers the correct topology in almost 80% of all datasets. AdmixtureBayes, however, outperforms both methods by inferring the correct topology for all datasets, both using the Mode estimate and the Mean estimate. We attribute this to a superior framework for exploring the state space of topologies. We still note that TreeMix and OrientAGraph provide better estimates of branch lengths and admixture proportions, which we again attribute to the fact that AdmixtureBayes is not designed for optimizing the likelihood function for branch lengths but instead provides posterior distributions. If point estimates for branch lengths are of interest, we recommend that users optimize the branch lengths using other methods with the AdmixtureBayes Mode topology fixed.

Graph G4 represents a very complicated topology and is based on a model used by Molloy et al. [3] to represent the shortcomings of OrientAGraph. TreeMix incorrectly infers an admixture involving the outgroup for all datasets, so we do not plot the results from running TreeMix. OrientAGraph incorrectly infers an admixture involving the outgroup for 14 datasets, leaving 6 datasets to compare with AdmixtureBayes. We see that OrientAGraph never infers the correct topology and never has a Set Distance of less than 4. In contrast, the AdmixtureBayes Mode estimate represents the correct topology for more than half of all datasets, which we again attribute to a superior framework for exploring the state space of topologies. The AdmixtureBayes Mean estimates are fairly noisy, but still represent a posterior distribution that is often concentrated on the true topology. The optimization employed by OrientAGraph results in a lower Covariance Distance than AdmixtureBayes, even in the presence of an incorrect topology. Performing a similar optimization on the AdmixtureBayes Mode topology will likely yield a smaller Covariance Distance if a point estimate of an admixture graph with branch lengths is desired. From these results, we conclude that the MCMC framework of AdmixtureBayes provides an effective algorithm for searching through the topology space of admixture graphs and often infers the correct topology when other methods do not. All of the scripts used to run these simulations can be found in the SimulationStudy folder on the AdmixtureBayes GitHub.

Exploring the genetic history of Saqqaq, Inuit and Native Americans

In the simulation section above, we demonstrated that AdmixtureBayes includes an effective algorithm for exploring the space of admixture graphs. However, the real advantage of the method is in its ability to quantify probabilities of graphs and subgraphs, and thereby to provide measures of statistical uncertainty. We illustrate the utility of the method on a set of previously published Siberian and Native American samples [17] to explore the relationship between Siberian Chukotko-Kamchatcan speakers (Koryak), an ancient individual from the extinct Saqqaq culture (Saqqaq), Inuit-Yupik-Unangan speakers (Greenlandic Inuit), and Na-Dene speakers (Athabascan). The dataset also contained North and South Americans (Anzick, Aymara) and various other groups. We chose the Yoruba population as the outgroup. Running time of AdmixtureBayes was 50 hours in parallel on 32 cores.

To extract information from the posterior distribution of admixture graph topologies, we introduce two ways of summarizing relationships among sets of focal populations (for details, see Methods). Both are based on summarizing each sampled admixture graph in the posterior into a topology set, which is the set of all nodes labeled by their descendants. This discards information about the number of and timing of admixture events (see S9 Fig). From such a topology set, we can create the minimal topology, which is the ‘simplest’ directed graph yielding the same topology set (see S10 Fig). The two minimal topologies with the highest posterior probabilities are shown in Fig 3. We also considered the frequency of each internal node across posterior samples. In Fig 3 these frequencies are denoted as percentages in parentheses in each node. The second summary of the admixture graph sample is the set of nodes with a frequency higher than α in the topology sets, which we denote as the consensus graph at threshold α. S6 Fig shows this summary for α = 0.75.
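The two summaries can be sketched in Python as follows. This is a simplified illustration, not the AdmixtureBayes implementation: the function names, the edge-list representation, and the toy graphs are assumptions, and the sketch records the descendant leaf set of every non-leaf node, whereas the exact conventions are those given in Methods and S9 Fig.

```python
from collections import Counter


def topology_set(edges, leaves):
    """Return the set of descendant-leaf sets, one per non-leaf node.

    edges:  list of (parent, child) pairs (hypothetical representation)
    leaves: set of leaf labels
    """
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    cache = {}

    def descendants(node):
        # Leaves descend only to themselves; other nodes take the union
        # over their children.
        if node not in cache:
            if node in leaves:
                cache[node] = frozenset([node])
            else:
                cache[node] = frozenset().union(
                    *(descendants(c) for c in children.get(node, []))
                )
        return cache[node]

    return {descendants(node) for node in children}


def consensus_nodes(topology_sets, alpha):
    """Nodes (as leaf sets) whose frequency across samples exceeds alpha."""
    counts = Counter(node for ts in topology_sets for node in ts)
    n = len(topology_sets)
    return {node for node, c in counts.items() if c / n > alpha}


leaves = {"P1", "P2", "P3"}
# Toy sample 1: P2 is admixed between the P1 and P3 lineages.
admixed = [("out", "root"), ("root", "n1"), ("root", "n2"), ("n1", "P1"),
           ("n1", "a"), ("n2", "P3"), ("n2", "a"), ("a", "P2")]
# Toy samples 2-4: the tree ((P1, P2), P3).
tree = [("out", "root"), ("root", "n1"), ("root", "P3"),
        ("n1", "P1"), ("n1", "P2")]

sets = [topology_set(admixed, leaves)] + [topology_set(tree, leaves)] * 3
consensus = consensus_nodes(sets, alpha=0.75)
```

In this toy posterior the node with descendants {P1, P2} appears in all four samples and is kept at α = 0.75, whereas the node with descendants {P2, P3} appears only in the admixed sample and is dropped.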

Fig 3. The two minimal topologies with the highest posterior probabilities in our real dataset.

Leaf nodes that are the product of an admixture event are shown in purple. Leaf nodes that are not the product of an admixture event are shown in light blue. The root is shown in black. Each inner node is colored according to the posterior probability that the true graph has a node with the same descendants. Higher probabilities have a darker shade of green. The posterior probability is written as a percentage in parentheses inside each node, next to the node name, which is arbitrary. The left graph has a posterior probability of 32%. The right graph has a posterior probability of 19%.

While no single graph received high support when including all data, we can extract subgraphs that are informative about the relationships between specific subsets of populations. With AdmixtureBayes, it is possible to consider the relative support, in terms of posterior probability, of individual subgraphs. Analyzing the support for subgraphs within the context of a larger admixture graph has an advantage over analyses limited to the focal populations represented in the subgraph: information from other populations can be taken into account directly.

There has been considerable debate about the relationships between populations represented by the Koryak, Saqqaq, Greenlanders, and the Athabascans. Archaeological evidence suggests that the Inuit people from Greenland and people from the now extinct Saqqaq culture represent independent migrations into the Americas from Eastern Siberia and the area around the Bering Strait [20] [21] [22]. However, there is some debate about the origin of the Athabascans [21] [23] [24] [25]. Most molecular evidence of Athabascan ancestry is thought to have originated from the first migration of people into the Americas that also gave rise to most other Native American groups, such as the indigenous people in Central and South America. However, some portion of genetic variation in Athabascans seems to have also originated from other groups, perhaps related to Inuit, Saqqaq, or other Siberians such as the Koryak. Naming and identifying sources of genetic variation is further complicated by the fact that these possible reference populations may themselves be admixed. A marginal analysis of the relationship between Koryak, Saqqaq, Greenlanders, and Athabascans that can take gene flow from other groups into account is therefore highly desirable.

S7 and S8 Figs depict the subgraphs for different subsets of these groups and for all groups together, extracted from the posterior distribution of graphs from the full dataset. The most strongly supported subgraph for Saqqaq, Athabascan, and Koryak supports the tree ((Athabascan, Koryak), Saqqaq) with 96% posterior probability. This implies that the data do not support a relationship in which the gene flow into Athabascans came from a population closer to the Saqqaq than to the Koryak from Siberia. In contrast, when considering the relationship between Koryak, Athabascans and the Inuit Greenlanders, the most strongly supported admixture graph is a tree with the structure ((Athabascan, Greenlander), Koryak), likely reflecting gene flow into the Inuit Greenlander from Native Americans related to Athabascans.

We emphasize that in these inferences, by analyzing the posterior probability of subgraphs embedded within larger graphs, we have also explicitly modeled the effects of gene flow from other groups including various Siberian, Native American, and East Asian groups. When considering all four populations together, the Greenlanders are best modeled as a population admixed between Athabascan related populations and Saqqaq related populations. Again, there is no apparent gene flow between the Saqqaq and the Athabascans following their initial divergence. We also ran TreeMix and OrientAGraph on our dataset, each with a varying number of admixture events. We plot the results in S16 Fig. All of the scripts used to run the AdmixtureBayes analysis can be found in the RealDataAnalysis folder on the AdmixtureBayes GitHub, and the TreeMix and OrientAGraph results can be found in the OtherMethodsRealData folder.

Discussion

We here present the program AdmixtureBayes, which is a method for inferring admixture graphs using MCMC. On simulated data, it infers graph topologies more accurately than both TreeMix and OrientAGraph, likely because these methods get stuck in locally optimal topologies during the admixture edge addition process. However, we also note that even with AdmixtureBayes, the correct topology is not always inferred, suggesting that the reporting of a single “best graph” may not necessarily be best practice. As is common in phylogenetics, admixture graphs should report measures of statistical confidence for the relationships inferred among internal nodes in the graph, as is reported in this paper.

We also encourage the use of embedded subgraphs as a powerful approach for investigating the relationship between specific populations while taking gene flow from other reference populations into account, as was done in S7 and S8 Figs. The use of posterior probabilities, as reported here, is facilitated by the use of a bootstrap procedure that can estimate the effective number of independent SNPs. In our real data analysis, we obtained information from human genomes corresponding to approximately 40,000 independent SNPs. This number determines the peakedness of the likelihood surface, which directly influences the posterior distribution of admixture graphs. TreeMix and qpGraph employ similar resampling techniques to obtain variance estimates that control the peakedness of their likelihood surfaces.

Our analysis of Native American and Siberian samples largely recapitulates many previous analyses and identifies many admixture events [17]. Furthermore, we find a topology that is similar, but not identical, to a previously published admixture topology [17]. However, our results also indicate that several features of the true admixture graph remain uncertain. For example, we could not definitively resolve the question of introgression into the Han lineage from the ancestral lineage of Ust’-Ishim. Our analysis does not support previous claims that the Saqqaq culture is a good proxy for the source of gene flow into Athabascans [21] [24], although statistical power could still be improved.

In both our analysis and previous work, each population is represented by just one or two diploid individuals. Our simulations suggest that increasing the number of individuals per population might lead to substantially improved statistical accuracy (see S15 Fig). In addition, adding more populations, both modern and ancient, could change the results presented here, as there may be some ancestral components that are not adequately represented by the data we use in this analysis. We also note that the sample quality was relatively poor for some samples analyzed here, particularly the Saqqaq, which has many missing sites.

It is also worth noting that there are many other diverse fields such as linguistics, archaeology, and ethnography that seek to understand the historical relationships between different populations. While we do not incorporate data from these fields into this study, we do think that the results we present here are an important contribution towards clarifying the genetic evidence by improving on algorithms for admixture graph inference and correcting results that may have been caused by suboptimal optimization algorithms. However, we emphasize that the genetic evidence should be used in concert with other diverse fields in order to obtain an accurate picture of the historical movement and cultural development in the Arctic region and that the results in this paper should not, without further context, be used to infer cultural history.

The estimation of admixture graphs is becoming one of the most important tools in population genomics. However, methods for estimating such graphs are still in their infancy. AdmixtureBayes provides a step towards improved estimation and more rigorous quantification of statistical uncertainty in admixture graph inference.

Methods

AdmixtureBayes model

The AdmixtureBayes program searches the posterior distribution of admixture graphs given observed SNP data using a Markov Chain Monte Carlo procedure. To assess the likelihood of an admixture graph we summarize both the admixture graph and the data as covariance matrices of allele frequency changes [2]. The admixture graph covariance matrix is calculated as in TreeMix. Consider the tree structure in Fig 4 where population 2 is a mix of two ancestral populations with proportions w and 1 − w.

Fig 4. An admixture graph for the 3 populations and one outgroup.

Considering a single SNP, the quantities x1, …, x7 are changes in allele frequency, w is the admixture proportion, and P0, P1, P2 and P3 are allele frequencies in the sampled populations. Note that the edge to the outgroup (labeled with x0) is not given a direction. This is because the Gaussian drift model is reversible, meaning that the population split between the outgroup and the other populations could have happened at any point along this branch and identical allelic covariance matrices would be produced. For simplicity, we model the outgroup as the parent of the root node, as described in Method Overview.

The allele frequencies in the 4 populations, P0, P1, P2 and P3, are related through the allele frequency changes x0, …, x7 at any SNP:

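$$\begin{pmatrix} P_1 \\ P_2 \\ P_3 \end{pmatrix} - \begin{pmatrix} P_0 \\ P_0 \\ P_0 \end{pmatrix} = \begin{pmatrix} x_0 + x_1 + x_2 \\ x_0 + x_7 + w(x_6 + x_4) + (1 - w)(x_5 + x_2) \\ x_0 + x_3 + x_4 \end{pmatrix} = A \begin{pmatrix} x_0 \\ \vdots \\ x_7 \end{pmatrix}. \tag{1}$$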

Notice that $A$ is a matrix that only depends on the admixture graph through the graph structure and admixture proportions. We consider the vector of allele frequency drift terms $(x_0, \dots, x_7)^{\top}$ to be stochastic because it depends on a random sample of SNPs. In the neutral Wright-Fisher model, changes in allele frequencies due to genetic drift can be approximated by a normal distribution when the allele frequency change is small and the frequency is far from the boundaries at 0 and 1. If $x_i$ is the amount of drift from a node with allele frequency $p_i$, then the allele frequency change can be approximated as $x_i \sim N\left(0, (1 - e^{-d_i})\, p_i (1 - p_i)\right)$, where $d_i = t_i / 2N_i$ is the number of generations scaled with the population size [6]. We collect the factor $(1 - e^{-d_i})$ into a single factor $c_i$ and substitute the node-specific $p_i$ with a SNP-global $p$, giving the tractable, approximate expression

$$x_i \sim N\left(0, c_i\, p (1 - p)\right).$$

Consequently, we can approximate the joint distribution of allele frequencies at all leaf nodes as

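$$\begin{pmatrix} P_1 - P_0 \\ P_2 - P_0 \\ P_3 - P_0 \end{pmatrix} \sim N\left(0, p(1 - p)\Sigma\right), \qquad \Sigma = A \cdot \mathrm{diag}(c_0, \dots, c_7) \cdot A^{\top} \tag{2}$$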

where matrix Σ is called the admixture graph covariance matrix.
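As an illustration of Eqs (1) and (2), the following minimal sketch builds the matrix A for the graph in Fig 4 and evaluates Σ = A·diag(c)·Aᵀ. It is not taken from the AdmixtureBayes code base: the function names `fig4_design_matrix` and `graph_covariance` are hypothetical, and the values of w and c are arbitrary numbers chosen only for the example.

```python
def fig4_design_matrix(w):
    """Matrix A of Eq (1). Rows: P1-P0, P2-P0, P3-P0; columns: drift terms x0..x7."""
    return [
        [1, 0 + 1, 1, 0, 0, 0, 0, 0],          # x0 + x1 + x2
        [1, 0, 1 - w, 0, w, 1 - w, w, 1],      # x0 + x7 + w(x6 + x4) + (1 - w)(x5 + x2)
        [1, 0, 0, 1, 1, 0, 0, 0],              # x0 + x3 + x4
    ]


def graph_covariance(A, c):
    """Admixture graph covariance matrix of Eq (2): Sigma = A diag(c) A^T."""
    n_rows, n_edges = len(A), len(c)
    return [
        [sum(A[i][e] * c[e] * A[j][e] for e in range(n_edges)) for j in range(n_rows)]
        for i in range(n_rows)
    ]


# Arbitrary illustrative values (assumptions, not estimates from the paper).
w = 0.25
c = [0.10, 0.02, 0.03, 0.04, 0.05, 0.01, 0.02, 0.03]  # c0..c7
sigma = graph_covariance(fig4_design_matrix(w), c)

# Populations 1 and 3 share only the outgroup branch x0, so their covariance is c0.
# Populations 1 and 2 additionally share x2 with weight (1 - w).
for row in sigma:
    print(["%.4f" % value for value in row])
```

Because A depends only on the topology and the admixture proportions, changing a branch length c_i changes Σ without rebuilding A.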

The empirical estimate of the covariance of allele frequencies is denoted the data covariance matrix. In real data we never observe the population allele frequencies but rather the sample allele frequencies. This complicates the computation of the data covariance matrix slightly. Let pij be the sample allele frequency in the i’th population at the j’th SNP, i = 0, 1, …, n, j = 1, …, N. They are assumed to come from the distribution

$$p_{ij} \sim \frac{1}{m_{ij}} \mathrm{Bin}(m_{ij}, P_{ij}) \tag{3}$$

where mij is the number of haplotypes sampled and Pij is the population allele frequency.

Denote population i = 0 an outgroup, and consider the intuitive estimate of the covariance matrix

$$S_{k,l} = \frac{1}{N} \sum_{j=1}^{N} (p_{kj} - p_{0j})(p_{lj} - p_{0j}) \tag{4}$$

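If there are any missing values in a summand, we leave that summand out of the sum. Regardless of missing values, (4) is inherently biased because the inner term $(p_{kj} - p_{0j})(p_{lj} - p_{0j})$ does not have the same mean as $(P_{kj} - P_{0j})(P_{lj} - P_{0j})$. From (3) we calculate the difference as

$$\mathbb{1}\{k = l\}\, \frac{P_{kj}(1 - P_{kj})}{m_{ij}} + \frac{P_{0j}(1 - P_{0j})}{m_{ij}}$$

which suggests the following bias correction term for $S_{k,l}$:

$$\hat{B}_{kl} = \mathbb{1}\{k = l\}\, \frac{1}{N} \sum_{j=1}^{N} \frac{p_{kj}(1 - p_{kj})}{m_{ij} - 1} + \frac{1}{N} \sum_{j=1}^{N} \frac{p_{0j}(1 - p_{0j})}{m_{ij} - 1}.$$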

After correcting, we normalize with

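$$\hat{h} = \frac{1}{N} \sum_{j=1}^{N} \bar{p}_j (1 - \bar{p}_j), \quad \text{where} \quad \bar{p}_j = \frac{1}{n + 1} \sum_{i=0}^{n} p_{ij}$$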

to take the factor p(1 − p) from (2) into account.
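The estimator in (4), its bias correction, and the normalizer can be sketched as follows. This is a simplified illustration, not the AdmixtureBayes implementation: the function name `data_covariance` is hypothetical, the sketch assumes there are no missing values, and it reads the denominators in the bias term as the haplotype counts of population k and of the outgroup, respectively, which is our interpretation of the notation.

```python
def data_covariance(freqs, counts):
    """Return (S, B_hat, h_hat) for sample allele frequencies.

    freqs[i][j]  : sample allele frequency of population i at SNP j
    counts[i][j] : number of sampled haplotypes (must be at least 2)
    Population 0 is the outgroup. Assumes complete data (no missing values).
    """
    n_pops = len(freqs)
    n_snps = len(freqs[0])

    def bias(i):
        # p(1 - p) / (m - 1) is an unbiased estimate of the sampling variance P(1 - P) / m.
        return sum(p * (1 - p) / (m - 1) for p, m in zip(freqs[i], counts[i])) / n_snps

    # Eq (4): covariance of frequency differences relative to the outgroup.
    S = [[sum((freqs[k][j] - freqs[0][j]) * (freqs[l][j] - freqs[0][j])
              for j in range(n_snps)) / n_snps
          for l in range(1, n_pops)]
         for k in range(1, n_pops)]

    # Bias term: the outgroup contributes to every entry, population k only to the diagonal.
    outgroup_bias = bias(0)
    B_hat = [[outgroup_bias + (bias(k) if k == l else 0.0)
              for l in range(1, n_pops)]
             for k in range(1, n_pops)]

    # Normalizer: average heterozygosity-like term over SNPs.
    p_bar = [sum(freqs[i][j] for i in range(n_pops)) / n_pops for j in range(n_snps)]
    h_hat = sum(p * (1 - p) for p in p_bar) / n_snps
    return S, B_hat, h_hat


# Toy input (made-up numbers): outgroup plus two populations, two SNPs, four haplotypes each.
freqs = [[0.5, 0.0], [0.75, 0.5], [0.25, 0.5]]
counts = [[4, 4], [4, 4], [4, 4]]
S, B_hat, h_hat = data_covariance(freqs, counts)
```

The quantities entering the likelihood are then S divided by the normalizer on the data side and the bias term divided by the normalizer added to Σ on the model side.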

If the sample allele frequencies were normally distributed and independent across markers, the estimator in (4) would be Wishart distributed and the degrees of freedom would be the number of markers. The sample allele frequencies are not independent and only approximately normal, yet we use the likelihood

$$W\left(S/\hat{h};\ \Sigma + \hat{B}/\hat{h},\ df\right). \tag{5}$$

The degrees of freedom, df, is adjusted to take into account the lack of independence. We estimate df using $R$ bootstrapped replicates of $S/\hat{h}$, which we will denote $X^{(1)}, \dots, X^{(R)}$. Let $\bar{X}$ be the average of the bootstrap samples. It would be natural to estimate the df with the maximum likelihood of the model

$$X^{(1)}, \dots, X^{(R)} \sim W(\bar{X}, df) \tag{6}$$

However, simulations show that the estimates of df from (6) give results that are less accurate than the following moment-based estimator (see S14 Fig). We take advantage of the fact that the variance of the (k, l)’th entry of a Wishart distribution with scale matrix Ψ/df (and therefore mean Ψ) and degrees of freedom, df, is

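$$\frac{1}{df}\left(\Psi_{kl}^2 + \Psi_{kk}\Psi_{ll}\right)$$

to estimate the df as

$$\underset{df}{\arg\min} \sum_{k=1}^{n} \sum_{l=1}^{n} \left( \widehat{\mathrm{Var}}\left(X_{kl}^{(1)}, \dots, X_{kl}^{(R)}\right) - \frac{1}{df}\left(\bar{X}_{kl}^2 + \bar{X}_{kk}\bar{X}_{ll}\right) \right)^2 \tag{7}$$

where $\widehat{\mathrm{Var}}$ is the sample variance. This moment-based estimator leads to better performance of AdmixtureBayes (S14 Fig).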
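The objective in (7) is quadratic in 1/df, so its minimizer can be written in closed form: writing v_kl for the sample variance and a_kl for the bracketed term built from the bootstrap mean, the minimizer is df = Σ a_kl² / Σ v_kl a_kl. This closed form is our own observation and is not claimed to be how AdmixtureBayes computes the estimate; the sketch below, with the hypothetical function name `estimate_df`, simply evaluates it for a list of bootstrap replicates.

```python
def sample_variance(values):
    """Unbiased sample variance (denominator R - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def estimate_df(replicates):
    """Moment-based degrees-of-freedom estimate minimizing the objective in Eq (7).

    replicates: list of R square matrices (lists of lists), the bootstrap
    replicates of the normalized data covariance matrix.
    """
    R = len(replicates)
    n = len(replicates[0])
    mean = [[sum(X[k][l] for X in replicates) / R for l in range(n)] for k in range(n)]

    sum_aa = 0.0  # sum of a_kl^2
    sum_va = 0.0  # sum of v_kl * a_kl
    for k in range(n):
        for l in range(n):
            a = mean[k][l] ** 2 + mean[k][k] * mean[l][l]
            v = sample_variance([X[k][l] for X in replicates])
            sum_aa += a * a
            sum_va += v * a
    # Least squares in u = 1/df gives u = sum_va / sum_aa, hence df = sum_aa / sum_va.
    return sum_aa / sum_va


# Toy check with two 1x1 "matrices": mean 2, sample variance 2, so a = 8 and df = 64 / 16.
df_hat = estimate_df([[[1.0]], [[3.0]]])
print(df_hat)  # 4.0
```

A numerical optimizer over df would reach the same value whenever the sums above are positive.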

In practice, to make the inference more robust to deviations from the prior, we normalize the matrices by using the likelihood

$$W\left(c_S S/\hat{h};\ c_S\left(\Sigma + \hat{B}/\hat{h}\right),\ df\right) \tag{8}$$

where $c_S = \left(\log_2(L)\,L + L\right) / \mathrm{tr}(S/\hat{h})$. For more on this, see the section “Robustness correction” in S1 Text.

Prior

We define a prior on the topology, $G$, and on the continuous parameters of the admixture graph. The continuous parameters include the branch lengths, $c = (c_1, \dots, c_D)$, and the admixture proportions, $w = (w_1, \dots, w_K)$. Let $K$ denote the number of admixture events, $L$ the number of leaves, and $D = 2L - 2 + 3K$ the number of branches. The full prior is then

P(G,c,w)=P(G|K)P(K)P(c|K)P(w|K).

The prior on the number of admixture events is a geometric distribution with parameter 0.5 (truncated to max 20). The prior on G, P(G|K), is a uniform prior on all labeled admixture graphs with K admixture events. To evaluate this prior, we need to calculate the number of possible topologies for a given number of admixture events. Therefore we have derived the recurrence formula

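$$\begin{aligned}
N(L,P,K,E) ={} & 2(E+1)\,N(L-1,P,K,E+1) \\
& + (L-2P+1)\,N(L-1,P-1,K,E) \\
& + (L+2P+3K-2E-2)\,N(L-1,P,K,E) \\
& + \frac{2(P+1)}{L(L+1)}\,N(L+1,P+1,K-1,E-1) \\
& + \frac{4(P+1)(P+2)}{L(L+1)}\,N(L+1,P+2,K-1,E) \\
& + \frac{4(P+1)(L-2P-1)}{L(L+1)}\,N(L+1,P+1,K-1,E) \\
& + \frac{(L-2P)(L-2P+1)}{L(L+1)}\,N(L+1,P,K-1,E)
\end{aligned}$$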

where L is the number of leaves, P is the number of pairs of leaves that share a common parent, K is the number of admixture events, E is the number of eyes, and N(L, P, K, E) is the number of unique topologies with those attributes. Notice that we here allow eyes which otherwise are disallowed in our definition of admixture graphs. See S1 Text for proof. Then

$$P(G \mid K) = \frac{1}{\sum_{P=0}^{\lfloor L/2 \rfloor} N(L,P,K,0)}$$

For the admixture proportion prior, P(w|K), we chose the uniform distribution on the interval (0, 1).

For the prior on the branch lengths, P(c|K), we chose to let all branch lengths be independent and marginally follow the distribution

$$c_i \sim \mathrm{Exp}\left(\frac{2L - 2}{D}\right), \quad i = 1, \dots, D \tag{9}$$

The rate of the exponential prior adapts to the topology such that graphs with many branches, and thereby many admixture events, are expected to have smaller branch lengths. For motivation, see the section on Robustness correction in S1 Text.
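The discrete parts of the prior are simple to evaluate. The following sketch computes the truncated geometric prior on the number of admixture events, the number of branches D, and the parameter value (2L − 2)/D appearing in (9). The function names are hypothetical, and the sketch assumes that the geometric distribution is supported on K = 0, 1, …, 20; the topology term P(G|K) is omitted because it requires the recurrence and its base cases from S1 Text.

```python
MAX_ADMIXTURES = 20  # truncation point stated in the text


def prior_num_admixtures(k, p=0.5, k_max=MAX_ADMIXTURES):
    """Truncated geometric prior P(K = k), assuming support k = 0, 1, ..., k_max."""
    if not 0 <= k <= k_max:
        return 0.0
    normalizer = 1.0 - (1.0 - p) ** (k_max + 1)
    return p * (1.0 - p) ** k / normalizer


def num_branches(num_leaves, k):
    """D = 2L - 2 + 3K."""
    return 2 * num_leaves - 2 + 3 * k


def branch_prior_parameter(num_leaves, k):
    """Parameter (2L - 2) / D of the exponential prior in Eq (9)."""
    return (2 * num_leaves - 2) / num_branches(num_leaves, k)


total = sum(prior_num_admixtures(k) for k in range(MAX_ADMIXTURES + 1))
print(total)
print(num_branches(11, 0), branch_prior_parameter(11, 0))
print(num_branches(11, 3), branch_prior_parameter(11, 3))
```

With no admixture events the parameter equals 1, and it decreases as admixture events, and therefore branches, are added.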

MCMC

The MCMC is implemented as a parallel Metropolis coupled MCMC algorithm [26] [27] to increase the number of jumps between local maxima of the posterior surface. Because admixture graphs with different numbers of admixture events also have different numbers of continuous parameters, we use the reversible jump generalization of the MCMC algorithm [28]. The proposal distribution is a mix of 7 smaller proposals. They are

  1. Add an admixture branch to the admixture graph. An admixture branch goes from a source branch to a sink branch (Fig 5). To make the proposal, a random sink branch, $s$, is chosen with probability $1/D$, where $D$ is the number of branches in the graph (not including the branch to the outgroup). Next, a random source branch, $s'$, is chosen from the remaining branches (including the root/outgroup branch) such that the addition of an admixture branch would not create a cycle in the graph. If the number of possible source branches is $D'(s)$, the probability of the source position is $1/D'(s)$. Next, the attachment point on the sink branch is simulated uniformly. If the branch lengths of $s$ and $s'$ are $c(s)$ and $c(s')$, the attachment outcome has density $1/\left(c(s)\,c(s')\right)$. If the source branch is the root branch, we simulate the attachment point with an exponential distribution, Exp(1), instead. The new admixture proportion is simulated uniformly between 0 and 1, and the admixture branch length, $\tilde{s}$, is simulated from Exp(1) with density $e^{-\tilde{s}}$. Lastly, the labeling of the two parent branches of the new admixture node is simulated. The probability of either possible labeling is $1/2$. In conclusion, the density is
    $$\frac{1}{D} \cdot \frac{1}{D'(s)} \cdot \frac{1}{c(s)} \cdot \frac{1}{c(s')} \cdot e^{-\tilde{s}} \cdot \frac{1}{2} \tag{10}$$
    To find the acceptance probability of this proposal, we calculate the proposal probability of the reverse move (see proposal number 2). The reversible jump Jacobian factor is 1.
  2. Remove an admixture branch from the admixture graph. An admixture branch can be removed if 1) its parent is not an admixture node and 2) its removal will not cause an eye. Let the number of admixture branches eligible for removal be K′. We choose uniformly from that set and remove the admixture branch. The density is
    $$\frac{1}{K'} \tag{11}$$
  3. Node sliding. A random branch whose parent is a divergence node is chosen. We move its attachment point to its source branch a distance $\lambda x$, where $x \sim \chi^2(1)$. A node can often be slid either up or down, and sometimes the sliding node meets a bifurcation where it can slide in either of two directions. We choose the new node position uniformly from the set of the possible sliding destinations, following the topological constraints defined in step 1. If the sliding node slides out of the graph, we reject the proposal. The forward density is $\chi^2(x/\lambda)/\omega$, where $\omega$ is the number of possible sliding destinations for a node when moved a distance $x$ from its position in the current graph. We compute the backward density using the same formula. We update $\lambda$ on-the-fly following guidelines for adaptive proposals in MCMC [29], eliminating the need for pre-analysis parameter tuning.

  4. Random walk on the branch lengths. We add normally distributed noise to all the branch lengths. If any branch length becomes negative, we take its absolute value. This is known as a reflecting boundary random walk. The backward density is identical to the forward density. The variance of the random walk increments is controlled by a parameter s, which we also adapt on-the-fly using adaptive strategies.

  5. Resample admixture proportions. We sample each admixture proportion in the graph independently from the standard uniform distribution.

  6. Random walk on the branch to the outgroup as in step 4 but with another s-value. Negative proposed branch lengths are again reflected.

  7. Random walk on the branch lengths but inside the null space of matrix A. This means that the proposed admixture graph will have the same covariance matrix—and therefore the same likelihood—as the previous graph. This proposal is also adaptive, as in step 4.
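Two of the proposals above are simple enough to sketch in a few lines: the reflecting random walk of proposal 4 and the proposal densities (10) and (11) used when adding or removing an admixture branch. The function names and argument names below are hypothetical, and the sketch covers only the case in which the source branch is not the root branch; it does not reproduce the AdmixtureBayes implementation or its adaptive tuning.

```python
import math
import random


def reflect_random_walk(branch_lengths, sd, rng=random):
    """Proposal 4: add N(0, sd^2) noise to every branch length and reflect at zero."""
    return [abs(c + rng.gauss(0.0, sd)) for c in branch_lengths]


def log_add_admixture_density(n_branches, n_sources, c_sink, c_source, new_branch_length):
    """Log of Eq (10) for a non-root source branch.

    n_branches        : D, number of candidate sink branches
    n_sources         : D'(s), number of admissible source branches given the sink
    c_sink, c_source  : lengths of the sink and source branches
    new_branch_length : length of the new admixture branch, drawn from Exp(1)
    """
    return (-math.log(n_branches) - math.log(n_sources)
            - math.log(c_sink) - math.log(c_source)
            - new_branch_length - math.log(2.0))


def log_remove_admixture_density(n_removable):
    """Log of Eq (11): uniform choice among the K' removable admixture branches."""
    return -math.log(n_removable)


rng = random.Random(0)  # fixed seed only to make the example reproducible
proposed = reflect_random_walk([0.01, 0.2, 0.05], sd=0.1, rng=rng)

# The ratio of reverse to forward proposal densities enters the acceptance probability.
log_forward = log_add_admixture_density(4, 2, 0.5, 0.5, 0.0)
log_reverse = log_remove_admixture_density(1)
print(proposed, math.exp(log_forward), math.exp(log_reverse))
```

Working on the log scale avoids numerical underflow when these densities are combined with the likelihood and prior ratios.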

Fig 5. When adding an admixture branch (green), we will randomly draw the branch where it comes from, the source branch (red).

The admixture branch goes into the sink branch (blue).

Graph summaries

In the Results section we explained the two summaries, minimal topology and consensus graph, which we will define formally here. Furthermore, we introduced the Set Distance used to measure distances between admixture graph topologies and the Covariance Distance for distances between admixture graphs. In this section, we define these quantities.

The Covariance Distance between two admixture graphs with L leaves and covariance matrices Σ and Σ′, respectively, is

$$\sum_{i=1}^{L} \sum_{j=1}^{L} \left(\Sigma_{ij} - \Sigma'_{ij}\right)^2 \tag{12}$$

For a single node, let the descendant set be the set of its leaf descendants, e.g. $t = \{l_1, l_2, \dots, l_a\}$. For a topology, let $T$ be the topology set, which is the set of descendant sets of all its nodes, except the leaves and the root. This idea of a topology set is similar to the idea in the software package PhyloNet of considering the set of clusters induced by each edge of an evolutionary network [10] [12]. The minimal topology is the extension of such a topology set to a directed graph. The extension starts by adding the trivial descendant sets for the leaves (containing only one leaf) and the root (containing all the leaves). Denote this set $\bar{T}$. The minimal topology has a node for each element of $\bar{T}$, and there is a connection from node $t \in \bar{T}$ to $t' \in \bar{T}$ if

$$t' \subseteq t \tag{13}$$
$$t' \neq t \tag{14}$$

and

$$\nexists\, t'' \in \bar{T} \setminus \{t, t'\} : t' \subseteq t'' \subseteq t \tag{15}$$
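The construction of a topology set and its minimal topology can be sketched as below. The sketch reads the connection rule as "an edge runs from a descendant set to each strictly smaller descendant set, provided no third set lies between them", which is our interpretation of the conditions above. The graph encoding (a dictionary from each node to its list of children) and all function and node names are hypothetical.

```python
def descendant_sets(children, root):
    """Map every node reachable from root to the frozenset of leaves below it."""
    memo = {}

    def leaves_below(node):
        if node not in memo:
            kids = children.get(node, [])
            if not kids:
                memo[node] = frozenset([node])  # a leaf descends only to itself
            else:
                memo[node] = frozenset().union(*(leaves_below(k) for k in kids))
        return memo[node]

    leaves_below(root)
    return memo


def topology_set(children, root):
    """Descendant sets of all nodes except the leaves and the root."""
    sets = descendant_sets(children, root)
    return {s for node, s in sets.items() if node != root and children.get(node)}


def minimal_topology(topo_set, leaves):
    """Edges (parent_set, child_set) of the minimal topology."""
    extended = set(topo_set) | {frozenset([leaf]) for leaf in leaves} | {frozenset(leaves)}
    edges = set()
    for t in extended:
        for u in extended:
            # u is a proper subset of t and no other set lies strictly between them.
            if u < t and not any(u < v < t for v in extended):
                edges.add((t, u))
    return edges


# Toy graph: leaf B is admixed between the lineages leading to A and C.
children = {"root": ["x", "y"], "x": ["A", "adm"], "y": ["adm", "C"], "adm": ["B"]}
T = topology_set(children, "root")
edges = minimal_topology(T, ["A", "B", "C"])
for parent, child in sorted(edges, key=lambda e: (sorted(e[0]), sorted(e[1]))):
    print(sorted(parent), "->", sorted(child))
```

In this toy graph the set {B} receives edges from both {A, B} and {B, C}, which is how the minimal topology records the admixture.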

To summarize a sample of admixture graphs, g1, …, gR, using a consensus graph, we first transform all of them into their topology sets and obtain a sample T1, …, TR. The posterior probability of a node can be estimated by the sample frequency

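$$f(t) = \frac{\left|\left\{T \in \{T_1, \dots, T_R\} : t \in T\right\}\right|}{R}$$

The topology set of the consensus graph at threshold α is

$$T_{\alpha} = \left\{ t \in \bigcup_{i=1}^{R} T_i : f(t) > \alpha \right\}. \tag{16}$$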

The consensus graph itself is obtained by extending $T_{\alpha}$ to a directed graph with the rules (13)–(15).
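The node frequencies and the thresholding in (16) amount to counting how often each descendant set occurs across the sampled topology sets. A minimal sketch, with hypothetical function names and a made-up sample of three topology sets, could look like this:

```python
from collections import Counter


def node_frequencies(topology_sets):
    """Estimate f(t): the fraction of sampled topology sets that contain t."""
    counts = Counter()
    for T in topology_sets:
        counts.update(set(T))  # each sampled graph counts a descendant set at most once
    R = len(topology_sets)
    return {t: count / R for t, count in counts.items()}


def consensus_topology_set(topology_sets, alpha):
    """Topology set of the consensus graph at threshold alpha, as in Eq (16)."""
    return {t for t, f in node_frequencies(topology_sets).items() if f > alpha}


# Made-up posterior sample of three topology sets over leaves A, B, C.
AB, BC = frozenset("AB"), frozenset("BC")
samples = [{AB}, {AB, BC}, {AB}]
freqs = node_frequencies(samples)
consensus = consensus_topology_set(samples, alpha=0.5)
print(freqs[AB], freqs[BC], consensus)
```

The resulting set can then be extended to a directed graph in the same way as a minimal topology.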

The Set Distance between two graphs g1 and g2 with topology sets T1 and T2 is

$$|T_1 \setminus T_2| + |T_2 \setminus T_1| \tag{17}$$
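Both distances are direct to compute once the topology sets and covariance matrices are available. The sketch below uses hypothetical function names and made-up inputs, and it implements the Covariance Distance exactly as the sum of squared entrywise differences written in (12).

```python
def set_distance(T1, T2):
    """Eq (17): size of the symmetric difference between two topology sets."""
    return len(T1 - T2) + len(T2 - T1)


def covariance_distance(sigma1, sigma2):
    """Eq (12): sum of squared entrywise differences between two covariance matrices."""
    return sum((a - b) ** 2
               for row1, row2 in zip(sigma1, sigma2)
               for a, b in zip(row1, row2))


# Made-up inputs for illustration.
T1 = {frozenset("AB"), frozenset("B")}
T2 = {frozenset("AB"), frozenset("BC")}
d_set = set_distance(T1, T2)

sigma_a = [[1.0, 0.0], [0.0, 1.0]]
sigma_b = [[1.0, 0.5], [0.5, 2.0]]
d_cov = covariance_distance(sigma_a, sigma_b)
print(d_set, d_cov)  # 2 1.5
```

The Set Distance depends only on the topology, whereas the Covariance Distance also reflects branch lengths and admixture proportions.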

Running AdmixtureBayes, TreeMix, and OrientAGraph

As mentioned in the Results section, we simulated 20 datasets each from 4 distinct admixture graph models, and ran AdmixtureBayes, TreeMix, and OrientAGraph on each dataset. We compared the accuracy of each method using three metrics described in the previous section: the Topology Equality, Set Distance, and Covariance Distance. Comparing their accuracy is not straightforward because TreeMix and OrientAGraph produce one graph whereas AdmixtureBayes produces a posterior sample of graphs. In addition, TreeMix and OrientAGraph assume a fixed number of admixture events, whereas AdmixtureBayes samples graphs with different numbers of admixture events. TreeMix and OrientAGraph can estimate a maximum likelihood graph for a fixed number of admixture events, but the higher the number of admixture events, the higher the maximum likelihood value. Therefore, the original TreeMix paper suggests iteratively adding admixture events and stopping when the added admixture event does not pass a test for statistical significance. However, to simplify the comparison, we ran TreeMix and OrientAGraph with the true number of admixture events. We considered all graphs produced by AdmixtureBayes, even those with the wrong number of admixture events. We note that this could increase the error of AdmixtureBayes.

TreeMix and OrientAGraph first estimate an initial admixture-free tree by iteratively adding best fitting populations in a random procedure. Next, the admixture branches are added deterministically (although the exact method for adding branches differs between the two methods). Because of the randomness of the first step, the starting seed could influence the results. However, preliminary results showed that repeating the TreeMix maximum likelihood optimization for different seeds and choosing the highest likelihood graph amongst the repeated analyses did not change the accuracy of the estimated admixture graphs when analyzing our simulated datasets. Most seeds produced the same maximum likelihood graphs, an observation also found by Molloy et al. [3]. Therefore, we used only one seed for both TreeMix and OrientAGraph.

AdmixtureBayes was run on 20 datasets generated from msprime [18] from 4 distinct admixture graphs (see Fig 1). The Mode graph was chosen as the graph with the highest posterior out of all sampled graphs. Then, the first 35% of each chain was discarded as a burn-in and 100 equally spaced graphs were sampled from the resulting collection of graphs for use in the AdmixtureBayes Mean estimates. The exact code for running each of these analyses is in the SimulationStudy folder on the AdmixtureBayes GitHub.
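The burn-in and thinning step described above can be illustrated with a short sketch. The function name `thin_after_burn_in` and the way equally spaced indices are chosen are assumptions made for the example; the scripts in the SimulationStudy folder remain the authoritative reference for what was actually run.

```python
def thin_after_burn_in(chain, burn_in_fraction=0.35, n_keep=100):
    """Discard the first part of a chain and keep n_keep equally spaced samples.

    chain: list of sampled graphs (any objects), in sampling order.
    Assumes n_keep >= 2.
    """
    start = int(len(chain) * burn_in_fraction)
    kept = chain[start:]
    if len(kept) <= n_keep:
        return list(kept)
    # Spread n_keep indices evenly from the first to the last retained sample.
    step = (len(kept) - 1) / (n_keep - 1)
    return [kept[round(i * step)] for i in range(n_keep)]


# Integers stand in for sampled graphs.
chain = list(range(1000))
thinned = thin_after_burn_in(chain)
half = thin_after_burn_in(list(range(400)), burn_in_fraction=0.5)
print(len(thinned), thinned[-1], half[0], half[-1])
```

Note that, as described in the text, the Mode graph is selected from all sampled graphs before the burn-in is discarded, so that step is deliberately not part of this function.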

Data

We analyzed a dataset consisting of SNPs for 12 human populations that was first analyzed by Moreno-Mayar et al. [17]. We treated the Yoruba population as an outgroup, leaving effectively 11 populations with unknown relationships to estimate. One diploid individual was sampled from each population, except the Koryak, Ket, Greenlander and Athabascan populations, which each had two diploid individuals. Whole-genome sequencing was performed on each individual, with average coverage ranging from 1X (for the Malta individual) to 44.2X (for one of the Greenlander individuals). Further details regarding sequencing and data processing methods are described in Moreno-Mayar et al. [17]. The alleles for the ancient individuals from the populations Saqqaq, Malta, Anzick and USR1 that were not transversions were treated as missing. We then filtered out any site for which there was a population with missing data. In total, 251,542 biallelic SNPs were retained. Large numbers of missing SNPs for some individuals are not a computational problem for AdmixtureBayes, though they do violate the assumption of even sampling imposed by the Wishart distribution (see Methods, Eq (5)). Both the original VCF file and the allele count file used as input to AdmixtureBayes are available on the AdmixtureBayes GitHub.
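The two filtering rules described above (non-transversion alleles are treated as missing in the ancient populations, and sites with any missing population are dropped) can be sketched as follows. The record layout, a dictionary with `ref`, `alt` and per-population `counts`, and the function names are hypothetical and do not correspond to the actual input files or processing scripts.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transversion(ref, alt):
    """True if the two alleles are a purine/pyrimidine pair."""
    ref, alt = ref.upper(), alt.upper()
    return ((ref in PURINES and alt in PYRIMIDINES)
            or (ref in PYRIMIDINES and alt in PURINES))


def filter_sites(sites, ancient_populations):
    """Mask non-transversions in ancient populations, then drop incomplete sites.

    sites: list of dicts with keys 'ref', 'alt' and 'counts', where 'counts'
    maps a population name to an allele count, or None if missing.
    """
    kept = []
    for site in sites:
        counts = dict(site["counts"])
        if not is_transversion(site["ref"], site["alt"]):
            for pop in ancient_populations:
                counts[pop] = None  # treated as missing
        if all(value is not None for value in counts.values()):
            kept.append({**site, "counts": counts})
    return kept


# Made-up sites: a transition (A/G) and a transversion (A/C).
sites = [
    {"ref": "A", "alt": "G", "counts": {"Yoruba": 1, "Saqqaq": 0, "Koryak": 2}},
    {"ref": "A", "alt": "C", "counts": {"Yoruba": 0, "Saqqaq": 1, "Koryak": 3}},
]
retained = filter_sites(sites, ancient_populations=["Saqqaq"])
print(len(retained), retained[0]["alt"])
```

Taken together, the two rules mean that only transversion sites with complete data in every population are retained.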

Supporting information

S1 Fig. In all of our illustrations the direction of edges is from top to bottom, unless marked otherwise.

This multigraph topology has 4 leaves, 1 pair, 2 admixture nodes, 5 divergence nodes, and 1 eye. The root is the node at the very top of the topology. We have not explicitly labeled the nodes and branches.

(EPS)

S2 Fig. Representatives of the sets $T_{3,1,1,0}$ (left), $U_{3,1,1,0}$ (center) and $S_{3,1,1,0}$ (right).

In all of our illustrations on labeled or unlabeled topologies, the admixture edges in $E_A$ are marked with a dashed line. Here $|S_{3,1,1,0}| = 2$, $|U_{3,1,1,0}| = 4$ and $|T_{3,1,1,0}| = N(3,1,1,0) = 12$.

(EPS)

S3 Fig. Illustration of a shape, S1 (left), and the three unlabeled topologies corresponding to a shape S2 (S2 not explicitly drawn).

We have $S_1, S_2 \in S_{2,0,3,1}$, $|U_{S_1}| = 2$ and $U_{S_2} = \{U_1, U_2, U_3\}$. Furthermore, the leaves of $U_1$ and $U_2$ are indistinguishable, while the leaves of $U_3$ can be told apart. To see this, follow the path from the leaves to root; in $U_1$ and $U_2$ the path will only depend on whether the parent branch of the first encountered admixture is in $E_M$ or $E_A$ and not on the starting leaf. In contrast, the starting leaf does matter for $U_3$, so the leaves are distinguishable. Hence, $|T_{U_1}| = |T_{U_2}| = 1$ but $|T_{U_3}| = 2$.

(EPS)

S4 Fig. The two shapes of S(5,2,0,0), denoted S1 and S2 are illustrated above.

Here, US1={U1} and US2={U2}, because there are no admixture edges. Interestingly, the shape S2 exhibits more symmetry than the shape S1. To see this, let G1 and G2 be representatives of U1 and U2 with leaves labeled l1, l2, l3, l4, l5 from left to right. In both cases |TU1|=|TU2|=5!=120. The group HG1=e,(12),(34) has four elements and so |TU1|=120/4=30. The group HG2=e,(12),(34),(13)(24) has eight elements and so |TU2|=120/8=15. Altogether, N(5, 2, 0, 0) = 15 + 30 = 45. Notice that the leaves l4 and l5 form a pair in one fifth of the elements in both |TU1| and |TU2| although the two sets are of different size.

(EPS)

S5 Fig. Example graphs and their predecessors from each sub case 1.1)—2.4).

The graph ρ(G1.2) is the only labeled admixture graph that doesn’t have a predecessor, and the ultimate predecessor of every other graph.

(EPS)

S6 Fig. From the posterior AdmixtureBayes samples, we computed the posterior probability of all nodes.

The above graph is the smallest directed graph with all the nodes that have a posterior probability higher than 75%. Each internal node is colored according to its posterior probability, as described in Fig 3.

(EPS)

S7 Fig. From the posterior AdmixtureBayes sample, we computed the posterior probability of all minimal topologies for several subsets of the populations.

Here we show the three topologies with the highest posterior. The listed posterior for each graph represents the percentage of sampled graphs that have this minimal topology induced by the relevant set of nodes. For example, in 96% of graphs sampled, if only the leaves Athabascan, Koryak, and Saqqaq are considered, then all non-root and non-leaf nodes have a topology set of {Athabascan, Koryak}. The percentages in each node are the percentage of sampled graphs that have a node with the topology set implied by that node. For example, in 96% of graphs sampled, if only the leaves Athabascan, Koryak, Saqqaq are considered, then there is at least one node with the topology set {Athabascan, Koryak}. Whether a graph with this node belongs in the top left or bottom left box will depend on the presence or absence of a node with the topology set {Athabascan, Saqqaq}, which happens in 1% of all sampled graphs.

(EPS)

S8 Fig. Continuation of S7 Fig.

(EPS)

S9 Fig

The method used to calculate the Set Distance between two admixture graph topologies (left). First, the topologies are transformed into their descendant sets/topology sets (middle). The distance is then calculated as the symmetric set distance between the two topology sets (right).

(TIF)

S10 Fig. Examples of how the minimal topology is calculated.

First, we derive the topology set (middle) from the topology (left). The minimal topology (right) is the smallest possible graph that is consistent with the topology set. Note, node labels assigned to the topology (left) are arbitrary and do not identify corresponding nodes in the minimal topology (right).

(TIF)

S11 Fig. Here, we plot the trace plots for our simulated dataset.

Each chain is shown as a separate column. Each summary statistic is shown as a separate row.

(EPS)

S12 Fig. We plot the Gelman-Rubin convergence diagnostics on our simulated dataset for our three summary statistics after a burn-in fraction of 0.35.

A rapid convergence to 1 indicates that this is a sufficient burn-in period.

(EPS)

S13 Fig. We here show the autocorrelation plots for the summary statistics of our simulated data after a burn-in fraction of 0.35.

We only show the results for Chain 1 and do not include the number of admixture events as the autocorrelation shows strange behavior for discrete variables.

(EPS)

S14 Fig. We simulated admixture graphs with 10 leaves and 0, 1 and 2 admixture events.

Using these graphs, we simulated datasets using ms with different sample sizes. The top plot illustrates the ratio between the maximum likelihood degrees of freedom estimate from Eq (6) and the variance estimator in Eq (7). We ran AdmixtureBayes with the maximum likelihood estimate (MLE), the variance estimate (VAR), and 2 and 4 times the variance estimate (VARx2 and VARx4 respectively). We calculated the Mean Topology Equality, which was maximized when using the VAR estimates.

(EPS)

S15 Fig. We here plot the results of our simulations evaluating the effect of sampling small numbers of haplotypes.

Each boxplot represents the samples of posterior probabilities obtained by running AdmixtureBayes on 100 different datasets, each simulated from the same model with the underlying population history ((pop1,pop2),pop3). We plot a horizontal dashed line at 1/3, which would represent equal posterior probability at each topology. AdmixtureBayes does sample some admixture graphs that do have admixture events, but the 3 non-admixed topologies we list here account for more than 99% of the sampled admixture graphs in each case, so we simply ignore sampled topologies that do have admixture events. We see that increasing the number of sampled haplotypes causes the posterior probability to concentrate on the true graph, while AdmixtureBayes correctly models the uncertainty inherent to sampling fewer haplotypes.

(EPS)

S16 Fig. Here we plot the results obtained when running TreeMix and OrientAGraph on our dataset of Native American and Arctic populations.

We run each method with 3, 4, and 5 admixture events as these were the numbers of admixture events in nearly all graphs sampled by AdmixtureBayes after the burn-in period (see S11 Fig).

(TIF)

S1 Text. Supplementary information.

(PDF)

Acknowledgments

We thank the members of the Nielsen Lab for their helpful thoughts on both the manuscript and the AdmixtureBayes GitHub documentation. We also thank J. Víctor Moreno-Mayar for his helpful discussions on analyzing the real dataset.

Data Availability

All data are fully available without restriction at https://github.com/avaughn271/AdmixtureBayes.

Funding Statement

RN received grant R01GM138634, National Institutes of Health https://www.nih.gov. AHV received grant 2146752, National Science Foundation Graduate Research Fellowship Program https://www.nsfgrfp.org. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Patterson NJ, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, et al. Ancient Admixture in Human History. Genetics. 2012;192(3):1065–1093. doi: 10.1534/genetics.112.145037
  • 2. Pickrell JK, Pritchard JK. Inference of Population Splits and Mixtures from Genome-Wide Allele Frequency Data. PLOS Genetics. 2012;8(11):1–17. doi: 10.1371/journal.pgen.1002967
  • 3. Molloy EK, Durvasula A, Sankararaman S. Advancing admixture graph estimation via maximum likelihood network orientation. Bioinformatics. 2021;37(Supplement_1):i142–i150. doi: 10.1093/bioinformatics/btab267
  • 4. Lipson M, Loh PR, Levin A, Reich D, Patterson N, Berger B. Efficient moment-based inference of admixture parameters and sources of gene flow. Molecular biology and evolution. 2013;30(8):1788–1802. doi: 10.1093/molbev/mst099
  • 5. Yan J, Patterson N, Narasimhan VM. miqoGraph: fitting admixture graphs using mixed-integer quadratic optimization. Bioinformatics. 2020;37(16):2488–2490. doi: 10.1093/bioinformatics/btaa988
  • 6. Cavalli-Sforza LL, Edwards AW. Phylogenetic analysis: models and estimation procedures. Evolution. 1967;21(3):550–570. doi: 10.2307/2406616
  • 7. Coop G, Witonsky D, Di Rienzo A, Pritchard JK. Using environmental correlations to identify loci underlying local adaptation. Genetics. 2010;185(4):1411–1423. doi: 10.1534/genetics.110.114819
  • 8. Cheng JY, Stern AJ, Racimo F, Nielsen R. Detecting Selection in Multiple Populations by Modeling Ancestral Admixture Components. Molecular Biology and Evolution. 2021;39(1). doi: 10.1093/molbev/msab294
  • 9. Gautier M. Genome-wide scan for adaptive divergence and association with population-specific covariates. Genetics. 2015;201(4):1555–1579. doi: 10.1534/genetics.115.181453
  • 10. Than C, Ruths D, Nakhleh L. PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships. BMC bioinformatics. 2008;9(1):322. doi: 10.1186/1471-2105-9-322
  • 11. Zhang C, Ogilvie HA, Drummond AJ, Stadler T. Bayesian inference of species networks from multilocus sequence data. Molecular biology and evolution. 2018;35(2):504–517. doi: 10.1093/molbev/msx307
  • 12. Wen D, Yu Y, Zhu J, Nakhleh L. Inferring phylogenetic networks using PhyloNet. Systematic biology. 2018;67(4):735–740. doi: 10.1093/sysbio/syy015
  • 13. Yu Y, Nakhleh L. A maximum pseudo-likelihood approach for phylogenetic networks. BMC genomics. 2015;16(S10):S10. doi: 10.1186/1471-2164-16-S10-S10
  • 14. Solís-Lemus C, Ané C. Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS genetics. 2016;12(3). doi: 10.1371/journal.pgen.1005896
  • 15. Rogers J, Raveendran M, Harris RA, Mailund T, Leppälä K, Athanasiadis G, et al. The comparative genomics and complex population history of Papio baboons. Science Advances. 2019;5(1). doi: 10.1126/sciadv.aau6947
  • 16. Leppälä K, Nielsen SV, Mailund T. admixturegraph: an R package for admixture graph manipulation and fitting. Bioinformatics. 2017;33(11):1738–1740. doi: 10.1093/bioinformatics/btx048
  • 17. Moreno-Mayar JV, Potter BA, Vinner L, Steinrücken M, Rasmussen S, Terhorst J, et al. Terminal Pleistocene Alaskan genome reveals first founding population of Native Americans. Nature. 2018;553(7687):203–207. doi: 10.1038/nature25173
  • 18. Kelleher J, Etheridge AM, McVean G. Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLOS Computational Biology. 2016;12(5):1–22. doi: 10.1371/journal.pcbi.1004842
  • 19. Wu Y. Inference of population admixture network from local gene genealogies: a coalescent-based maximum likelihood approach. Bioinformatics. 2020;36(Supplement_1):i326–i334. doi: 10.1093/bioinformatics/btaa465
  • 20. Friesen TM, Mason OK. The Oxford handbook of the prehistoric Arctic. Oxford University Press; 2016.
  • 21. Reich D, Patterson N, Campbell D, Tandon A, Mazieres S, Ray N, et al. Reconstructing native American population history. Nature. 2012;488(7411):370–374. doi: 10.1038/nature11258
  • 22. Rasmussen M, Li Y, Lindgreen S, Pedersen JS, Albrechtsen A, Moltke I, et al. Ancient human genome sequence of an extinct Palaeo-Eskimo. Nature. 2010;463(7282):757–762. doi: 10.1038/nature08835
  • 23. Raghavan M, DeGiorgio M, Albrechtsen A, Moltke I, Skoglund P, Korneliussen TS, et al. The genetic prehistory of the New World Arctic. Science. 2014;345(6200). doi: 10.1126/science.1255832
  • 24. Skoglund P, Reich D. A genomic view of the peopling of the Americas. Current Opinion in Genetics & Development. 2016;41:27–35. doi: 10.1016/j.gde.2016.06.016
  • 25. Flegontov P, Altınışık NE, Changmai P, Rohland N, Mallick S, Adamski N, et al. Palaeo-Eskimo genetic ancestry and the peopling of Chukotka and North America. Nature. 2019;570(7760):236–240. doi: 10.1038/s41586-019-1251-y
  • 26. Geyer CJ. Markov chain Monte Carlo maximum likelihood; 1991. Available from: https://www.stat.umn.edu/geyer/f05/8931/c.pdf
  • 27. Altekar G, Dwarkadas S, Huelsenbeck JP, Ronquist F. Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics. 2004;20(3):407–415. doi: 10.1093/bioinformatics/btg427
  • 28. Green PJ, Hastie DI. Reversible jump MCMC. Available from: http://people.ee.duke.edu/~lcarin/rjmcmc_20090613.pdf
  • 29. Andrieu C, Thoms J. A tutorial on adaptive MCMC. Statistics and Computing. 2008;18(4):343–373. doi: 10.1007/s11222-008-9110-y

Decision Letter 0

David Balding, Sharon R Browning

7 Nov 2022

Dear Dr Vaughn,

Thank you very much for submitting your Methods entitled 'Bayesian inference of admixture graphs on Native American and Arctic populations' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review a much-revised version. We cannot, of course, promise publication at that time.

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool.  PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Sharon R. Browning

Guest Editor

PLOS Genetics

David Balding

Section Editor

PLOS Genetics

The reviewers were all positive about this manuscript, but have suggestions for improving the work. We invite the authors to address as many of these suggestions as possible in a revised submission.  In preparing your revision please consider using the PLOS Genetics format for Methods papers, see https://journals.plos.org/plosgenetics/s/submission-guidelines#loc-manuscript-organization

Reviewer's Comments to the Authors:

Reviewer #1: See attached PDF.

Reviewer #2: S. Nielsen et al. have developed a novel method for characterizing admixture graphs. The authors adapt an existing allele frequency covariance approach and add MCMC statistical machinery that allows a richer characterization of admixture graphs compared to existing methods that mostly focus on estimating a single best admixture graph.

I agree that searching for the best topology, or a set of reasonable ones, for an admixture graph is a real problem, as existing methods are not particularly well suited to finding the set of high-probability admixture graphs. These methods often require extensive direct input and subjective judgment, making them difficult to apply in a consistent manner. The authors also present a small suite of topological-based methods for summarizing admixture graphs that are potentially useful moving forward.

The authors apply the new method to simulated data, compare the results with an existing related method, TreeMix, and apply the method to a set of 11 population samples with the goal of better understanding the genetic relationship between Koryak, Saqqaq, Athabascans, and native Greenlanders.

It is a substantial amount of work to develop an MCMC procedure to efficiently sample the huge space of possible admixture graphs. The authors present a set of 7 proposals for alterations to graphs that seem to be able to move between possible graphs in a way that respects detailed balance. It is difficult to evaluate the extent to which the full space of possible graphs is explored, but I will assume it is. Towards this goal, the authors use an adaptive Metropolis-Coupled MCMC that attempts to better explore multimodal distributions.
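For readers unfamiliar with Metropolis coupling, the state-swap step between two heated chains can be sketched as follows. This is a minimal illustration of the standard swap rule, assuming chain k targets the posterior raised to a power `beta_k`. The function names, arguments, and data layout are hypothetical and are not taken from the AdmixtureBayes code.

```python
import math
import random


def swap_acceptance_probability(log_post_i, log_post_j, beta_i, beta_j):
    """Probability of accepting a state swap between chains i and j.

    Chain k targets posterior**beta_k, so exchanging the two states changes
    the joint log-density by (beta_i - beta_j) * (log_post_j - log_post_i).
    """
    log_ratio = (beta_i - beta_j) * (log_post_j - log_post_i)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def maybe_swap(states, log_posts, betas, i, j, rng=random):
    """Swap the states of chains i and j in place; return True if swapped."""
    p = swap_acceptance_probability(log_posts[i], log_posts[j], betas[i], betas[j])
    if rng.random() < p:
        states[i], states[j] = states[j], states[i]
        log_posts[i], log_posts[j] = log_posts[j], log_posts[i]
        return True
    return False
```

Under this rule a heated chain that has found a better state always hands it to the colder chain, while the reverse exchange is accepted only occasionally, which is what lets the cold chain escape local modes.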

While I found the text generally clear, I found the structure and flow of the manuscript challenging. PLOS Genetics prescribes a Results/Discussion/Methods format. I understand this structure has the potential to make it difficult to find the best way to present Methods-focused papers, but I think the manuscript would benefit from reorganization.

For example, the results section starts with an evaluation of MCMC chains. While this is an important aspect of conducting any MCMC analysis, it is not really the main focus of the current study, and no results are actually presented in the text. It is followed by a section comparing the method to TreeMix that spends 1.5 pages describing the methods for the comparison. Again, this is important, but it should not precede the basic description of AdmixtureBayes. Main figures 3, 4, 5, 6, 7 are only referenced in the appendix, and all figures applying the method to data (simulated or real) are supplemental figures. This is out of step with the title, which highlights applying this method to real data: “... on Native American and Arctic populations”.

I suggest the Results section should cover the following topics in order - 1) Method introduction, justification and important details, 2) Evaluation on simulated data / comparison to other methods 3) Application to real data. This would better match other papers in the field, such as the Treemix paper, also published in PLOS Genetics. This also better matches the alternative formatting for Methods papers at: https://journals.plos.org/plosgenetics/s/submission-guidelines.

The Methods section can supplement this formatting, by starting with e.g., 1) a full description of the method, 2) Description of simulations / comparisons to other methods, 3) real data sets. Self-contained details of specific aspects can be moved to an appendix or supplement.

I was able to download and install AdmixtureBayes from the provided GitHub link and to run an analysis on the provided example data file without any problems. My strong suggestion for the software would be to utilize a common data format, such as VCF, rather than a file format unique to a particular program. I would also suggest splitting step 1 in the analysis outline into two steps: 1) estimating the population covariance matrix and 2) running the MCMC. I did not see any built-in way to estimate or examine the covariance matrix.
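To make the idea of a separate covariance-estimation step concrete, the sketch below computes a simple second-moment matrix of allele frequencies measured relative to an outgroup. It is an illustration under stated assumptions only: the function name and input layout are hypothetical, and finite-sample bias corrections and any normalization that AdmixtureBayes itself applies are omitted.

```python
def outgroup_covariance(freqs, outgroup):
    """Return {(pop_a, pop_b): value} for all pairs of non-outgroup populations.

    freqs maps a population name to a list of allele frequencies, one per SNP,
    with every list in the same SNP order. Each value is the mean over SNPs of
    (p_a - p_outgroup) * (p_b - p_outgroup).
    """
    ref = freqs[outgroup]
    n_snps = len(ref)
    pops = [p for p in freqs if p != outgroup]
    # Per-SNP differences from the outgroup frequency.
    diffs = {p: [x - r for x, r in zip(freqs[p], ref)] for p in pops}
    cov = {}
    for a in pops:
        for b in pops:
            cov[(a, b)] = sum(x * y for x, y in zip(diffs[a], diffs[b])) / n_snps
    return cov
```

Exposing a matrix like this as its own output would let a user inspect it (for example, check that it is symmetric and that closely related populations share large entries) before starting a long MCMC run.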

Small issues:

The authors show how small numbers of sampled haplotypes can reduce the accuracy of estimated admixture graph topology (figure S2). However, it was not immediately clear if AdmixtureBayes results accurately characterize the uncertainty due to sampling few haplotypes. This seems crucial to interpreting the results applying AdmixtureBayes to real data, such as in the current manuscript.

Due to a lack of background, I was not able to adequately review the topological arguments present in Appendix A.3, (the justification of the flat prior on the topology). This could be split off into a separate manuscript to be submitted to another journal such as Theoretical Population Biology, or another appropriate journal where it could stand on its own and get a separate review.

The application of AdmixtureBayes to the real data set and the accompanying discussion did a good job highlighting how often researchers have specific questions, such as “is there admixture present?”, “what is the source of admixture in this population?”, etc. As a suggestion, it could be useful to examine different types of relationships between populations, and how well they are characterized by AdmixtureBayes. Currently this is all done in figure S3, using the mode set distance statistic, which is not easy to interpret biologically.

It was difficult to find the number of simulation and analysis replicates that the simulation and TreeMix comparison results were based on.

The code for the two simulation evaluations is nice to include in the manuscript, but a higher-level explanation of the demography, as well as a statement of why this is an appropriate demography / admixture graph, would aid understanding of the results.

Page 4- “It has previously been used in several other methods aimed at modeling the joint distribution of allele frequencies among populations” please support with citations.

Page 7 - “We then ran AdmixtureBayes for three different chains, each with --MCMC chains 16” is this 3, 16 or 48 chains?

Page 19 - Admixture graphs are described here, and this could be useful if the rest of the paper utilizes this description. However, it is not clear how to match the description here with the example admixture graph present in Figure 1. This description states “There exists one and only one root. That is a node with no parents and exactly two children”. Yet, to my eye, Figure 1 contains zero nodes with no parents and two children.

Page 35 - It would help to provide a figure of the admixture graph that is simulated by this code.

Figures

In general, I found the figures under-described, with many explanations not repeated on each relevant figure. Many main figures were not referenced in the main text, but I appreciated the figure representing topological concepts. In general, I thought the supplemental figures were better suited for the main text, while most main figures were better suited to the supplement.

S1 - Should be moved to main. Reorder the subplots so similar statistics are next to each other, e.g., (mode, mean) topology equality, (mode, mean) set distance.

S3 - why was only this statistic, out of the 4 based on topology, selected to be shown?

S4 - why are the two subfigures different sizes? Node labeling could be improved as it is difficult to distinguish the support from the node label. Please explain coloring in the figure legend.

S6 and S7 - I think these plots are good examples of the additional utility possible with AdmixtureBayes. However, in this presentation I did not understand how the posterior of the interior nodes combined with the posterior for the topology, e.g., the ((Koryak, Saqqaq), Athabascan) tree.

Reviewer #3: Manuscript Review

PGENETICS-D-22-01018

“Bayesian inference of admixture graphs on Native American and Arctic populations”

In this study, the authors evaluated the ability of admixture graphs to reconstruct population relationship in the Americas based on genome data. They present a new reversible jump MCMC algorithm for sampling high-probability admixture graphs and show that this approach works well both as a heuristic search for a single best-fitting graph and for summarizing shared features extracted from posterior samples of graphs. They subsequently used this method with data from 11 Native American and Siberian populations to address the relationship between Saqqaq, Inuit, Koryaks, and Athabascans. In contrast with previous work, they find that “the Saqqaq is not a good proxy for the previously identified gene flow from Arctic people into the Na-Dene speaking Athabascans.”

From my assessment of the admixture graph analysis, the authors’ analysis seems robust, and has been directed at issues concerning optimality and likelihood issues arising from the use of other admixture graphing methods commonly employed in population genetics studies.

Yet, having followed the development of admixture graphs over a number of years, I would like to pose some questions to the authors about their use for modeling the peopling of the Americas based on ancient and modern genome data.

From an admixture standpoint, how accurately can a single genome represent a “population”? Can the Saqqaq individual really represent an entire ancestral group? What is the consequence of adding more individuals per population when estimating admixture graphs? Can this effect be modeled in order to determine how population level data (e.g., 20-30 individuals) will affect the accuracy of these estimates? How much certainty can we place on the results of admixture analysis with ancient and modern individuals when data from key ancient populations are lacking, i.e., those ancestral to modern groups which now occupy certain regions of northeast Asia or North America?

These are germane questions for efforts to reconstruct the peopling of the Americas, when initial population movements into North America may have been pulsatile in nature and those occurring in Beringia during the latter stages of the LGM may have been quite dynamic.

On this same note, modeling the relationships between circumarctic populations is not necessarily easy. Greenlanders are not precisely the same group as Alaskan Inuit, Yupik, Alutiiq or Aleut populations based on language and culture, which took hundreds if not thousands of years to evolve. The same thing is true to some extent with genetic data, which suggests subtle differences between Eskimoan-speaking groups and more substantial ones between these groups and Aleuts and Athapaskans (Na-Dene). The Athapaskans used in this study also represent only one group of Na-Dene speakers whose homeland may be Southeast Alaska and from which they spread into what is now Alaska and Canada.

In this regard, the authors seem to indicate that they consider Athapaskans to be “Native Americans” inclusive of Amerindians (the first wave of settlers). Is this the case or is the term “Native American” being used here simply to distinguish between Eskaleut-speaking groups and all other indigenous populations of the Americas?

I do find it intriguing that the authors’ analysis suggests that Athapaskans are best represented as “the result of admixture between a Native American population and a Siberian population most closely related to the Koryak, but not the Saqqaq.” However, previous studies have produced Y-chromosome data linking circumarctic groups including the Saqqaq and Koryaks, and the Saqqaq also show affinities with the Nganasan, Koryak, and Chukchi of northeastern Siberia based on autosomal data. In light of these findings, how can we explain the admixture results presented by the authors?

While recognizing that this manuscript has been submitted to a genetics journal, it is important to reconcile genetic admixture studies such as this with other archaeological, ethnographic, genetic and linguistic evidence to ‘ground truth’ the results of the modeling work. For this reason, I would be most interested to know the authors’ perspective on these issues.

Reviewer #4: Here, Nielsen et al (2021) present a Bayesian method for estimating admixture graphs from allele frequency data. Their approach, AdmixtureBayes, is evaluated on simulated and real data sets. The results on simulated data sets are promising, and the results on biological data (on "peopling of the Americas") are quite interesting as well (although I am not a domain expert). Overall, this paper makes a strong contribution to the literature on admixture graphs, and I expect the AdmixtureBayes methodology to be of interest to method developers and method users alike.

Nevertheless, this paper could be improved in several ways. I make the following recommendations.

(1) The biological data set should be made publicly available (specifically the authors should provide the VCF and allele frequencies rather than just the accession IDs for genomes).

(2) The comparison study should include an additional method (their current comparison is against TreeMix but a new method OrientAGraph has been shown to improve the topological accuracy of TreeMix).

(3) There are some figures that should be updated so that readers can more quickly understand the differences between running TreeMix and AdmixtureBayes on simulated and biological data sets.

Details are below.

#1. Data Availability

---

This is a major area of research with new methods being developed each year. A limiting factor to comparing methods and developing new methods is that many studies do not publish their data in a usable form. Often only accession IDs for genomes are provided, even though the developed methods require VCFs or allele frequencies as input.

+1a. It looks like Nielsen et al (2021) currently only provide a link to the original Nature paper (which in turn only provides accession IDs). I strongly recommend that Nielsen et al (2021) make their processed data (i.e., VCF and allele frequencies) publicly available. This will make their results reproducible and will help future studies, increasing the impact of this paper. (Of all of my comments, this is the one that I feel is the most important to address.)

+1b. It's great that Nielsen et al (2021) make the msprime commands available in the paper so that future researchers can benchmark methods on the simulated data sets related to this study (they go above and beyond and provide the ms command as well). I encourage the authors to upload the simulated data sets themselves to Github or some other platform.

#2. Discussion of and comparison to existing methods

---

+2a. Nielsen et al (2021) discuss existing admixture graphs methods on pages 3-4, focusing on well-established methods, like TreeMix (2012), qpGraph (2012), and MixMapper (2013). I recommend discussing some more recently developed methods, like miqoGraph (2021) and OrientAGraph (2021), in addition.

+2b. Nielsen et al (2021) write on page 4 "TreeMix searches through many potential admixture graphs without user input." I recommend changing this sentence to read something like "TreeMix searches through potential admixture graphs without user input by way of an efficient greedy heuristic." This would tie into the comments on page 4 about the downsides of greedy algorithms (where they fail to effectively search network space, getting trapped in local optima).

+2c. I recommend that Nielsen et al (2021) include the command used to run TreeMix on page 37ish. It could also be interesting to note that their observation (that changing the seed doesn't impact the graph topology recovered by TreeMix very much) was also reported by Molloy et al (2021), although these authors varied the population addition order explicitly rather than the seed.

+2d. I recommend that Nielsen et al (2021) add OrientAGraph to their comparison study because OrientAGraph has been demonstrated to produce graphs that are as accurate as those produced by TreeMix or else more accurate. This comparison should be relatively easy to add because OrientAGraph was shown to be negligibly slower than TreeMix on a data set with 10 populations and 2 admixture graphs. Moreover, OrientAGraph takes the same input as TreeMix (allele frequencies) and has nearly the same command. The only difference is the addition of -mlno 1,2 and -allmigs 1,2 flags (where 1,2 indicate that the admixture graph topology search should be expanded with the MLNO and ALLMIG algorithms after adding the first gene flow edge and after adding the second gene flow edge).

+2e. It would be great to add a comparison to miqograph as well but this method is more difficult to run in my experience because it relies on Gurobi and the output of the admixture graph is in a non-standard format.

#3. Admixture Bayes Method

---

+3a. Nielsen et al (2021) write on page 3, "We here improve on these approaches by developing a novel MCMC sampling method, AdmixtureBayes, that can sample from the posterior distribution of admixture graphs. This enables an efficient search of the entire state space as well as the ability to report a level of confidence in the sampled graphs." I recommend replacing "efficient" with "effective" or some other adjective because this study at present doesn't emphasize efficiency. In particular, metrics like runtime are not currently reported for all methods / data sets studied. In addition, the authors do not report the number of admixture graph topologies explored by the different methods. My guess is that the MCMC method is less efficient because the runtime on the biological data set was 50 hours (my guess is that TreeMix would take just a few minutes on this data set, discounting the time to estimate summary statistics from allele frequencies).

+3b. I found the description of AdmixtureBayes to be well-written and well-motivated. My only question was how the MCMC begins. The proposals must be performed on some starting admixture graph (both topology and numerical parameters), so I was curious whether this graph was selected at random, constructed using some heuristic, or something in between.

+3c. Nielsen et al (2021) define the concepts of minimal topology and consensus graphs, along with tools for computing and visualizing them. It would be great if the authors could discuss how these ideas compare to the techniques used by existing methods (like PhyloNet), which I believe also produce summaries of network space.

#5. Results

---

+5a. I found Figure S1/S2/S3/S14 to be somewhat difficult to interpret because only a single value is reported for multiple data sets. I recommend Nielsen et al (2021) re-plot these figures as boxplots (with data points plotted over the boxes) or as violin plots. It could also be helpful to plot topological error for TreeMix (e.g., set distance between true graph and estimated graph) against error for AdmixtureBayes (e.g., set distance between true graph and MAP graph). This would clearly show when AdmixtureBayes is outperforming TreeMix for each data set. Alternatively, the difference between methods on a given data set could be plotted as a box plot.

+5b. Currently, the authors provide a text description of differences between prior analyses on the "peopling of the Americas" and the results of running AdmixtureBayes. It would be very helpful to show this information as a figure. For example, it would be great if Nielsen et al (2021) provided the result of running TreeMix as part of Figure S4, with the same node labelings.

+5c. I recommend the authors compute likelihood scores for the graphs shown in Figures S4, S5, and S6 and include this information in the figure (it seems like numerical parameters could simply be reoptimized for the given topologies and then the likelihood scores reported). This information would help readers compare the results to prior studies where likelihood scores have been reported. It could also emphasize points being made by Nielsen et al (2021) about the values of Bayesian methods compared to ML methods in this context.

+5d. I recommend that Nielsen et al (2021) provide the specific commands used for running TreeMix and AdmixtureBayes (or provide the analysis scripts).

#6. Links:

---

miqograph - https://doi.org/10.1093/bioinformatics/btaa988

OrientAGraph - https://doi.org/10.1093/bioinformatics/btab267

PhyloNet - https://doi.org/10.1093/sysbio/syy015

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: No: See my review

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Reviewer #4: No

Attachment

Submitted filename: report.pdf

Decision Letter 1

David Balding, Sharon R Browning

23 Jan 2023

Dear Dr Vaughn,

We are pleased to inform you that your manuscript entitled "Bayesian inference of admixture graphs on Native American and Arctic populations" has been editorially accepted for publication in PLOS Genetics. Congratulations! In your final submission, please address the four minor points raised by reviewers 2 and 4.

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Sharon R. Browning

Guest Editor

PLOS Genetics

David Balding

Section Editor

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

Reviewer's Comments to the Authors:

Reviewer #1: All of my comments have been thoroughly addressed. I think the paper makes an important and practical methodological contribution, and warrants publication.

Reviewer #2: S. Nielsen and coauthors have substantially updated their manuscript and addressed many of the reviewers' first-round issues. Specifically, they have reorganized the main and supplemental texts, reworked the analysis of simulated data, compared to a wider array of alternative methods, updated the MCMC proposal, and made a number of smaller changes to the text and computer program.

I have just two very small comments on the revised version.

In Figure 2 the text labels are too small. In addition, in my opinion the scale is off - any moderate differences between methods are squashed by the figure design.

I appreciate that the authors added further investigations of the effect of smaller sample sizes, even adding a new section to address it specifically (Section A.4). However, I disagree with one specific statement the authors make in that section. The authors state: “we observe that while the true topology is indeed the one inferred by AdmixtureBayes to have the highest posterior in both cases, the simulated datasets with 40 haplotypes generate a probability distribution that is much more concentrated on the true admixture graph. We therefore conclude that AdmixtureBayes correctly characterizes uncertainty due to sampling small numbers of haplotypes from populations.”

However, the authors never actually evaluate whether this uncertainty is accurately quantified; instead, they simply observe that it increases with smaller sample sizes, consistent with what is expected. I suggest that the authors simply note this consistency and remove the claim that the uncertainty is accurately characterized.
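The distinction drawn here, between uncertainty that merely grows and uncertainty that is accurately quantified, corresponds to a calibration check over simulated replicates. The sketch below shows one generic way such a check could be tabulated. It is an assumption-laden illustration: the function name and inputs are hypothetical, and it is not an analysis performed in the manuscript.

```python
def calibration_table(posterior_probs, is_true, n_bins=5):
    """Bin replicates by reported posterior probability and compare with truth.

    posterior_probs[k] is the posterior probability a method assigned to some
    feature (e.g. a topology) in simulated replicate k; is_true[k] says whether
    that feature is present in the simulated graph.
    Returns (bin_lower, bin_upper, mean_reported, observed_fraction, count)
    for every non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, t in zip(posterior_probs, is_true):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, t))
    table = []
    for idx, members in enumerate(bins):
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        frac = sum(1 for _, t in members if t) / len(members)
        table.append((idx / n_bins, (idx + 1) / n_bins, mean_p, frac, len(members)))
    return table
```

If the posterior probabilities were well calibrated, the mean reported probability and the observed fraction would roughly agree within each bin, given enough replicates per bin.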

Reviewer #4: I was very positive about the paper during the first round of review, and all of my comments (about method comparisons, evaluation metrics, and supporting information) have been addressed (for reference I was reviewer #4). I found the new results in the revised paper quite interesting. The simulation study shows that AdmixtureBayes is effective in reconstructing admixture scenarios where existing methods fail (note the results of the biological data set are interesting but I am not a domain expert here). My opinion is that the AdmixtureBayes method will be of great interest to many readers of PLOS Genetics.

Minor comments:

---------------

Potential typo on page 6: "1 of these nodes is called the outgroup and has exactly one child, the root." This makes sense to me from the phylogenetics network literature and in the context of the model, but it could be confusing to some readers looking at Figure 1 (where the outgroup is a leaf vertex that has the root as its parent).

Potential typo on page 29: AdmixtureGraph GitHub -> AdmixtureBayes Github?

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #4: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #4: No

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-22-01018R1

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

David Balding, Sharon R Browning

7 Feb 2023

PGENETICS-D-22-01018R1

Bayesian inference of admixture graphs on Native American and Arctic populations

Dear Dr Vaughn,

We are pleased to inform you that your manuscript entitled "Bayesian inference of admixture graphs on Native American and Arctic populations" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Timea Kemeri-Szekernyes

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. In all of our illustrations the direction of edges is from top to bottom, unless marked otherwise.

    This multigraph topology has 4 leaves, 1 pair, 2 admixture nodes, 5 divergence nodes, and 1 eye. The root is the node at the very top of the topology. We have not explicitly labeled the nodes and branches.

    (EPS)

    S2 Fig. Representatives of the sets $T_{3,1,1,0}$ (left), $U_{3,1,1,0}$ (center) and $S_{3,1,1,0}$ (right).

    In all of our illustrations of labeled or unlabeled topologies, the admixture edges in $E_A$ are marked with a dashed line. Here $|S_{3,1,1,0}| = 2$, $|U_{3,1,1,0}| = 4$ and $|T_{3,1,1,0}| = N(3,1,1,0) = 12$.

    (EPS)

    S3 Fig. Illustration of a shape, $S_1$ (left), and the three unlabeled topologies corresponding to a shape $S_2$ ($S_2$ not explicitly drawn).

    We have $S_1, S_2 \in S_{2,0,3,1}$, $|U_{S_1}| = 2$ and $U_{S_2} = \{U_1, U_2, U_3\}$. Furthermore, the leaves of $U_1$ and $U_2$ are indistinguishable, while the leaves of $U_3$ can be told apart. To see this, follow the path from the leaves to the root; in $U_1$ and $U_2$ the path depends only on whether the parent branch of the first encountered admixture is in $E_M$ or $E_A$, not on the starting leaf. In contrast, the starting leaf does matter for $U_3$, so the leaves are distinguishable. Hence, $|T_{U_1}| = |T_{U_2}| = 1$ but $|T_{U_3}| = 2$.

    (EPS)

    S4 Fig. The two shapes of S(5,2,0,0), denoted $S_1$ and $S_2$, are illustrated above.

    Here, $U_{S_1} = \{U_1\}$ and $U_{S_2} = \{U_2\}$, because there are no admixture edges. Interestingly, the shape $S_2$ exhibits more symmetry than the shape $S_1$. To see this, let $G_1$ and $G_2$ be representatives of $U_1$ and $U_2$ with leaves labeled $l_1, l_2, l_3, l_4, l_5$ from left to right. In both cases the leaves can be labeled in $5! = 120$ ways. The group $H_{G_1} = \langle e, (12), (34) \rangle$ has four elements and so $|T_{U_1}| = 120/4 = 30$. The group $H_{G_2} = \langle e, (12), (34), (13)(24) \rangle$ has eight elements and so $|T_{U_2}| = 120/8 = 15$. Altogether, $N(5,2,0,0) = 15 + 30 = 45$. Notice that the leaves $l_4$ and $l_5$ form a pair in one fifth of the elements of both $T_{U_1}$ and $T_{U_2}$, although the two sets are of different sizes.

    (EPS)
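The counting argument in S4 Fig can be checked numerically. The following is a minimal sketch, not code from AdmixtureBayes: it generates the two symmetry groups from the permutations listed in the caption and divides the 5! leaf labelings by each group's order. The helper names (`transpositions`, `compose`, `generate_group`) are hypothetical and introduced only for this illustration.

```python
from math import factorial

def transpositions(n, *pairs):
    """Permutation of range(n) that swaps each 1-indexed pair, e.g. (1, 2)."""
    perm = list(range(n))
    for a, b in pairs:
        perm[a - 1], perm[b - 1] = b - 1, a - 1
    return tuple(perm)

def compose(p, q):
    """Return the permutation that applies q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

def generate_group(generators, n):
    """Close the generators under composition, starting from the identity."""
    identity = tuple(range(n))
    group = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for h in generators:
            product = compose(h, g)
            if product not in group:
                group.add(product)
                frontier.append(product)
    return group

n_leaves = 5
# Symmetries of G1: swap l1/l2 and swap l3/l4 (as listed in the caption).
h_g1 = generate_group(
    [transpositions(5, (1, 2)), transpositions(5, (3, 4))], n_leaves
)
# Symmetries of G2: the same swaps plus exchanging the two pairs, (13)(24).
h_g2 = generate_group(
    [
        transpositions(5, (1, 2)),
        transpositions(5, (3, 4)),
        transpositions(5, (1, 3), (2, 4)),
    ],
    n_leaves,
)

labelings_1 = factorial(n_leaves) // len(h_g1)  # 120 / |H_G1|
labelings_2 = factorial(n_leaves) // len(h_g2)  # 120 / |H_G2|
total = labelings_1 + labelings_2
```

Because the groups are finite, repeatedly multiplying by the generators is enough to reach every element, so no explicit inverses are needed.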

    S5 Fig. Example graphs and their predecessors from each sub case 1.1)—2.4).

    The graph ρ(G1.2) is the only labeled admixture graph that doesn’t have a predecessor, and the ultimate predecessor of every other graph.

    (EPS)

    S6 Fig. From the posterior AdmixtureBayes samples, we computed the posterior probability of all nodes.

    The above graph is the smallest directed graph with all the nodes that have a posterior probability higher than 75%. Each internal node is colored according to its posterior probability, as described in Fig 3.

    (EPS)

    S7 Fig. From the posterior AdmixtureBayes sample, we computed the posterior probability of all minimal topologies for several subsets of the populations.

    Here we show the three topologies with the highest posterior. The listed posterior for each graph represents the percentage of sampled graphs that have this minimal topology induced by the relevant set of nodes. For example, in 96% of graphs sampled, if only the leaves Athabascan, Koryak, and Saqqaq are considered, then all non-root and non-leaf nodes have a topology set of {Athabascan, Koryak}. The percentages in each node are the percentage of sampled graphs that have a node with the topology set implied by that node. For example, in 96% of graphs sampled, if only the leaves Athabascan, Koryak, Saqqaq are considered, then there is at least one node with the topology set {Athabascan, Koryak}. Whether a graph with this node belongs in the top left or bottom left box will depend on the presence or absence of a node with the topology set {Athabascan, Saqqaq}, which happens in 1% of all sampled graphs.

    (EPS)
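The node percentages described in S7 Fig are frequencies over sampled graphs. The sketch below shows one way such a frequency could be computed, under stated assumptions: each sampled graph is represented only by its topology set (a collection of leaf sets, one per node), restriction to a subset of populations is done by intersecting each leaf set with that subset, and the four toy samples with leaves A to D are invented for illustration. The functions `restrict` and `node_posteriors` are hypothetical and are not part of AdmixtureBayes.

```python
from collections import Counter

def restrict(topology_set, leaves):
    """Intersect every node's leaf set with `leaves`, dropping empty results."""
    keep = frozenset(leaves)
    return frozenset(s & keep for s in topology_set if s & keep)

def node_posteriors(sampled_topology_sets, leaves):
    """Fraction of sampled graphs that contain each restricted leaf set."""
    counts = Counter()
    for topology_set in sampled_topology_sets:
        # restrict() returns a set, so each leaf set counts once per graph.
        counts.update(restrict(topology_set, leaves))
    total = len(sampled_topology_sets)
    return {leaf_set: c / total for leaf_set, c in counts.items()}

# Toy posterior sample: four graphs, each given by its topology set.
samples = [
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("AB")},
    {frozenset("AC")},
    {frozenset("AB"), frozenset("CD")},
]

posteriors = node_posteriors(samples, leaves="ABC")
```

In this toy sample, a node with the leaf set {A, B} appears in three of the four restricted graphs, so it would be reported at 75%.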

    S8 Fig. Continuation of S7 Fig.

    (EPS)

    S9 Fig. The method used to calculate the Set Distance between two admixture graph topologies (left).

    First, the topologies are transformed into their descendant sets/topology sets (middle). The distance is then calculated as the symmetric set distance between the two topology sets (right).

    (TIF)
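The two steps in S9 Fig can be sketched in a few lines. This is an illustrative reading of the caption rather than the authors' implementation: it assumes a graph is stored as a mapping from each node to its list of children, that the topology set contains the descendant leaf set of every node other than the root and the leaves, and that the distance is the size of the symmetric difference. The function names and the example graphs are hypothetical.

```python
def descendant_leaf_sets(children):
    """Map each node of a DAG (parent -> list of children) to its descendant leaves."""
    memo = {}

    def leaves_below(node):
        if node not in memo:
            kids = children.get(node, [])
            if not kids:
                memo[node] = frozenset([node])  # a leaf descends only to itself
            else:
                memo[node] = frozenset().union(*(leaves_below(k) for k in kids))
        return memo[node]

    nodes = set(children) | {k for kids in children.values() for k in kids}
    return {node: leaves_below(node) for node in nodes}

def topology_set(children, root):
    """Leaf sets of all internal, non-root nodes (assumed definition)."""
    sets = descendant_leaf_sets(children)
    return {s for node, s in sets.items() if node != root and children.get(node)}

def set_distance(graph_a, graph_b):
    """Size of the symmetric difference between two topology sets."""
    return len(topology_set(*graph_a) ^ topology_set(*graph_b))

# ((A,B),C) and ((A,C),B) as plain trees.
tree_ab = ({"r": ["n1", "C"], "n1": ["A", "B"]}, "r")
tree_ac = ({"r": ["n1", "B"], "n1": ["A", "C"]}, "r")
# B is admixed between the lineages leading to A and C.
admixed = ({"r": ["n1", "n2"], "n1": ["A", "x"], "n2": ["x", "C"], "x": ["B"]}, "r")

d_trees = set_distance(tree_ab, tree_ac)
d_admixed = set_distance(tree_ab, admixed)
```

Here the two trees differ in one leaf set each, giving a distance of 2. The admixed graph shares {A, B} with the first tree but adds {B, C} and {B}, which also gives 2.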

    S10 Fig. Examples of how the minimal topology is calculated.

    First, we derive the topology set (middle) from the topology (left). The minimal topology (right) is the smallest possible graph that is consistent with the topology set. Note, node labels assigned to the topology (left) are arbitrary and do not identify corresponding nodes in the minimal topology (right).

    (TIF)

    S11 Fig. Here, we plot the trace plots for our simulated dataset.

    Each chain is shown as a separate column. Each summary statistic is shown as a separate row.

    (EPS)

    S12 Fig. We plot the Gelman-Rubin convergence diagnostics on our simulated dataset for our three summary statistics after a burn-in fraction of 0.35.

    A rapid convergence to 1 indicates that this is a sufficient burn-in period.

    (EPS)
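For readers unfamiliar with the diagnostic in S12 Fig, the following sketch computes the classic Gelman-Rubin potential scale reduction factor for one summary statistic after discarding a burn-in fraction of 0.35. It assumes equal-length chains and uses the textbook formula; the exact variant used to produce S12 Fig is not stated in this section, and the simulated chains below are synthetic stand-ins, not AdmixtureBayes output.

```python
import random
from statistics import mean, variance

def drop_burn_in(chain, fraction=0.35):
    """Discard the first `fraction` of a chain."""
    return chain[int(len(chain) * fraction):]

def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])   # W: mean within-chain variance
    between = n * variance(chain_means)            # B: between-chain variance
    pooled = (n - 1) / n * within + between / n    # pooled variance estimate
    return (pooled / within) ** 0.5

rng = random.Random(0)
# Three chains drawn from the same distribution: should look converged.
mixed = [[rng.gauss(0, 1) for _ in range(2000)] for _ in range(3)]
# Two chains stuck around different values: should look unconverged.
stuck = [[rng.gauss(mu, 1) for _ in range(2000)] for mu in (0, 5)]

r_mixed = gelman_rubin([drop_burn_in(c) for c in mixed])
r_stuck = gelman_rubin([drop_burn_in(c) for c in stuck])
```

Values near 1 indicate that the between-chain variance is small relative to the within-chain variance, which is the behavior the caption describes as evidence that the burn-in period is sufficient.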

    S13 Fig. We here show the autocorrelation plots for the summary statistics of our simulated data after a burn-in fraction of 0.35.

    We only show the results for Chain 1 and do not include the number of admixture events as the autocorrelation shows strange behavior for discrete variables.

    (EPS)

    S14 Fig. We simulated admixture graphs with 10 leaves and 0, 1 and 2 admixture events.

    Using these graphs, we simulated datasets using ms with different sample sizes. The top plot illustrates the ratio between the maximum likelihood degrees of freedom estimate from Eq (6) and the variance estimator in Eq (7). We ran AdmixtureBayes with the maximum likelihood estimate (MLE), the variance estimate (VAR), and 2 and 4 times the variance estimate (VARx2 and VARx4 respectively). We calculated the Mean Topology Equality, which was maximized when using the VAR estimates.

    (EPS)

    S15 Fig. We here plot the results of our simulations evaluating the effect of sampling small numbers of haplotypes.

    Each boxplot represents the samples of posterior probabilities obtained by running AdmixtureBayes on 100 different datasets, each simulated from the same model with the underlying population history ((pop1,pop2),pop3). We plot a horizontal dashed line at 1/3, which would represent equal posterior probability at each topology. AdmixtureBayes does sample some admixture graphs that do have admixture events, but the 3 non-admixed topologies we list here account for more than 99% of the sampled admixture graphs in each case, so we simply ignore sampled topologies that do have admixture events. We see that increasing the number of sampled haplotypes causes the posterior probability to concentrate on the true graph, while AdmixtureBayes correctly models the uncertainty inherent to sampling fewer haplotypes.

    (EPS)

    S16 Fig. Here we plot the results obtained when running TreeMix and OrientAGraph on our dataset of Native American and Arctic populations.

    We ran each method with 3, 4, and 5 admixture events, as these were the numbers of admixture events in nearly all graphs sampled by AdmixtureBayes after the burn-in period (see S11 Fig).

    (TIF)

    S1 Text. Supplementary information.

    (PDF)

    Attachment

    Submitted filename: report.pdf

    Attachment

    Submitted filename: Admixturebayes Reviews.pdf

    Data Availability Statement

    All data are fully available without restriction at https://github.com/avaughn271/AdmixtureBayes.


    Articles from PLOS Genetics are provided here courtesy of PLOS
