Estimating the timing of multiple admixture events using 3-locus linkage disequilibrium

Mason Liang; Mikhail Shishkin; Anastasia Mikhailova; Vladimir Shchur; Rasmus Nielsen

doi:10.1371/journal.pgen.1010281

PLoS Genet. 2022 Jul 15;18(7):e1010281.

Editor: Garrett Hellenthal

PMCID: PMC9342778 PMID: 35839249

Abstract

Estimating admixture histories is crucial for understanding the genetic diversity we see in present-day populations. Allele frequency or phylogeny-based methods are excellent for inferring the existence of admixture or its proportions. However, to estimate admixture times, spatial information from admixed chromosomes is needed, in the form of either local ancestry or the decay of admixture linkage disequilibrium (ALD). One popular method, implemented in the programs ALDER and ROLLOFF, uses two-locus ALD to infer the time of a single admixture event, but with this summary statistic it can only estimate the time of the most recent admixture event. To address this limitation, we derive analytical expressions for the expected ALD in a three-locus system and provide a new statistical method based on these results that is able to resolve more complicated admixture histories. Using simulations, we evaluate the performance of this method on a range of different admixture histories. As an example, we apply the method to the Colombian and Mexican samples from the 1000 Genomes project. The implementation of our method is available at https://github.com/Genomics-HSE/LaNeta.

Author summary

We establish a theoretical framework to model 3-locus admixture linkage disequilibrium of an admixed population, taking into account the effects of genetic drift, migration and recombination. The theory is used to develop a method for estimating the times of multiple admixture events. We demonstrate the accuracy of the method on simulated data and we apply it to previously published data from Mexican and Colombian populations to explore the complex history of American populations in the post-Columbian period.

Introduction

There are many methods for inferring the presence of admixture, e.g. methods using simple summary statistics detecting deviations from phylogenetic symmetry [1–3] and methods estimating admixture proportions using programs such as Structure [4], Admixture [5] or RFmix [6]. There has also been substantial research on estimating admixture times. Some approaches are based on inferring admixture tract length distributions, such as [7–12]. Over time, recombination is expected to decrease the average lengths of admixture tracts. The length distribution of admixture tracts is therefore informative about the time since admixture. Much of the theory relating to tract lengths is based on Fisher’s famous theory of junctions [13] and subsequent work, such as [14–23]. For example, [24] first discussed the length distribution of tracts descended from a single ancestor. These results informed later analyses of admixture tract length distribution, such as references [7–9]. Gravel [8] also implemented the software program TRACTS, which estimates admixture histories by fitting the tract length distribution, obtained by local ancestry inference, to an exponential approximation.

Another approach, which we will follow in this paper, is based on the decay of admixture linkage disequilibrium (ALD). Linkage disequilibrium exists in any natural population due to mutation and genetic drift. However, in well-mixed and genetically isolated populations with recombination it usually decays quite rapidly at a genomic scale. For example, in many human populations linkage disequilibrium decays to approximately zero in less than 1 Mb. However, admixture tracts introduced into a population in an admixture event generate ALD over much longer distances, even if the amount of LD in the source populations is negligible. After a single admixture event, linkage disequilibrium in the admixed population will then gradually decrease in the subsequent generations as a result of recombination. It is, therefore, possible to make inferences about the admixture history of a population from the patterns of LD present in the population. This insight was first used in the program ROLLOFF [25] and was later extended by ALDER [26]. These two methods use the fact that if an admixed population takes in no additional migrants after the founding generation, the ALD present in the population is expected to decay approximately exponentially as a function of distance. The rate constant of this exponential decay is proportional to the age of the founding admixture pulse and can be used as an estimator. ROLLOFF and ALDER are well suited for inferring the time of the admixture event when the admixture history of the population can be approximated as a single pulse. However, in many realistic scenarios the admixture histories involve multiple pulses. Prominent examples in humans include Native American admixture in Rapa Nui [27] or admixed population groups in the Americas [28]. In these instances the expected decay of LD will become a mixture of exponentials. Existing dating methods based on ALD can usually only infer the date of the most recent migration wave [25], or reject the hypothesis of a single pulse admixture [26].

ROLLOFF and ALDER use the information contained in pairs of sites by examining the two-locus linkage disequilibrium between them. Here we extend the theory underlying the methods in ROLLOFF and ALDER to three loci by considering three-locus LD. There are two ways of measuring the linkage between n loci. Bennett [29] defines n-locus linkage in a way that maintains a geometric decrease of LD each generation as a result of recombination, which is an important property of two-locus linkage disequilibrium. Slatkin [30] defines n-locus LD to be the n-way covariance, analogously to the property of two-locus LD as the covariance in allele frequency between pairs of loci. For two and three loci, these two definitions coincide, but for four or more loci, they do not. Another method, GLOBETROTTER [31], uses a copying model in a way similar to ROLLOFF. First, shared haplotype chunks are inferred with CHROMOPAINTER [32], then a mixture of exponentials is used to fit the coancestry curve.

In this paper, we will use Bennett and Slatkin’s definition of three-locus LD to examine the decay of ALD for three sites as a function of the genetic distance between them. We derive an equation that describes the decay of three-locus LD under an admixture history with multiple waves of migration from two source populations. We then compare the results of coalescent simulations to this equation, and develop some guidelines for when admixture histories more complex than a single pulse can be resolved using ALD. Finally, we apply our method to the Colombian and Mexican samples in the 1000 Genomes data set, using the Yoruba samples as a reference. Fitting a two-pulse model to data, we estimate admixture histories for the two populations which are qualitatively consistent with the results reported in [28].

Description of the method

Model

We use a random union of gametes admixture model as described in [33], which is an extension of the mechanistic admixture model formulated by [34]. In this model, two or more source populations contribute migrants to form an admixed population consisting of 2N haploid individuals. Each generation in the admixed population is formed through the recombination of randomly selected individuals from the previous generation, with some individuals potentially replaced by migrants from the source populations. For simplicity, we consider a model with only two source populations. Furthermore, the first source population only contributes migrants in the founding generation, T. The second source population contributes migrants in the founding generation and possibly in one or more generations thereafter. In generation i, for i = T − 1, …, 0 (before the present), a fraction m_i of the admixed population is replaced by individuals from the second source population.
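As a minimal illustration of this migration schedule (a sketch, not part of the authors' implementation), the expected fraction of first-source ancestry at the present can be tracked by multiplying the fractions that escape replacement in each generation. The function name and the dictionary representation of the schedule are assumptions made for this example.

```python
def first_source_fraction(m):
    """Expected fraction of first-source ancestry at the present.

    m maps a generation (counted backwards from the present) to the
    fraction m_i replaced by second-source migrants in that generation.
    The founding generation T is the largest key; generations that are
    absent from the dictionary receive no migrants.
    """
    fraction = 1.0
    for generation in sorted(m, reverse=True):  # from T down to 0
        fraction *= 1.0 - m[generation]
    return fraction


# Two pulses: M1 = 0.3 at the founding generation T = 12, M2 = 0.2 at generation 2.
schedule = {12: 0.3, 2: 0.2}
p_first = first_source_fraction(schedule)
p_second = 1.0 - p_first
```

For two pulses the second-source fraction computed this way equals M₁(1 − M₂) + M₂.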

Linkage disequilibrium and local ancestry

ROLLOFF and ALDER use the standard two-locus measure of LD between a SNP at position x and another SNP at position y, which is a genetic distance d to the right,

$$D_{2}(d) = \operatorname{cov}(H_{x}, H_{y}), \tag{1}$$

where H_x and H_y represent the haplotypes or genotypes of an admixed chromosome at positions x and y. In the case of haplotype data, H_i,x = 1 if the i^th sample is carrying the derived allele at the SNP at position x, and is otherwise 0. Alternatively, for genotype data, H_i,x takes on values from {0, 1/2, 1} depending on the number of copies of the derived allele the i^th sample is carrying at SNP position x. We consider an additional site at position z, which is located a further genetic distance d′ to the right of y. The three-locus LD, as defined by [29] and [30], is given by

$$D_{3}(d, d') = \operatorname{cov}(H_{x}, H_{y}, H_{z}) = E[(H_{x} - E H_{x})(H_{y} - E H_{y})(H_{z} - E H_{z})]. \tag{2}$$
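As a small illustration of Eqs 1 and 2, the following sketch computes plug-in versions of the two- and three-locus LD from 0/1 haplotype columns. The data are made up and the function names are hypothetical; the example is chosen so that every pairwise LD vanishes while the three-locus LD does not.

```python
def mean(values):
    return sum(values) / len(values)


def d2(hx, hy):
    """Plug-in two-locus LD: the covariance of two allele indicator columns."""
    mx, my = mean(hx), mean(hy)
    return mean([(a - mx) * (b - my) for a, b in zip(hx, hy)])


def d3(hx, hy, hz):
    """Plug-in three-locus LD: the third-order central cross-moment of Eq 2."""
    mx, my, mz = mean(hx), mean(hy), mean(hz)
    return mean([(a - mx) * (b - my) * (c - mz) for a, b, c in zip(hx, hy, hz)])


# Four haplotypes; each list holds the alleles at one site (x, y or z).
hx = [1, 1, 0, 0]
hy = [1, 0, 1, 0]
hz = [1, 0, 0, 1]

pairwise = (d2(hx, hy), d2(hx, hz), d2(hy, hz))
three_way = d3(hx, hy, hz)
```

Here all three entries of `pairwise` are zero, whereas `three_way` is 0.125, so the three-locus statistic can carry signal that no pair of sites shows.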

The LD in an admixed population depends on the genetic differentiation between the source populations and on its admixture history. Let A_x represent the local ancestry at position x, with A_x = 0 if x is inherited from an ancestor in the second source population (the one which contributed in two admixture events), and A_x = 1 if x is inherited from the first source population (the one which contributed in a single admixture event). We can compute D₃ in terms of the three-point covariance function of A_x and so separate out the effects of allele frequencies and local ancestry. Consider the conditional expectation $E(H_{x} \mid A_{x}) = g_{x} + \delta_{x} A_{x}$, where g_x is the allele frequency of locus x in the second source population and δ_x = f_x − g_x is the difference of the allele frequencies of locus x in the two source populations. We now make the assumption that the allele frequencies in the source populations are known and fixed. Our goal is to prove that

$$D_{3}(d, d') = \operatorname{cov}(H_{x}, H_{y}, H_{z}) = \delta_{x} \delta_{y} \delta_{z} \operatorname{cov}(A_{x}, A_{y}, A_{z}). \tag{3}$$

By taking expectations, we obtain

$$E(H_{x}) = E(E(H_{x} \mid A_{x})) = g_{x} + \delta_{x} E(A_{x}).$$

Consider an arbitrary number N of sites S₁, …, S_N. We assume that, conditional on the local ancestries, the alleles at different sites are independent, and that $E(H_{S_{i}} \mid A_{S_{1}}, \dots, A_{S_{N}}) = E(H_{S_{i}} \mid A_{S_{i}})$ for any i. Then we have

$$\begin{aligned} E\Big[\prod_{i=1}^{N} H_{S_{i}}\Big] &= E\Big[E\Big[\prod_{i=1}^{N} H_{S_{i}} \,\Big|\, A_{S_{1}}, \dots, A_{S_{N}}\Big]\Big] \\ &= E[P(H_{S_{1}} = 1, \dots, H_{S_{N}} = 1 \mid A_{S_{1}}, \dots, A_{S_{N}})] \\ &= E\Big[\prod_{i=1}^{N} P(H_{S_{i}} = 1 \mid A_{S_{i}})\Big] = E\Big[\prod_{i=1}^{N} (g_{S_{i}} + \delta_{S_{i}} A_{S_{i}})\Big]. \end{aligned} \tag{4}$$

Hence, we conclude that

$$\begin{aligned} \operatorname{cov}(H_{S_{1}}, \dots, H_{S_{N}}) &= \operatorname{cov}(g_{S_{1}} + \delta_{S_{1}} A_{S_{1}}, \dots, g_{S_{N}} + \delta_{S_{N}} A_{S_{N}}) \\ &= \operatorname{cov}(A_{S_{1}}, \dots, A_{S_{N}}) \prod_{i=1}^{N} \delta_{S_{i}}. \end{aligned} \tag{5}$$

In particular, we obtain Eq 3 with N = 3.
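The factorisation in Eq 3 can be checked by exact enumeration on a toy example. The sketch below assumes, as in the derivation, that alleles are drawn independently given local ancestry with P(H = 1 | A) = g + δA. The ancestry distribution, the allele frequencies and the function names are all made up for illustration.

```python
from itertools import product


def third_central_moment(pmf):
    """Third-order central cross-moment of a distribution on {0,1}^3.

    pmf maps each triple (s_x, s_y, s_z) to its probability.
    """
    means = [sum(p * s[k] for s, p in pmf.items()) for k in range(3)]
    return sum(
        p * (s[0] - means[0]) * (s[1] - means[1]) * (s[2] - means[2])
        for s, p in pmf.items()
    )


def haplotype_pmf(ancestry_pmf, g, delta):
    """Distribution of (H_x, H_y, H_z) when alleles are drawn independently
    given local ancestry, with P(H_k = 1 | A_k) = g_k + delta_k * A_k."""
    pmf = {h: 0.0 for h in product((0, 1), repeat=3)}
    for a, pa in ancestry_pmf.items():
        for h in pmf:
            prob = pa
            for k in range(3):
                p1 = g[k] + delta[k] * a[k]
                prob *= p1 if h[k] == 1 else 1.0 - p1
            pmf[h] += prob
    return pmf


# An arbitrary (made-up) joint distribution of local ancestry at three sites.
weights = [5, 1, 2, 1, 1, 2, 1, 7]
total = sum(weights)
ancestry_pmf = {
    a: w / total for a, w in zip(product((0, 1), repeat=3), weights)
}
g = (0.10, 0.60, 0.35)        # second-source allele frequencies (made up)
delta = (0.50, -0.40, 0.30)   # f - g at each site (made up)

lhs = third_central_moment(haplotype_pmf(ancestry_pmf, g, delta))
rhs = delta[0] * delta[1] * delta[2] * third_central_moment(ancestry_pmf)
```

The two sides, `lhs` and `rhs`, agree up to floating-point rounding, whatever ancestry distribution is supplied.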

Local ancestry covariance functions

From the above section we see that we can describe the three-point admixture LD in terms of covariances of local ancestry at the three points. We now expand the covariance in Eq 2 into its component expectations to get

$$\begin{aligned} \operatorname{cov}(A_{x}, A_{y}, A_{z}) = {} & E[A_{x} A_{y} A_{z}] - E[A_{x} A_{y}] E[A_{z}] \\ & - E[A_{x} A_{z}] E[A_{y}] - E[A_{y} A_{z}] E[A_{x}] + 2 E[A_{x}] E[A_{y}] E[A_{z}]. \end{aligned}$$

Each one of these expectations on the right-hand side is the probability that one or more sites is inherited from an ancestor from the first source population. We organize these products of probabilities in a column vector:

$$v_{3} = \begin{pmatrix} P\{A_{x} = A_{y} = A_{z} = 1\} \\ P\{A_{y} = A_{z} = 1\} P\{A_{x} = 1\} \\ P\{A_{x} = A_{z} = 1\} P\{A_{y} = 1\} \\ P\{A_{x} = A_{y} = 1\} P\{A_{z} = 1\} \\ P\{A_{x} = 1\} P\{A_{y} = 1\} P\{A_{z} = 1\} \end{pmatrix},$$

so that cov(A_x, A_y, A_z) = (1, −1, −1, −1, 2)v₃. There is one entry in v₃ for each of the five ways in which the three markers at positions x, y, and z can be arranged on one or more chromosomes. In the founding generation T, this column vector is given by v₃(T) = (1 − m_T, (1 − m_T)², (1 − m_T)², (1 − m_T)², (1 − m_T)³)′. The probabilities for subsequent generations can be found by left-multiplying by recombination, drift, and migration matrices:

$$v_{3}(i) = D_{i} L U v_{3}(i + 1), \qquad i = T - 1, \dots, 0.$$

The matrices D_i, L, and U account for the effects of migration, drift, and recombination, respectively. The migration matrix is a diagonal matrix given by

$$D_{i} = \operatorname{diag}\big(1 - m_{i}, (1 - m_{i})^{2}, (1 - m_{i})^{2}, (1 - m_{i})^{2}, (1 - m_{i})^{3}\big).$$

Its entries are the probabilities that one, two, or three chromosomes in the admixed population will not be replaced by chromosomes from the second source population in generation i. The lower triangular drift matrix

$$L = \frac{1}{4N^{2}} \begin{pmatrix} 4N^{2} & 0 & 0 & 0 & 0 \\ 2N & 2N(2N - 1) & 0 & 0 & 0 \\ 2N & 0 & 2N(2N - 1) & 0 & 0 \\ 2N & 0 & 0 & 2N(2N - 1) & 0 \\ 1 & 2N - 1 & 2N - 1 & 2N - 1 & (2N - 1)(2N - 2) \end{pmatrix}$$

gives the standard Wright-Fisher drift transition probabilities between the states as a function of the population size 2N. Finally, the upper triangular recombination matrix is determined by the recombination rates between the three sites:

$$U = \begin{pmatrix} e^{-d - d'} & (1 - e^{-d}) e^{-d'} & (1 - e^{-d})(1 - e^{-d'}) & e^{-d}(1 - e^{-d'}) & 0 \\ 0 & e^{-d'} & 0 & 0 & 1 - e^{-d'} \\ 0 & 0 & 1 - e^{-d} - e^{-d'} + 2e^{-d - d'} & 0 & e^{-d} + e^{-d'} - 2e^{-d - d'} \\ 0 & 0 & 0 & e^{-d} & 1 - e^{-d} \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$$

The covariance function is then given by

$$\operatorname{cov}(A_{x}, A_{y}, A_{z}) = (1, -1, -1, -1, 2) \Big(\prod_{i=0}^{T-1} D_{i} L U\Big) v_{3}(T). \tag{6}$$
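The recursion behind Eq 6 can be sketched with plain Python lists. This is an illustration under the definitions above, not the LaNeta implementation; the function names and the list representation of the migration schedule (index i holds m_i, the last entry is the founding fraction m_T) are assumptions.

```python
import math


def drift_matrix(N):
    """Wright-Fisher drift matrix L for a population of 2N haploid individuals."""
    n = 2 * N
    rows = [
        [n * n, 0, 0, 0, 0],
        [n, n * (n - 1), 0, 0, 0],
        [n, 0, n * (n - 1), 0, 0],
        [n, 0, 0, n * (n - 1), 0],
        [1, n - 1, n - 1, n - 1, (n - 1) * (n - 2)],
    ]
    return [[entry / (n * n) for entry in row] for row in rows]


def recombination_matrix(d, dp):
    """Recombination matrix U for genetic distances d and d' (in Morgans)."""
    a, b = math.exp(-d), math.exp(-dp)
    return [
        [a * b, (1 - a) * b, (1 - a) * (1 - b), a * (1 - b), 0.0],
        [0.0, b, 0.0, 0.0, 1 - b],
        [0.0, 0.0, 1 - a - b + 2 * a * b, 0.0, a + b - 2 * a * b],
        [0.0, 0.0, 0.0, a, 1 - a],
        [0.0, 0.0, 0.0, 0.0, 1.0],
    ]


def migration_diagonal(m):
    """Diagonal of the migration matrix D_i for a migration fraction m_i."""
    s = 1.0 - m
    return [s, s * s, s * s, s * s, s ** 3]


def mat_vec(M, v):
    return [sum(M[r][c] * v[c] for c in range(5)) for r in range(5)]


def cov3_ancestry(d, dp, N, m):
    """Evaluate cov(A_x, A_y, A_z) by iterating v <- D_i L U v.

    m[i] is the migration fraction in generation i before the present,
    so m[-1] is the founding fraction m_T and m[0] belongs to generation 0.
    """
    T = len(m) - 1
    v = migration_diagonal(m[T])            # founding vector v_3(T)
    L, U = drift_matrix(N), recombination_matrix(d, dp)
    for i in range(T - 1, -1, -1):          # generations T-1, ..., 0
        v = mat_vec(U, v)                   # recombination first
        v = mat_vec(L, v)                   # then drift
        v = [x * y for x, y in zip(migration_diagonal(m[i]), v)]  # then migration
    w = (1, -1, -1, -1, 2)
    return sum(wk * vk for wk, vk in zip(w, v))


# Two pulses: m_T = 0.3 with T = 12, and m_2 = 0.2; 2N = 1000.
m = [0.0] * 13
m[12], m[2] = 0.3, 0.2
example = cov3_ancestry(0.05, 0.10, 500, m)
```

Each row of L and U sums to one, and (1, −1, −1, −1, 2) is a left eigenvector of both matrices, which is what makes closed-form expressions possible in special cases.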

We can obtain an analogous equation for cov(A_x, A_y), involving the migration, drift, and recombination matrices for two loci:

$$\operatorname{cov}(A_{x}, A_{y}) = (1, -1) \Big(\prod_{i=0}^{T-1} D_{i} L U\Big) v_{2}(T).$$

In some cases, Eq 6 simplifies further. In a one-pulse migration model, in which the admixture proportion in the founding generation is m_T = M and is thereafter 0, the D_i’s become identity matrices, and we get the closed-form expression

$$\operatorname{cov}(A_{x}, A_{y}, A_{z}) = -M(1 - M)(1 - 2M) \Big(1 - \frac{1}{2N}\Big)^{T} \Big(1 - \frac{2}{2N}\Big)^{T} e^{-T(d + d')}.$$

This is because (1, −1, −1, −1, 2) is a left eigenvector of both L and U, with corresponding eigenvalues (1 − 1/2N)(1 − 2/2N) and exp(−d − d′), and because its product with the founding vector is (1 − M) − 3(1 − M)² + 2(1 − M)³ = −M(1 − M)(1 − 2M). The sign agrees with Eq 7 when M₂ = 0. Note that the covariance function is identically 0 when M = 0.5, as well as in the trivial cases M = 0 and M = 1. Another case is a two-pulse model in which we ignore the effects of genetic drift. In this model, admixture only occurs T and T₂ generations before the present, so that $m_{T} = M_{1}$, $m_{T_{2}} = M_{2}$, and all other m_i’s are 0. Making the substitution T₁ = T − T₂, the right-hand side of Eq 6 becomes

$$\begin{aligned} -(1 - M_{1})(1 - M_{2}) e^{-T_{2}(d + d')} \Big[ & M_{2}(1 - M_{1})^{2} - 2M_{2}^{2}(1 - M_{1})^{2} + M_{1}(1 - 2M_{1}) e^{-T_{1}(d + d')} \\ & - M_{1} M_{2}(1 - M_{1}) \Big(e^{-T_{1} d} + e^{-T_{1} d'} + (1 - e^{-d} - e^{-d'} + 2e^{-d - d'})^{T_{1}}\Big) \Big]. \end{aligned} \tag{7}$$

The corresponding expression for the two-point covariance function is given by

$$(1 - M_{1})(1 - M_{2}) e^{-T_{2} d} \big(M_{2} - M_{1} M_{2} + M_{1} e^{-T_{1} d}\big), \tag{8}$$

which is a mixture of two exponentials.
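Eqs 7 and 8 can be checked numerically against the drift-free recursion. The sketch below is an illustration only; the function names are hypothetical, and the recursion applies T₁ generations of recombination, then the second pulse, then T₂ further generations of recombination.

```python
import math


def eq7(d, dp, T1, T2, M1, M2):
    """Three-point ancestry covariance for two pulses without drift (Eq 7)."""
    a, b = math.exp(-d), math.exp(-dp)
    c = 1 - a - b + 2 * a * b
    bracket = (
        M2 * (1 - M1) ** 2
        - 2 * M2 ** 2 * (1 - M1) ** 2
        + M1 * (1 - 2 * M1) * math.exp(-T1 * (d + dp))
        - M1 * M2 * (1 - M1) * (a ** T1 + b ** T1 + c ** T1)
    )
    return -(1 - M1) * (1 - M2) * math.exp(-T2 * (d + dp)) * bracket


def eq8(d, T1, T2, M1, M2):
    """Two-point ancestry covariance for two pulses without drift (Eq 8)."""
    return (
        (1 - M1) * (1 - M2) * math.exp(-T2 * d)
        * (M2 - M1 * M2 + M1 * math.exp(-T1 * d))
    )


def apply_u3(v, a, b):
    """One generation of three-locus recombination, v <- U v."""
    c = 1 - a - b + 2 * a * b
    return [
        a * b * v[0] + (1 - a) * b * v[1]
        + (1 - a) * (1 - b) * v[2] + a * (1 - b) * v[3],
        b * v[1] + (1 - b) * v[4],
        c * v[2] + (1 - c) * v[4],
        a * v[3] + (1 - a) * v[4],
        v[4],
    ]


def recursion3(d, dp, T1, T2, M1, M2):
    """Same quantity as eq7, obtained by iterating v <- D_i U v."""
    a, b = math.exp(-d), math.exp(-dp)
    p, q = 1 - M1, 1 - M2
    v = [p, p * p, p * p, p * p, p ** 3]     # founding generation T = T1 + T2
    for _ in range(T1):                      # recombination down to generation T2
        v = apply_u3(v, a, b)
    v = [q * v[0], q * q * v[1], q * q * v[2], q * q * v[3], q ** 3 * v[4]]
    for _ in range(T2):                      # recombination down to the present
        v = apply_u3(v, a, b)
    return v[0] - v[1] - v[2] - v[3] + 2 * v[4]


def recursion2(d, T1, T2, M1, M2):
    """Two-locus analogue with states 'same chromosome' and 'different chromosomes'."""
    a = math.exp(-d)
    p, q = 1 - M1, 1 - M2
    v = [p, p * p]
    for _ in range(T1):
        v = [a * v[0] + (1 - a) * v[1], v[1]]
    v = [q * v[0], q * q * v[1]]
    for _ in range(T2):
        v = [a * v[0] + (1 - a) * v[1], v[1]]
    return v[0] - v[1]


args = (0.05, 0.08, 10, 2, 0.3, 0.2)         # d, d', T1, T2, M1, M2
gap3 = abs(eq7(*args) - recursion3(*args))
gap2 = abs(eq8(0.05, 10, 2, 0.3, 0.2) - recursion2(0.05, 10, 2, 0.3, 0.2))
```

Both gaps are at the level of floating-point rounding, so the closed forms and the recursion describe the same surface.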

Weighted linkage disequilibrium

As [26] noted, we cannot use the LD in the admixed population directly, because the allele frequency differences in the source populations can be of either sign. As in [26], we solve this problem by computing the product of the values of the three-point linkage disequilibrium coefficient with the product of the allele frequency differences. Using Eq 3 we obtain

$$\delta_{x} \delta_{y} \delta_{z} D_{3}(d, d') = \delta_{x}^{2} \delta_{y}^{2} \delta_{z}^{2} E[\operatorname{cov}(A_{x}, A_{y}, A_{z})],$$

because the local ancestry in the admixed sample is independent of the allele frequencies in the admixed population. For inference purposes, we estimate this function by averaging over triples of SNPs which are separated by distances of approximately d and d′. The LD term is estimated from the admixed population, while the δ’s are estimated from reference populations which are closely related to the two source populations. We note that this approach, like previous approaches (e.g., [26]), does not take into account genetic drift in the source populations after the time of admixture; that is, both assume that the allele frequencies in the ancestral source populations are well approximated by the allele frequencies in the extant populations.

We arrange the data from the admixed samples in an n × S_n matrix H, where n is the number of admixed haplotypes/genotypes, and S_n is the number of markers in the sample. Similarly, we arrange the data from the two source populations into two matrices, F and G, which are of size n₁ × S_n and n₂ × S_n, where n₁ and n₂ are the numbers of samples from each of the source populations. For ease of notation, we assume that the positions are given in units which make the unit interval equal to the desired bin width.

For a given d and d′ the SNP triples we use in the estimator for the weighted LD are

$$S[d, d'] = \{(x, y, z) : d \leq y - x < d + 1 \text{ and } d' \leq z - y < d' + 1\}.$$

Let h_x be the empirical allele frequency in the admixed population. An estimator of the weighted three-point linkage disequilibrium coefficient is then

$$\hat{a}[d, d'] = \frac{1}{|S[d, d']|} \sum_{(x, y, z) \in S[d, d']} \frac{n \sum_{i=1}^{n} \hat{\delta}_{x} \hat{\delta}_{y} \hat{\delta}_{z} (H_{i,x} - h_{x})(H_{i,y} - h_{y})(H_{i,z} - h_{z})}{(n - 1)(n - 2)},$$

where

$$\hat{\delta}_{x} = \frac{\sum_{i=1}^{n_{1}} F_{i,x}}{n_{1}} - \frac{\sum_{i=1}^{n_{2}} G_{i,x}}{n_{2}},$$

and similarly for $\hat{\delta}_{y}$ and $\hat{\delta}_{z}$.
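For a single SNP triple, the summand of the estimator can be written out directly. The following sketch is illustrative; the function names and the list-of-rows data layout are assumptions, and the toy matrices are made up.

```python
def column_mean(matrix, col):
    return sum(row[col] for row in matrix) / len(matrix)


def weighted_d3_triple(H, F, G, x, y, z):
    """Weighted three-point LD for one SNP triple (columns x, y, z).

    H, F and G are lists of rows (one row per sample) holding the admixed,
    first-source and second-source data, coded as in the text.
    """
    n = len(H)
    if n < 3:
        raise ValueError("need at least three admixed samples")
    sites = (x, y, z)
    h = [column_mean(H, s) for s in sites]
    delta = [column_mean(F, s) - column_mean(G, s) for s in sites]
    total = 0.0
    for row in H:
        total += (row[x] - h[0]) * (row[y] - h[1]) * (row[z] - h[2])
    weight = delta[0] * delta[1] * delta[2]
    # n / ((n - 1)(n - 2)) is the small-sample factor of the third-moment estimator.
    return n * weight * total / ((n - 1) * (n - 2))


H = [[1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]   # admixed haplotypes
F = [[1, 1, 1], [1, 1, 1]]                         # first source: allele fixed
G = [[0, 0, 0], [0, 0, 0]]                         # second source: allele absent
value = weighted_d3_triple(H, F, G, 0, 1, 2)
```

Swapping the roles of F and G flips the sign of every δ̂ and therefore of the result, which is why the weights are needed before averaging over triples.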

Algorithm

Directly computing $\hat{a} [d, d^{'}]$ over the set d, d′ ∈ {0, 1, …, P}² would be cubic in the number of segregating sites. However, by using the fast Fourier Transform (FFT) technique introduced in ALDER [26], we can approximate $\hat{a}$ with an algorithm whose time complexity is instead linear in the number of segregating sites.

First, rearrange $\hat{a}$ to get

$$\hat{a}[d, d'] = \frac{n}{(n - 1)(n - 2)} \, \frac{\sum_{i=1}^{n} \sum_{(x, y, z) \in S[d, d']} \hat{\delta}_{x} \hat{\delta}_{y} \hat{\delta}_{z} (H_{i,x} - h_{x})(H_{i,y} - h_{y})(H_{i,z} - h_{z})}{\sum_{(x, y, z) \in S[d, d']} 1},$$

and define sequences b_i[d] and c[d] by binning the data and then doubling the length by padding with P zeros,

$$b_{i}[d] = \begin{cases} \sum_{x : d \leq \lfloor x \rfloor < d + 1} \hat{\delta}_{x} (H_{i,x} - h_{x}) & : 0 \leq d \leq P \\ 0 & : P < d \leq 2P \end{cases} \qquad c[d] = \begin{cases} |\{x : d \leq \lfloor x \rfloor < d + 1\}| & : 0 \leq d \leq P \\ 0 & : P < d \leq 2P \end{cases}$$
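The binning step can be sketched as follows. This is a minimal illustration, assuming positions are already expressed in units of the bin width; the function name, argument layout and toy data are hypothetical.

```python
import math


def bin_sequences(positions, delta_hat, H, h, P):
    """Build the zero-padded sequences b_i[d] and c[d].

    positions : genetic positions of the SNPs, in units of the bin width
    delta_hat : estimated allele frequency difference at each SNP
    H         : admixed data, one row per sample, one column per SNP
    h         : empirical allele frequency of each SNP in the admixed sample
    P         : index of the last bin; SNPs falling beyond it are skipped
    """
    length = 2 * P + 1                  # indices 0..P hold data, P+1..2P stay zero
    c = [0] * length
    b = [[0.0] * length for _ in H]
    for s, pos in enumerate(positions):
        d = math.floor(pos)
        if not 0 <= d <= P:
            continue
        c[d] += 1
        for i, row in enumerate(H):
            b[i][d] += delta_hat[s] * (row[s] - h[s])
    return b, c


positions = [0.2, 0.7, 1.5, 3.1]
delta_hat = [0.5, -0.2, 0.3, 0.4]
H = [[1, 0, 1, 1], [0, 0, 1, 0]]
h = [0.5, 0.0, 1.0, 0.5]
b, c = bin_sequences(positions, delta_hat, H, h, P=3)
```

One pass over the SNPs fills both sequences, which is the part of the algorithm whose cost grows with the number of segregating sites.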

We can approximate |S[d, d′]| and the n sums in the numerator of $\hat{a} [d, d^{'}]$ in terms of convolutions of these sequences:

$$|S[d, d']| \approx \sum_{w=0}^{P} c[w]\, c[w + d]\, c[w + d + d']$$

$$\sum_{(x, y, z) \in S[d, d']} \hat{\delta}_{x} \hat{\delta}_{y} \hat{\delta}_{z} (H_{i,x} - h_{x})(H_{i,y} - h_{y})(H_{i,z} - h_{z}) \approx \sum_{w=0}^{P} b_{i}[w]\, b_{i}[w + d]\, b_{i}[w + d + d'].$$

These convolutions can be efficiently computed with an FFT, since under a two-dimensional discrete Fourier transform from (d, d′)-space to (j, k)-space,

$$\sum_{w=0}^{P} b_{i}[w]\, b_{i}[w + d]\, b_{i}[w + d + d'] \leftrightarrow B_{i}[j]\, \overline{B_{i}}[k]\, B_{i}[k - j],$$

where B_i is the one-dimensional discrete Fourier transform of b_i and, for j > 0, B_i[−j] is the j^th element from the end of B_i. Summing over i gives an approximation to the discrete Fourier transform of the numerator of $\hat{a}$, and taking the inverse discrete Fourier transform recovers the numerator itself. We apply the same method to c to approximate the denominator of $\hat{a}$.

The time complexities for the binning and the FFTs are O(S_n) and O(P² log(P)), respectively. Of these two, the first term will dominate, because P, the number of bins, is much smaller than S_n, the number of segregating sites.
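The Fourier identity used in this section can be checked on a toy sequence. The sketch below is an illustration rather than the LaNeta implementation. It uses a short pure-Python radix-2 FFT so that it runs without external libraries, and pads to the next power of two. With the e^{−2πi(jd + kd′)/L} sign convention adopted here, the transform of the triple sum reads conj(B[j])·B[j − k]·B[k], which is the complex conjugate of the product written above and corresponds to the opposite sign convention. All function names are hypothetical.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.

    With inverse=True the conjugate kernel is used, without the 1/L scaling.
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def triple_correlation_fft(b, P):
    """sum_w b[w] b[w+d] b[w+d+d'] for 0 <= d, d' <= P via Fourier transforms."""
    size = 1
    while size < 2 * P + 1:                  # zero-pad so that nothing wraps around
        size *= 2
    padded = list(b[:P + 1]) + [0.0] * (size - (P + 1))
    B = fft(padded)
    F = [[B[j].conjugate() * B[(j - k) % size] * B[k] for k in range(size)]
         for j in range(size)]
    rows = [fft(row, inverse=True) for row in F]              # invert along k
    cols = [fft([rows[j][k] for j in range(size)], inverse=True)
            for k in range(size)]                             # invert along j
    scale = size * size
    return [[cols[dp][d].real / scale for dp in range(P + 1)]
            for d in range(P + 1)]


def triple_correlation_direct(b, P):
    """The same sums evaluated term by term, for comparison."""
    out = [[0.0] * (P + 1) for _ in range(P + 1)]
    for d in range(P + 1):
        for dp in range(P + 1):
            out[d][dp] = sum(
                b[w] * b[w + d] * b[w + d + dp]
                for w in range(P + 1)
                if w + d + dp <= P
            )
    return out


P = 5
b = [0.4, -1.0, 0.3, 0.8, -0.2, 0.5]         # made-up binned values b[0..P]
direct = triple_correlation_direct(b, P)
fast = triple_correlation_fft(b, P)
```

The two tables agree up to rounding. Padding matters: because b vanishes beyond index P, no term with d, d′ ≤ P can wrap around a transform of length at least 2P + 1.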

Missing source population

When data from only one source population are available, it is still possible to estimate the weighted admixture linkage disequilibrium by estimating the difference in allele frequencies between the two source populations using the allele frequency differences between the available population and the admixed population [26, 35], by way of the following formula

$$h_{x} = f_{x}(1 - M) + g_{x} M,$$

where M is the admixture proportion. For two pulses of admixture with proportions M₁ and M₂, a similar equation holds:

$$h_{x} = (f_{x}(1 - M_{1}) + g_{x} M_{1})(1 - M_{2}) + g_{x} M_{2}. \tag{9}$$

The allele frequencies in the missing source population can be estimated from this equation by solving for the relevant unknown term (either f_x or g_x). This estimator might be noisy for rare variants, so sites with minor allele frequencies of less than 0.05 should be removed (this corresponds to standard filtering practices for real data).

When using only the admixed population itself as a reference population, the method described above will be biased if the same samples are used to estimate both the linkage disequilibrium coefficients and the weights (δ_x, δ_y, and δ_z). We cannot efficiently compute a polyache statistic as in [26]. At the cost of some power, we instead adopt the approach of [35] and separate the admixed population into two equal-sized groups. We then use one group to estimate the weights, and the other group to estimate linkage disequilibrium coefficients, and vice versa. This gives two unbiased estimates for the numerator of $\hat{a}$, which we then average.

Another challenge with real data is that the method might be unstable when admixture proportions are not known. For two pulses of admixture, we have four independent parameters: T₁, T₂, M₁ and M₂. In order to simplify the problem, one can estimate the total ancestry fractions of the source populations in the admixed population using ADMIXTURE [5] with K = 2. Assume that M is the ancestry fraction of the source population which admixed two times. Then

$$M = M_{1}(1 - M_{2}) + M_{2}.$$

This equation is closely related to Eq 9. It allows us to reduce the number of independent parameters to three, which simplifies the optimization problem substantially.
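A sketch of how the missing source frequencies might be recovered is given below. It assumes that the second source (frequencies g_x) is the available one, and it uses the one-pulse relation h_x = f_x(1 − M) + g_x M with the total fraction M = M₁(1 − M₂) + M₂. The function names, the default filter threshold argument and the toy frequencies are assumptions for illustration.

```python
def total_second_source_fraction(M1, M2):
    """Total ancestry fraction M of the source that contributed twice."""
    return M1 * (1 - M2) + M2


def infer_missing_first_source(h, g, M, maf_min=0.05):
    """Solve h_x = f_x (1 - M) + g_x M for f_x at every retained site.

    h and g are allele frequencies in the admixed sample and in the available
    (second) source. Sites whose minor allele frequency in the admixed sample
    is below maf_min are skipped. Returns {site index: (f_x, delta_x)}.
    Estimates are not clipped, so noisy inputs can fall outside [0, 1].
    """
    if not 0 <= M < 1:
        raise ValueError("M must lie in [0, 1)")
    result = {}
    for site, (hx, gx) in enumerate(zip(h, g)):
        if min(hx, 1 - hx) < maf_min:
            continue
        fx = (hx - gx * M) / (1 - M)
        result[site] = (fx, fx - gx)
    return result


M = total_second_source_fraction(0.3, 0.2)
f_true = [0.9, 0.2, 0.02]                     # made-up first-source frequencies
g = [0.1, 0.6, 0.01]                          # made-up second-source frequencies
h = [f * (1 - M) + gx * M for f, gx in zip(f_true, g)]
recovered = infer_missing_first_source(h, g, M)
```

In this toy example the third site is dropped by the frequency filter and the first two sites return the frequencies that were used to build h.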

Fitting the two-pulse model

We fit Eq 7 to the estimates of the weighted LD using non-linear least squares, with two modifications. We added a proportionality constant to account for the expected square allele frequency difference between the source populations. We also subtracted out an affine term in the weighted LD which is due to population substructure [26]. We estimated this by computing the three-way covariance between triples of chromosomes. We use the jackknife to obtain confidence intervals for the resulting estimates by leaving out each chromosome in turn and refitting on the data for the remaining chromosomes.
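The leave-one-chromosome-out jackknife can be sketched generically. In the snippet below, `fit` is a hypothetical stand-in for refitting the two-pulse model and returning one parameter (for example T₂); the function name and data layout are assumptions, and the toy call uses a simple mean in place of a model fit.

```python
import math


def jackknife(per_chromosome_data, fit):
    """Leave-one-chromosome-out jackknife.

    per_chromosome_data : list with one entry per chromosome
    fit                 : callable mapping such a list to a single number
                          (hypothetical stand-in for refitting the model)
    Returns the full-data estimate and its jackknife standard error.
    """
    n = len(per_chromosome_data)
    if n < 2:
        raise ValueError("need at least two chromosomes")
    full = fit(per_chromosome_data)
    leave_one_out = [
        fit(per_chromosome_data[:i] + per_chromosome_data[i + 1:])
        for i in range(n)
    ]
    centre = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((est - centre) ** 2 for est in leave_one_out)
    return full, math.sqrt(variance)


# Toy check: for the sample mean the jackknife reproduces the usual standard error.
estimate, se = jackknife([1.0, 2.0, 3.0, 4.0, 5.0], lambda xs: sum(xs) / len(xs))
```

For the toy data the standard error equals the sample standard deviation divided by the square root of the sample size, as expected for a mean.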

Verification and comparison

We used the package msprime [36] to generate two source populations which diverged 4000 generations ago and a coalescent simulation to generate an admixed population from the two source populations according to two-pulse and constant admixture models. We sampled 50 diploid individuals from the admixed and two source populations, each consisting of 20 chromosomes of length 1 Morgan. The effective population size was 2N = 1000 for the admixed population and two source populations. Using a two pulse model, we varied the migration probabilities and timings for each pulse to examine the accuracy of Eq 7. We also simulated data for a model with a constant rate of admixture each generation, and compared this to the predictions made by Eq 6.

Our implementation uses the Python package cyvcf2 [37] to read VCF files.

Patterns of 3-locus LD

We first evaluate the accuracy of the equations developed in this paper by comparing the analytical results to simulated data (Figs 1–3). We find that there is generally a close match between our equations and the simulated data under both the two-pulse admixture scenarios (Figs 1 and 2) and the constant-admixture scenarios (Fig 3). The exception is when the total admixture proportion M₂ + M₁(1 − M₂) is close to 0.5. As the total admixture proportion increases above 0.5, the contours for Eq 2 flip from being concave down to concave up. This transition can be seen by comparing the upper left side of Fig 2 to its lower right. At this threshold, the contours of the estimated weighted LD depend on the actual admixture fractions of the samples, which may differ from the expectation as a result of genetic drift. This mismatch between theory and simulations is most evident in Fig 2, for m₁ = 0.1, m₂ = 0.4 and m₁ = 0.2, m₂ = 0.4.

Fig 1 — The heat maps are from simulations and the contours are plotted from Eq 7. The two admixture probabilities were fixed at m₁ = m₂ = 0.2 and the times of the two admixture pulses, T₁ and T₂, were varied. Each square covers the range 0.5 cM < d, d′ < 20 cM. When the time of the more recent pulse is greater than half of that of the more ancient pulse, i.e. 2T₂ > T₁ + T₂, the contours of the resulting weighted LD surface are straight, making it difficult to distinguish from the weighted LD surface produced by a one-pulse admixture scenario.

Fig 2 — The heat maps are from simulations and the contours are plotted from Eq 7. The two admixture times were fixed at 2 and 12 generations ago (T₁ = 10 and T₂ = 2) while the admixture probabilities were varied. Each square covers the range 0.5 cM < d, d′ < 20 cM. As the total admixture proportion m₂ + m₁(1 − m₂) increases above 0.5, the contours change, reflecting that the majority of the genetic material now originates from the other population. Weighted LD surfaces for m₁ > 0.5 or m₂ > 0.5 are not shown, but are qualitatively similar to the surfaces on the lower and rightmost sides.

Fig 3 — The heat maps are from simulations and the contours from analytical results for a model in which continuous admixture started 10, 20, or 40 generations ago and stopped 5 generations before the present. Each square covers the range 0.5 cM < d, d′ < 20 cM. We varied the time of the beginning of the admixture and the total admixture probability. The admixture probability for each generation was constant, and chosen so that the total admixture proportion was either 0.3 or 0.7. When the admixture is spread over 5 generations (the leftmost column), the resulting weighted LD surface is similar to a one-pulse weighted LD surface. For longer durations, the weighted LD surfaces are similar to those produced by two pulses of admixture.

Under a continuous admixture scenario, the shape of the weighted LD surface depends on both the duration and total amount of admixture. When the duration is short, the weighted LD surfaces are indistinguishable from the weighted LD surfaces produced by one pulse of migration. As the duration increases, the contours of the weighted LD surface become more curved. The contours are concave up when the total proportion is greater than 50% and concave down when it is less. When the total proportion is exactly 50%, the amplitude of the weighted LD surface is much smaller than the sampling error.

For two pulse models, the effects of the second pulse of migration only become evident when temporal spacing between the pulses is large enough (T₁ > T₂). Otherwise, the resulting weighted LD surface cannot be distinguished from the weighted LD surface produced by one pulse of admixture. As in the case of continuous admixture the concavity of the surface contours is determined by the total admixture proportion.

Comparison to two-locus LD measures

We compared the simulation results to the two-locus weighted LD calculated by ALDER (Fig 4). The information used in estimating admixture times in ALDER is the slope of the log-scaled LD curves. Notice (Fig 4) that the slopes are somewhat similar for admixture models with identical values of the most recent admixture event (T₂). Hence, when two admixture events have occurred, estimation of admixture times tends to get weighted towards the most recent event. Generally, it would be very difficult to estimate the parameters of a model with more than one admixture event based on the shape of the admixture LD decay curve. In contrast, there is a quite clear change in the pattern of three-locus LD as long as the time between the two admixture events is sufficiently large (Fig 1).
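The behaviour of the two-locus slope can be illustrated with Eq 8 directly. Up to a constant factor, Eq 8 is M₂(1 − M₁)e^{−T₂d} + M₁e^{−(T₁ + T₂)d}, so the local slope of its logarithm is a weighted average of T₂ and T₁ + T₂. The sketch below computes that local rate; the function name and the chosen distances are assumptions for illustration.

```python
import math


def two_locus_decay_rate(d, T1, T2, M1, M2):
    """Local decay rate -d/dd log(cov(A_x, A_y)) of the two-pulse curve (Eq 8).

    Eq 8 is proportional to c_recent * exp(-T2 d) + c_old * exp(-(T1 + T2) d).
    """
    c_recent = M2 * (1 - M1)
    c_old = M1
    recent = c_recent * math.exp(-T2 * d)
    old = c_old * math.exp(-(T1 + T2) * d)
    return (T2 * recent + (T1 + T2) * old) / (recent + old)


# Distances in Morgans, spanning roughly the 0.5 to 20 cM range.
distances = [0.005, 0.05, 0.1, 0.2]
rates = [two_locus_decay_rate(d, T1=10, T2=2, M1=0.3, M2=0.2) for d in distances]
```

The rate always lies between T₂ and T₁ + T₂ and drifts towards T₂ as the distance grows, because the older component decays faster. A single-exponential fit therefore returns an intermediate date that leans towards the more recent pulse.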

Accuracy of parameter estimates

We next evaluate the utility of the method for estimating admixture times. The qualitative similarities between one pulse and two pulse admixture scenarios seen in the previous simulations under some parameter settings will naturally affect the estimates. As shown in Fig 5, when the spacing between the two pulses is small relative to their age, the median of the estimates of the timing of the second pulse is close to the true value, but the interquartile range is large. Moreover, the best fit often lies on a boundary of the parameter space which is equivalent to a one pulse admixture model. When the spacing between the pulses is larger, the estimates for the timing of the older pulse become more precise. ALDER estimates a single admixture time (which corresponds to T₁ = 0 in our models). There is less variance in this estimate, as it can be explained by a single unknown parameter (T₂) compared to three free parameters estimated in our method (T₁, T₂ and admixture proportion M₁).

Fig 5 — Twelve admixture scenarios, T₁ ∈ {0, 5, 10, 20} and T₂ ∈ {2, 5, 10}, were simulated 100 times each. The admixture probabilities were fixed at M₁ = 0.3 and M₂ = 0.2. The colored bars give the medians of estimates for each of these twelve cases, the boxes delimit the interquartile range, and the whiskers extend out to 1.5 times the interquartile range. As the time between the two pulses of admixture increases, the error in the estimates decreases (for this reason we do not include accuracy estimates for T₁ = 0; in this case the results become unreasonable). Consistent with the simulations shown in Fig 1, there is limited power to estimate the time of the more ancient admixture pulse when T₂ > T₁. ALDER estimates a single admixture time, which corresponds to T₁ = 0.

We evaluated how admixture proportion mis-specification might affect the admixture time estimates. The results are summarised in Fig 6. The timing of the most recent admixture pulse is rather stable to variation in the admixture proportion, while the timing of the older pulse turns out to be quite sensitive to it.

Applications

To illustrate the utility of the method we computed weighted LD surfaces for Mexican and Colombian individuals in the third phase of the 1000 Genomes data set [38]. These samples were previously analyzed for similar purposes by [28]. Our datasets consisted of 64 individuals from Los Angeles and 94 individuals from Medellin, respectively. We used the 104 Yoruba samples as a reference population. We removed indels and SNVs, retaining only SNPs located on autosomes. (All filters are included in utilites/preparation.sh in https://github.com/Genomics-HSE/LaNeta.) We computed the weighted LD on the genotypes to avoid effects of phasing errors.

For the Mexican samples, [28] found a small but consistent amount of African ancestry, which appeared in the population 15 generations ago, with continuing contributions from European and Native American populations since that date, but no African migration. We fitted a two-pulse model to the Mexican weighted LD surface (Fig 7) with Yoruba as the first source population and the other population being modeled as a missing source population, as previously described. The missing source population (non-Yoruban) represents an unknown admixed European and Native American population. This model was chosen to mimic the previous analysis of reference [28] which used a similar model to approximate continued gene-flow from Europeans and Native Americans. Using this set-up we estimated that the two pulses occurred 13.2 ± 1.01 and 7.9 ± 0.99 generations ago. Our results are roughly consistent with those of [28].

Fig 7 — The model with the best fit is two pulses from the non-Yoruba source population at T₁ + T₂ = 13.2 ± 1.01 and T₂ = 7.9 ± 0.99 generations ago. The weighted LD surface was estimated from real data; the level lines correspond to the best-fitting model inferred by the LaNeta method.

The weighted LD surface for the Colombian samples with Yoruba as the first source population is shown in Fig 8. From this, we estimated two pulses of non-Yoruba migration at 14.5 ± 0.74 and 3.7 ± 0.62 generations before the present. By comparison, [28] inferred two pulses of admixture, at 13 and 5 generations ago. The weighted LD surface of the Colombian samples has contours which are strongly concave up, in contrast to those of the Mexican samples.
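Because the figure captions report the older pulse as T₁ + T₂ and the more recent pulse as T₂, the gap T₁ between the pulses follows by subtraction. The short sketch below does this arithmetic for the point estimates quoted above. The function name is hypothetical, and no uncertainty is attached to the difference because the covariance between the two time estimates is not reported here.

```python
def pulse_times_to_parameters(older, recent):
    """Convert (older pulse, recent pulse) times into (T1, T2).

    T2 is the time of the more recent pulse and T1 is the number of
    generations separating the two pulses, so the older pulse is T1 + T2.
    """
    if older < recent:
        raise ValueError("the older pulse must not be more recent than the other")
    return older - recent, recent


# Point estimates quoted in the text, in generations before the present.
pulse_estimates = {"Mexican": (13.2, 7.9), "Colombian": (14.5, 3.7)}

params = {
    population: pulse_times_to_parameters(older, recent)
    for population, (older, recent) in pulse_estimates.items()
}

for population, (t1, t2) in params.items():
    print(f"{population}: T1 = {t1:.1f}, T2 = {t2:.1f}")
```

Under this parameterization the two pulses are separated by about 5.3 generations in the Mexican fit and about 10.8 generations in the Colombian fit.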

Discussion

The method presented here is an extension of previously published methods for using weighted two-locus LD to estimate admixture times. The new method uses more information in the data because it compares triples of SNPs instead of pairs. This gives the method the ability to infer admixture histories more complex than a one-pulse model. However, this comes at the price of greater estimation variances. ALDER and ROLLOFF make estimates from just tens of samples, while our method requires hundreds of samples. Part of this difference can be attributed to the fact that ALDER and ROLLOFF make inferences over a smaller class of models, but the main reason is that the two-locus methods estimate second moments of the data, while we estimate third moments. The variances of both kinds of estimates are inversely proportional to the sample size, but the constants for estimating third moments are larger. As data becomes more readily available, this disadvantage should disappear.
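The scaling argument can be illustrated with a generic simulation. The sketch below is not an analysis of weighted LD: as a stand-in, it draws standard normal samples and compares the sampling variance of the second and third sample central moments at two sample sizes. The function names, the choice of distribution, the sample sizes, and the number of replicates are all assumptions made for illustration.

```python
import random
import statistics


def central_moment(xs, k):
    """k-th sample central moment (divides by n, so slightly biased)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** k for x in xs) / len(xs)


def moment_estimator_variance(k, n, replicates, rng):
    """Variance, across replicates, of the k-th central moment of n normal draws."""
    values = [
        central_moment([rng.gauss(0.0, 1.0) for _ in range(n)], k)
        for _ in range(replicates)
    ]
    return statistics.pvariance(values)


rng = random.Random(0)  # fixed seed so the run is repeatable
results = {
    (k, n): moment_estimator_variance(k, n, 2000, rng)
    for n in (50, 200)
    for k in (2, 3)
}

for (k, n), variance in sorted(results.items()):
    print(f"moment order {k}, n = {n}: variance of the estimate = {variance:.4f}")
```

For a standard normal variable, large-sample theory gives variances of roughly 2/n for the second moment and 6/n for the third, so quadrupling n should cut both variances by about a factor of four while the third-moment estimate stays about three times noisier.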

We also note that the theory developed in this paper might be useful for other purposes than estimating admixture times. In particular, it can be used to test hypotheses regarding the spatial distribution of introgressed fragments in the genome, without relying on particular inferences of admixture tracts.

Acknowledgments

This research was supported in part through computational resources of HPC facilities at HSE University [39].

Data Availability

Software is openly available in the repository https://github.com/Genomics-HSE/LaNeta. Data sharing is not applicable as this study only analyses previously published and openly available data. The analysis uses genomic data from Yoruba, Colombian and Mexican populations. Whole-genome sequence data are available from the third phase of the 1000 Genomes Project (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/). All filters are included in LaNeta in the file preparation.sh (https://github.com/Genomics-HSE/LaNeta/blob/main/utilites/preparation.sh).

Funding Statement

MS and VS worked on this paper within the framework of the HSE University Basic Research Program (hse.ru). AM was supported by the grant RFBR 20-29-01028 (rfbr.ru). RN was supported by NIH grant R01GM138634 (nih.gov). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Reich D, Thangaraj K, Patterson N, Price AL, Singh L. Reconstructing Indian population history. Nature. 2009;461(7263):489–494. doi: 10.1038/nature08365
2. Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, et al. Ancient admixture in human history. Genetics. 2012;192(3):1065–93. doi: 10.1534/genetics.112.145037
3. Durand EY, Patterson N, Reich D, Slatkin M. Testing for ancient admixture between closely related populations. Mol Biol Evol. 2011;28(8):2239–52. doi: 10.1093/molbev/msr048
4. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155(2):945–959. doi: 10.1093/genetics/155.2.945
5. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Research. 2009;19(9):1655–1664. doi: 10.1101/gr.094052.109
6. Maples BK, Gravel S, Kenny EE, Bustamante CD. RFMix: A Discriminative Modeling Approach for Rapid and Robust Local-Ancestry Inference. The American Journal of Human Genetics. 2013;93(2):278–288. doi: 10.1016/j.ajhg.2013.06.020
7. Pool JE, Nielsen R. Inference of historical changes in migration rate from the lengths of migrant tracts. Genetics. 2009;181(2):711–719. doi: 10.1534/genetics.108.098095
8. Gravel S. Population genetics models of local ancestry. Genetics. 2012;191(2):607–619. doi: 10.1534/genetics.112.139808
9. Liang M, Nielsen R. The Lengths of Admixture Tracts. Genetics. 2014; p. genetics–114. doi: 10.1534/genetics.114.162362
10. Corbett-Detig R, Nielsen R. A Hidden Markov Model Approach for Simultaneously Estimating Local Ancestry and Admixture Time Using Next Generation Sequence Data in Samples of Arbitrary Ploidy. PLOS Genetics. 2017;13(1):1–40. doi: 10.1371/journal.pgen.1006529
11. Svedberg J, Shchur V, Reinman S, Corbett-Detig R. Inferring Adaptive Introgression Using Hidden Markov Models. Molecular Biology and Evolution. 2021;38. doi: 10.1093/molbev/msab014
12. Ni X, Yuan K, Yang X, Feng Q, Guo W, Ma Z, Xu S. Inference of multiple-wave admixtures by length distribution of ancestral tracks. Heredity. 2018;121:52–63. doi: 10.1038/s41437-017-0041-2
13. Fisher RA. The Theory of Inbreeding. Edinburgh, Scotland: Oliver and Boyd; 1949.
14. Stam P. The distribution of the fraction of the genome identical by descent in finite random mating populations. Genetics Research. 1980;35:131–155. doi: 10.1017/S0016672300014002
15. Guo SW. Computation of identity-by-descent proportions shared by two siblings. American Journal of Human Genetics. 1994;54(6):1104.
16. Bickeböller H, Thompson EA. Distribution of genome shared IBD by half-sibs: approximation by the Poisson clumping heuristic. Theoretical Population Biology. 1996;50(1):66–90. doi: 10.1006/tpbi.1996.0023
17. Bickeböller H, Thompson EA. The probability distribution of the amount of an individual’s genome surviving to the following generation. Genetics. 1996;143(2):1043–1049. doi: 10.1093/genetics/143.2.1043
18. Stefanov VT. Distribution of genome shared identical by descent by two individuals in grandparent-type relationship. Genetics. 2000;156(3):1403–1410. doi: 10.1093/genetics/156.3.1403
19. Ball F, Stefanov VT. Evaluation of identity-by-descent probabilities for half-sibs on continuous genome. Mathematical Biosciences. 2005;196(2):215–225. doi: 10.1016/j.mbs.2005.04.005
20. Cannings C. The identity by descent process along the chromosome. Human heredity. 2003;56(1-3):126–130. doi: 10.1159/000073740
21. Dimitropoulou P, Cannings C. RECSIM and INDSTATS: probabilities of identity in general genealogies. Bioinformatics. 2003;19(6):790–791. doi: 10.1093/bioinformatics/btg060
22. Walters K, Cannings C. The probability density of the total IBD length over a single autosome in unilineal relationships. Theoretical Population Biology. 2005;68(1):55–63. doi: 10.1016/j.tpb.2005.03.004
23. Rodolphe F, Martin J, Della-Chiesa E. Theoretical description of chromosome architecture after multiple back-crossing. Theoretical Population Biology. 2008;73(2):289–299. doi: 10.1016/j.tpb.2007.11.004
24. Baird SJ, Barton NH, Etheridge AM. The distribution of surviving blocks of an ancestral genome. Theoretical Population Biology. 2003;64(4):451–471. doi: 10.1016/S0040-5809(03)00098-4
25. Moorjani P, Patterson N, Hirschhorn JN, Keinan A, Hao L, Atzmon G, et al. The history of African gene flow into Southern Europeans, Levantines, and Jews. PLoS genetics. 2011;7(4):e1001373. doi: 10.1371/journal.pgen.1001373
26. Loh PR, Lipson M, Patterson N, Moorjani P, Pickrell JK, Reich D, et al. Inferring admixture histories of human populations using linkage disequilibrium. Genetics. 2013;193(4):1233–1254. doi: 10.1534/genetics.112.147330
27. Moreno-Mayar JV, Rasmussen S, Seguin-Orlando A, Rasmussen M, Liang M, Flåm ST, et al. Genome-wide Ancestry Patterns in Rapanui Suggest Pre-European Admixture with Native Americans. Current Biology. 2014. doi: 10.1016/j.cub.2014.09.057
28. Gravel S, Zakharia F, Moreno-Estrada A, Byrnes JK, Muzzio M, Rodriguez-Flores JL, et al. Reconstructing native American migrations from whole-genome and whole-exome data. PLoS genetics. 2013;9(12):e1004023. doi: 10.1371/journal.pgen.1004023
29. Bennett J. On the theory of random mating. Annals of Eugenics. 1952;17(1):311–317. doi: 10.1111/j.1469-1809.1952.tb02522.x
30. Slatkin M. On treating the chromosome as the unit of selection. Genetics. 1972;72(1):157–168. doi: 10.1093/genetics/72.1.157
31. Hellenthal G, Busby G, Band G, Wilson J, Capelli C, Falush D, Myers S. A Genetic Atlas of Human Admixture History. Science. 2014;343:747–751. doi: 10.1126/science.1243518
32. Lawson D, Hellenthal G, Myers S, Falush D. Inference of population structure using dense haplotype data. PLoS Genetics. 2012;8:e1002453. doi: 10.1371/journal.pgen.1002453
33. Liang M, Nielsen R. Understanding admixture fractions. bioRxiv. 2014; p. 008078.
34. Verdu P, Rosenberg NA. A general mechanistic model for admixture histories of hybrid populations. Genetics. 2011;189(4):1413–1426. doi: 10.1534/genetics.111.132787
35. Pickrell JK, Pritchard JK. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS genetics. 2012;8(11):e1002967. doi: 10.1371/journal.pgen.1002967
36. Kelleher J, Etheridge AM, McVean G. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS computational biology. 2016;12(5):e1004842. doi: 10.1371/journal.pcbi.1004842
37. Pedersen BS, Quinlan AR. cyvcf2: fast, flexible variant analysis with Python. Bioinformatics. 2017. doi: 10.1093/bioinformatics/btx057
38. Consortium GP, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68. doi: 10.1038/nature15393
39. Kostenetskiy P, Chulkevich R, Kozyrev V. HPC Resources of the Higher School of Economics. In: Journal of Physics: Conference Series. vol. 1740. IOP Publishing; 2021. p. 012050.

PLoS Genet. doi: 10.1371/journal.pgen.1010281.r001

Decision Letter 0

David Balding, Garrett Hellenthal

24 Mar 2022

Dear Dr Nielsen,

Thank you very much for submitting your Research Article entitled 'Estimating the timing of multiple admixture events using 3-locus Linkage Disequilibrium' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some concerns that we ask you to address in a revised manuscript.

We therefore ask you to modify the manuscript according to the review recommendations. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Garrett Hellenthal

Guest Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

Editor comments:

Thank you for your manuscript. Both reviewers were positive about your paper and believe it is likely worthy of publication in PLoS Genetics after minor amendments. We agree, and have the following points to add:

(1) It would be helpful to make clear in the Introduction that this new technique is designed to detect multiple admixture pulses involving two sources, rather than (e.g.) involving three distinct sources.

(2) Following from this, the 1000 Genomes applications were a bit unclear, given you analyse populations that are admixed from three sources. For example, what do you mean by "with Yoruba as reference"? I think you are assuming two pulses of admixture from a non-African source into an African one for these analyses? If so, it is not clear why this makes more sense than considering two pulses of African admixture. Also, what is the "non-Yoruba" source here -- is it an admixed Native/European source? It would be helpful to explicitly spell out what you used as surrogates for the admixing sources 1 and 2 here.

(3) Your paper figures are not likely to be very intuitive for most readers, including us. What are the lines in Figure 6 and 7, for example? Are these representing the best-fitting model? Most readers are likely to be more interested in plots like Figure 5, where you show estimation accuracy. But why is this only shown for T1 here, and not for T2? Readers might also be interested in what happens when the admixture proportions fixed in your model are inaccurate. While this is mentioned as a potential issue, at what point does it break down?

(4) Detailed theory and applications involving multiple pulses of admixture, or specifically admixture between two sources followed by additional admixture from another source, have been described previously (Hellenthal et al 2014, Science 343:747; Ni et al 2018, Heredity 121:52), yet you do not reference these here. I appreciate your approach is different and has novelty, but it might be worth citing these in the Introduction.

Reviewer's Comments to the Authors:

Reviewer #1: The authors provide an interesting exploration of the genomic effects of admixture by leveraging information from three linked sites. I believe the main value of their work lies in the theoretical advancement and the demonstration of its match with coalescent simulations, and by itself this part (which I was not able to fully evaluate due to my knowledge gap) is worthy of publication unless other reviewers found major issues that I may have overlooked.

I have only minor concerns regarding the practical applications of the work:

1) It is not clear whether the model choice (one/two/continuous waves of admixture) is a decision to be made a-priori by the operator or whether that appears from the results (eg. in Figure 6, should one look at the green portion of the plot to find out which are the best parameters along the Y and X axes?)

2) Although the ALDER output is somewhat used (Fig 4) as a reference for their novel results, it would be desirable to have a more clear benchmarking to assess whether A) the novel method is capable of discriminating between one, two or continuous admixture scenarios and B) by adding the ALDER (or MALDER) performance to figure 5, to what extent is the new approach better/comparable in retrieving admixture dates from the simulated data?

Reviewer #2: In this paper the authors build upon existing theory for estimating the timing of admixture events based on the decay of two-locus admixture linkage disequilibrium (ALD) by developing a framework for a three-locus model. The major advantage of this new model is that it is capable of disentangling the timings of more than one wave of admixture. The authors test their model extensively against simulated data for a number of multi-wave admixture scenarios (e.g. different admixture event timings, different admixture proportions and two-pulse vs continuous admixture) to establish how well it performs under a range of conditions. Finally they test their method on real data by modelling two-pulse admixture in Mexican and Colombian data from the 1000 Genomes Project and comparing their results with the findings of Gravel et al. 2013.

I believe this article would be of great interest to readers of PLoS Genetics. The theory developed within is novel, rigorous and extensively tested, and has the potential to improve the study of admixture timing. The authors have carefully outlined the situations in which their model performs best and those in which it loses power, which will be of great help to researchers attempting to apply it to their own data. Last but not least, they have provided an implementation of their code, making the method freely accessible to the research community. For these reasons I would recommend that the article is published once a few minor issues are addressed (see below).

1.) Reference to equations in text:

I noticed two instances where the equation referred to in the text does not seem to match the equation being used in the analysis. I could be entirely mistaken here, but I thought I’d point them out just to be safe.

1.1) References to equation 8: Equation 8 is mentioned twice in the paper as the model used for two-pulse admixture (Line 212; Lines 225-227). However when inspecting this equation in the methods it appears to be a two-locus model rather than a three-locus model (only has a single genetic distance parameter d and no second d’ parameter). Further supporting this interpretation, the line preceding equation 8 (line 134) explicitly states that it is an “expression for the two-point covariance function”.

My question is did you mean to use the two-locus model here (eq 8) or is this supposed to be equation 7? (Eq 7 appears to be the 3-locus form)

1.2.) References to equation 2: From the results section (lines 237-239 page 14) it seems like the model being tested in figures 1 and 2 is a two-pulse admixture model (Eq 7 or eq 8 depending on the answer to the above point). However in the figure legends of figures 1 and 2 (and on line 241) it is stated that the contours in these figures come from equation 2. Is this correct? I don’t see any parameters for admixture timings, or proportions (the variables being explored in the plots) in equation 2 so I am unclear how this would work.

2.) Issues with figures:

2.1.) Figures 1-3:

While the message of these plots regarding how well theory and simulations line up under different admixture scenarios is nicely illustrated, I initially found it difficult to determine what exactly is being plotted in these figures for the following reasons:

i.) No axis labels: The heatmaps lack any x and y axis labels making it unclear at first glance what is being plotted. Please provide axis labels.

ii.) Parameter labels where axis labels expected: The grid labels describing the parameters for a given simulation (e.g. T1=0, T2=10 for figure 1) fall where I would expect the missing x and y axis labels to be (bottom and left sides of a plot) leading to further confusion. Maybe separating these grid labels from the plots using a dividing line, or placing them to the right and top of the plot would help to avoid confusion.

iii.) No scale of heatmap: There is no scale provided for the heatmaps meaning the colours are hard to interpret. Please provide a scale bar.

2.2) Figure 4:

This figure needs a few adjustments to improve legibility. I recommend the following tweaks:

i.) Axis labels: Please label the x and y axes in the plot for clarity. The axes are currently described in the figure legend but readability would be increased by labelling the axes directly.

ii.) Plot key: The key in the figure could be cleaned up by reordering it into three columns grouped by T2 values (i.e. First column is all T2=2, second is all T2=5 and third is all T2=10) to improve figure legibility. This way all reds, all greens and all blues are together and the reader will understand that the colour scheme refers to the T2 values intuitively. This will also make the message of the plot clearer.

iii.) T1 values ambiguous: Currently there is no way to distinguish which lines belong to which value of T1 as the colours of each line within a T2 value group are identical. This could easily be solved by using different line types for each value of T1 (i.e. solid lines for T1=10, dashed lines for T1=5 and dotted lines for T1=10) or different shades of the grouping colour (i.e. T1=0 is light, T1=5 is medium, T1=10 is dark).

2.3.) Figures 6-7: Labelling the scale bar here may be helpful to readers.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Ross Patrick Byrne

PLoS Genet. 2022 Jul 15;18(7):e1010281. doi: 10.1371/journal.pgen.1010281.r002

Author response to Decision Letter 0

12 May 2022

Attachment

Submitted filename: Response to Reviewers.docx

PLoS Genet. doi: 10.1371/journal.pgen.1010281.r003

Decision Letter 1

David Balding, Garrett Hellenthal

2 Jun 2022

Dear Dr Nielsen,

We are pleased to inform you that your manuscript entitled "Estimating the timing of multiple admixture events using 3-locus Linkage Disequilibrium" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Thank you for addressing the editor and reviewer concerns. In preparing the final version for publication in PLoS Genetics, we ask you to make minor changes to address the following two comments:

1) In Verification and Comparison, please remind readers of the definition of T1 and T2, as this may be missed for those that only quickly peruse the Methods.

2) In Fig 6 state the true dates of T1 and T2, and true admixture proportion here (presumably 0.36 ?)

In addition, before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Garrett Hellenthal

Guest Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

PLoS Genet. doi: 10.1371/journal.pgen.1010281.r004

Acceptance letter

David Balding, Garrett Hellenthal

12 Jul 2022

PGENETICS-D-21-01477R1

Estimating the timing of multiple admixture events using 3-locus Linkage Disequilibrium

Dear Dr Nielsen,

We are pleased to inform you that your manuscript entitled "Estimating the timing of multiple admixture events using 3-locus Linkage Disequilibrium" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Olena Szabo

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data


Supplementary Materials

Attachment

Submitted filename: Response to Reviewers.docx


Data Availability Statement

1. Reich D, Thangaraj K, Patterson N, Price AL, Singh L. Reconstructing Indian population history. Nature. 2009;461(7263):489–494. doi: 10.1038/nature08365

2. Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, et al. Ancient admixture in human history. Genetics. 2012;192(3):1065–1093. doi: 10.1534/genetics.112.145037

3. Durand EY, Patterson N, Reich D, Slatkin M. Testing for ancient admixture between closely related populations. Mol Biol Evol. 2011;28(8):2239–2252. doi: 10.1093/molbev/msr048

4. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155(2):945–959. doi: 10.1093/genetics/155.2.945

5. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Research. 2009;19(9):1655–1664. doi: 10.1101/gr.094052.109

6. Maples BK, Gravel S, Kenny EE, Bustamante CD. RFMix: A Discriminative Modeling Approach for Rapid and Robust Local-Ancestry Inference. The American Journal of Human Genetics. 2013;93(2):278–288. doi: 10.1016/j.ajhg.2013.06.020

7. Pool JE, Nielsen R. Inference of historical changes in migration rate from the lengths of migrant tracts. Genetics. 2009;181(2):711–719. doi: 10.1534/genetics.108.098095

8. Gravel S. Population genetics models of local ancestry. Genetics. 2012;191(2):607–619. doi: 10.1534/genetics.112.139808

9. Liang M, Nielsen R. The Lengths of Admixture Tracts. Genetics. 2014; p. genetics–114. doi: 10.1534/genetics.114.162362

10. Corbett-Detig R, Nielsen R. A Hidden Markov Model Approach for Simultaneously Estimating Local Ancestry and Admixture Time Using Next Generation Sequence Data in Samples of Arbitrary Ploidy. PLOS Genetics. 2017;13(1):1–40. doi: 10.1371/journal.pgen.1006529

11. Svedberg J, Shchur V, Reinman S, Corbett-Detig R. Inferring Adaptive Introgression Using Hidden Markov Models. Molecular Biology and Evolution. 2021;38. doi: 10.1093/molbev/msab014

12. Ni X, Yuan K, Yang X, Feng Q, Guo W, Ma Z, Xu S. Inference of multiple-wave admixtures by length distribution of ancestral tracks. Heredity. 2018;121:52–63. doi: 10.1038/s41437-017-0041-2

13. Fisher RA. The Theory of Inbreeding. Edinburgh, Scotland: Oliver and Boyd; 1949.

14. Stam P. The distribution of the fraction of the genome identical by descent in finite random mating populations. Genetics Research. 1980;35:131–155. doi: 10.1017/S0016672300014002

15. Guo SW. Computation of identity-by-descent proportions shared by two siblings. American Journal of Human Genetics. 1994;54(6):1104.

16. Bickeböller H, Thompson EA. Distribution of genome shared IBD by half-sibs: approximation by the Poisson clumping heuristic. Theoretical Population Biology. 1996;50(1):66–90. doi: 10.1006/tpbi.1996.0023

17. Bickeböller H, Thompson EA. The probability distribution of the amount of an individual’s genome surviving to the following generation. Genetics. 1996;143(2):1043–1049. doi: 10.1093/genetics/143.2.1043

18. Stefanov VT. Distribution of genome shared identical by descent by two individuals in grandparent-type relationship. Genetics. 2000;156(3):1403–1410. doi: 10.1093/genetics/156.3.1403

19. Ball F, Stefanov VT. Evaluation of identity-by-descent probabilities for half-sibs on continuous genome. Mathematical Biosciences. 2005;196(2):215–225. doi: 10.1016/j.mbs.2005.04.005

20. Cannings C. The identity by descent process along the chromosome. Human Heredity. 2003;56(1-3):126–130. doi: 10.1159/000073740

21. Dimitropoulou P, Cannings C. RECSIM and INDSTATS: probabilities of identity in general genealogies. Bioinformatics. 2003;19(6):790–791. doi: 10.1093/bioinformatics/btg060

22. Walters K, Cannings C. The probability density of the total IBD length over a single autosome in unilineal relationships. Theoretical Population Biology. 2005;68(1):55–63. doi: 10.1016/j.tpb.2005.03.004

23. Rodolphe F, Martin J, Della-Chiesa E. Theoretical description of chromosome architecture after multiple back-crossing. Theoretical Population Biology. 2008;73(2):289–299. doi: 10.1016/j.tpb.2007.11.004

24. Baird SJ, Barton NH, Etheridge AM. The distribution of surviving blocks of an ancestral genome. Theoretical Population Biology. 2003;64(4):451–471. doi: 10.1016/S0040-5809(03)00098-4

25. Moorjani P, Patterson N, Hirschhorn JN, Keinan A, Hao L, Atzmon G, et al. The history of African gene flow into Southern Europeans, Levantines, and Jews. PLoS Genetics. 2011;7(4):e1001373. doi: 10.1371/journal.pgen.1001373

26. Loh PR, Lipson M, Patterson N, Moorjani P, Pickrell JK, Reich D, et al. Inferring admixture histories of human populations using linkage disequilibrium. Genetics. 2013;193(4):1233–1254. doi: 10.1534/genetics.112.147330

27. Moreno-Mayar JV, Rasmussen S, Seguin-Orlando A, Rasmussen M, Liang M, Flåm ST, et al. Genome-wide Ancestry Patterns in Rapanui Suggest Pre-European Admixture with Native Americans. Current Biology. 2014. doi: 10.1016/j.cub.2014.09.057

28. Gravel S, Zakharia F, Moreno-Estrada A, Byrnes JK, Muzzio M, Rodriguez-Flores JL, et al. Reconstructing native American migrations from whole-genome and whole-exome data. PLoS Genetics. 2013;9(12):e1004023. doi: 10.1371/journal.pgen.1004023

29. Bennett J. On the theory of random mating. Annals of Eugenics. 1952;17(1):311–317. doi: 10.1111/j.1469-1809.1952.tb02522.x

30. Slatkin M. On treating the chromosome as the unit of selection. Genetics. 1972;72(1):157–168. doi: 10.1093/genetics/72.1.157

31. Hellenthal G, Busby G, Band G, Wilson J, Capelli C, Falush D, Myers S. A Genetic Atlas of Human Admixture History. Science. 2014;343:747–751. https://www.science.org/doi/abs/10.1126/science.1243518

32. Lawson D, Hellenthal G, Myers S, Falush D. Inference of population structure using dense haplotype data. PLoS Genetics. 2012;8:e1002453. doi: 10.1371/journal.pgen.1002453

33. Liang M, Nielsen R. Understanding admixture fractions. bioRxiv. 2014; p. 008078.

34. Verdu P, Rosenberg NA. A general mechanistic model for admixture histories of hybrid populations. Genetics. 2011;189(4):1413–1426. doi: 10.1534/genetics.111.132787

35. Pickrell JK, Pritchard JK. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genetics. 2012;8(11):e1002967. doi: 10.1371/journal.pgen.1002967

36. Kelleher J, Etheridge AM, McVean G. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Computational Biology. 2016;12(5):e1004842. doi: 10.1371/journal.pcbi.1004842

37. Pedersen BS, Quinlan AR. cyvcf2: fast, flexible variant analysis with Python. Bioinformatics. 2017. doi: 10.1093/bioinformatics/btx057

38. Consortium GP, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68. doi: 10.1038/nature15393

39. Kostenetskiy P, Chulkevich R, Kozyrev V. HPC Resources of the Higher School of Economics. In: Journal of Physics: Conference Series. vol. 1740. IOP Publishing; 2021. p. 012050.

