PLOS Genetics. 2024 Feb 8;20(2):e1010836. doi: 10.1371/journal.pgen.1010836

TRAILS: Tree reconstruction of ancestry using incomplete lineage sorting

Iker Rivas-González 1,*, Mikkel H Schierup 1, John Wakeley 2, Asger Hobolth 3
Editor: Pier Francesco Palamara 4
PMCID: PMC10880969  PMID: 38330138

Abstract

Genome-wide genealogies of multiple species carry detailed information about demographic and selection processes on individual branches of the phylogeny. Here, we introduce TRAILS, a hidden Markov model that accurately infers time-resolved population genetics parameters, such as ancestral effective population sizes and speciation times, for ancestral branches using a multi-species alignment of three species and an outgroup. TRAILS leverages the information contained in incomplete lineage sorting fragments by modelling genealogies along the genome as rooted three-leaved trees, each with a topology and two coalescent events happening in discretized time intervals within the phylogeny. Posterior decoding of the hidden Markov model can be used to infer the ancestral recombination graph for the alignment and details on demographic changes within a branch. Since TRAILS performs posterior decoding at the base-pair level, genome-wide scans based on the posterior probabilities can be devised to detect deviations from neutrality. Using TRAILS on a human-chimp-gorilla-orangutan alignment, we recover speciation parameters and extract information about the topology and coalescent times at high resolution.

Author summary

DNA sequences can be compared to reconstruct the evolutionary history of different species. While the ancestral history is usually represented by a single phylogenetic tree, speciation is a more complex process, and, due to the effect of recombination, different parts of the genome might follow different genealogies. For example, even though humans are more closely related to chimps than to gorillas, around 15% of our genome is more similar to the gorilla genome than to the chimp one. Even for those parts of the genome that do follow the same human-chimp topology, we might encounter a last common ancestor at different time points in the past for different genomic fragments. Here, we present TRAILS, a new framework that utilizes the information contained in all these genealogies to reconstruct the speciation process. TRAILS infers unbiased estimates of the speciation times and the ancestral effective population sizes, improving the accuracy when compared to previous methods. TRAILS also reconstructs the genealogy at the highest resolution, inferring, for example, when common ancestry was found for different parts of the genome. This information can also be used to detect deviations from neutrality, effectively inferring natural selection that happened millions of years ago. We validate the method using extensive simulations, and we apply TRAILS to a human-chimp-gorilla multiple genome alignment, from which we recover speciation parameters that are in good agreement with previous estimates.

Introduction

Orthologous sites in two or more sequences share a unique genealogical history, with coalescent events happening at certain time points in the past. In the absence of recombination, all sites along the sequences follow the same genealogy. In reality, however, ancestral recombination events might have decoupled consecutive sites, generating an array of segments with different yet correlated genealogies, collectively known as the ancestral recombination graph (ARG) [1, 2]. In principle, if inferred accurately, the ARG contains all available information about the demography of the samples, and it can be used to estimate population parameters (such as the recombination rate and the ancestral effective population sizes), historical events (such as introgression and hybridization), and selective processes [3]. The ARG, however, is challenging to infer because the underlying genealogies along the genome alignment cannot be directly observed. Instead, inference of the genealogy along the genome relies on the site patterns of the accumulated mutations.

The ARG can also be formulated as a spatial process along the genomic alignment [4]. This process, however, contains a long-range correlation structure because if two recombination events happen flanking a genomic fragment, the fragment might be surrounded by the exact same genealogy. However, disregarding the fact that the process is non-Markovian in nature, the ARG can be approximated by a hidden Markov model (HMM), where the genealogy of a certain genomic position only depends on the genealogy of the previous position [5, 6]. It has been shown that this approach, commonly referred to as sequentially Markovian coalescent or SMC, is a good approximation of the true coalescent-with-recombination process [7]. Perhaps the simplest of such models is the pairwise sequentially Markovian coalescent (PSMC) [8], in which the ARG between two sequences (typically, the two copies of a diploid individual) is modelled. Here, the hidden states are coalescent events that happen in discretized time intervals, which correspond to two-leafed gene trees (Fig 1A). The transition probabilities between pairs of hidden states can be calculated using standard coalescent theory, parameterized by the recombination rate and the ancestral effective population sizes (Ne) in each time interval [8]. PSMC, and other SMCs, such as MSMC [9], MSMC2 [10], ASMC [11], and SMC++ [12], allow the use of standard HMM machinery to infer population parameters, and are thus also useful for inferring the most plausible coalescent times from the posterior decoding. However, SMC models are generally restricted to a single coalescent event between a pair of samples, which limits their usefulness. More recently, there have been new developments to model multiple samples explicitly. For example, ARGweaver [13], Relate [14], tsinfer+tsdate [15, 16] or ARG-Needle [17] use techniques such as resampling, threading and mathematical approximations to sequentially build the ARG [18].

Fig 1. TRAILS is an HMM that reconstructs the time-resolved multi-species ARG for three genomes.

TRAILS extends the isolation model (B) [19] to three species, by combining the time discretization of PSMC-like models (A) [8] with the topologically aware hidden states of CoalHMM (C) [20, 21]. The resulting hidden states of TRAILS (D) are three-leaved genealogies with two discretized coalescent events and one of four possible topologies. A full list of the 27 possible hidden states when nAB = nABC = 3 can be consulted in Fig K in S1 Text.

These models are typically used to analyze samples from the same species to get within-species information about the ancestral process. Analyzing inter-species coalescent events adds another layer of complexity, since the coalescent events need to be contained within the underlying phylogeny or speciation tree [22–24]. Moreover, the models described above typically use the presence or absence of a certain mutation to construct haplotypes, but ignore or filter out instances where more than two alleles are observed. This infinite sites model poses a problem for inter-species analysis, because recurrent mutation is more likely to happen, generating instances of sites that have experienced more than a single mutation [25, 26].

Some other models have tried to extend these concepts for the analyses of multiple species. For example, the coalescent-with-isolation model [19] is conceptually similar to PSMC, but, backwards in time, the two analyzed samples are kept isolated until the speciation event, after which they can coalesce (Fig 1B). This model can be used to estimate the speciation time between the two samples and the Ne of the ancestral species, and an extension of it can be used to model isolation-with-migration [27]. These models, similar to SMCs, can output a posterior decoding of the coalescent times.

Beyond two samples, CoalHMM models the coalescent with recombination of three species [20, 21], where the hidden states are the four possible genealogies that might arise within the underlying species tree (Fig 1C). Two of the four genealogies differ from the species tree, which generate incongruencies that might pose a problem for standard phylogenetic reconstruction. Nevertheless, this phenomenon, commonly referred to as incomplete lineage sorting or ILS, is very informative about the demographic parameters of the underlying species tree, and CoalHMM can thus be used to estimate ancestral Ne and two speciation times. Moreover, CoalHMM uses a substitution model for mutations, so recurrent mutations are allowed. However, unlike SMCs, CoalHMM does not model coalescent events at discretized time intervals and, instead, coalescent times are modelled as single time points within an individual branch. Because of this, some of the parameter estimates of CoalHMM are biased [21], and, although obtaining accurate estimates is still possible [28], the debiasing procedure involves costly coalescent simulations. Moreover, posterior decoding can only be performed on the topology of the gene trees, and not on the coalescent times.

Here we present TRAILS, an HMM that combines modelling the information-rich ILS signal in the style of CoalHMM and the time discretization of SMC-like models to infer unbiased estimates of the demographic parameters (ancestral Ne and speciation times), and to enable the posterior decoding of both topology and coalescent times. In TRAILS, the hidden states are three-leaved gene trees, each with a specified topology and two coalescent events that happen at discretized time intervals on an underlying speciation tree (Fig 1D and Fig K in S1 Text). The genealogies are rooted by a fourth sample from an outgroup species. The transition probabilities between the hidden states of TRAILS are calculated using coalescent-with-recombination theory for one, two and three lineages that segregate within the branches of the phylogeny. We provide formulas in matrix notation to calculate these transition probabilities for a varying number of discretized time intervals (see Methods for a short explanation, and S1 Text for an in-depth description of the theory). The emitted states are sites in a four-way multiple genome alignment, containing the sequences of the three species and the outgroup. The transition and emission probabilities are parameterized by two ancestral Ne, speciation times, and the recombination rate. Keeping the mutation rate at a fixed value, TRAILS allows for the estimation of the other parameters by optimizing the HMM likelihood given the alignment. After fitting the HMM, TRAILS can perform posterior decoding of the hidden states, inferring a posterior probability of coalescent events through time within the speciation tree.
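The size of this state space can be illustrated with a short count. The sketch below is not taken from the TRAILS implementation; it assumes that V0 states pair any of the nAB intervals between speciation events with any of the nABC deep intervals, and that V1, V2 and V3 states place both coalescents in the deep intervals with the first no deeper than the second (ties allowed). Under these assumptions the count reproduces the 27 states quoted for nAB = nABC = 3 in Fig 1 and the four states of the original CoalHMM. The function name is hypothetical.

```python
def count_hidden_states(n_ab: int, n_abc: int) -> int:
    """Count hidden states under the enumeration assumed in the lead-in."""
    # V0: first coalescent in one of the n_ab intervals between speciation
    # events, second coalescent in one of the n_abc deep intervals.
    v0_states = n_ab * n_abc
    # V1, V2, V3: both coalescents in the deep intervals, with the first
    # interval no deeper than the second (ties allowed).
    deep_states = n_abc * (n_abc + 1) // 2
    return v0_states + 3 * deep_states


print(count_hidden_states(3, 3))  # 27
print(count_hidden_states(1, 1))  # 4
```

The quadratic growth in nABC is one reason why the number of intervals has to be balanced against running time.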

Here we derive the transition and emission probabilities, implement the model and demonstrate its use on simulated and real data. After optimizing the population parameters using TRAILS on a simulated dataset, we show that increasing the number of discrete coalescence intervals reduces the bias in the parameter estimation. We also show how the posterior decoding can accurately reconstruct the true ARG, by inferring the topology of gene trees and the time in which coalescent events occurred. We perform additional simulations to show that the posterior decoding of TRAILS can be used to detect selective sweeps that happened on ancestral branches of the phylogeny. Finally, we analyze a human-chimp-gorilla-orangutan alignment, inferring the demographic parameters of the underlying species tree and performing genome-wide posterior decoding at the base-pair level.

Results

Parameter estimation

The transition and emission probabilities of coalescent hidden Markov models (HMMs) are parameterized by the demographic model, i.e., by the speciation times, ancestral effective population sizes (Ne) and recombination rate (ρ). This means that using standard HMM algorithms, these parameters can be optimized to obtain the model that best explains the observed data. Both in CoalHMM and in TRAILS, numerical optimization is performed on the log-likelihood calculated using the forward algorithm, given a four-way genome alignment of three focal species and an outgroup. The maximum likelihood estimates of the demographic model are then found using a bound-constrained search algorithm that optimizes the likelihood function by evaluating it directly.
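As an illustration of the likelihood being optimized, the following is a minimal sketch of the scaled forward algorithm for a generic discrete HMM. It is not the TRAILS code: the function name, argument layout and toy numbers are assumptions for illustration, and in a coalescent HMM the `init`, `trans` and `emit` tables would themselves be functions of the demographic parameters.

```python
import math


def forward_log_likelihood(init, trans, emit, obs):
    """Log-likelihood of an observation sequence under a discrete HMM.

    init[k]     : probability of starting in hidden state k
    trans[j][k] : probability of moving from state j to state k
    emit[k][o]  : probability that state k emits symbol o
    obs         : sequence of integer symbol indices
    """
    n_states = len(init)
    # Forward variables for the first site.
    alpha = [init[k] * emit[k][obs[0]] for k in range(n_states)]
    scale = sum(alpha)
    log_lik = math.log(scale)
    alpha = [a / scale for a in alpha]
    for symbol in obs[1:]:
        # Propagate through the transition matrix, then weight by emission.
        alpha = [
            emit[k][symbol] * sum(alpha[j] * trans[j][k] for j in range(n_states))
            for k in range(n_states)
        ]
        # Rescale at every site to avoid underflow on long alignments.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


# Toy two-state model (made-up numbers, for illustration only).
toy_init = [0.5, 0.5]
toy_trans = [[0.9, 0.1], [0.1, 0.9]]
toy_emit = [[0.8, 0.2], [0.2, 0.8]]
print(forward_log_likelihood(toy_init, toy_trans, toy_emit, [0, 0, 1, 1]))
```

A function of this kind, wrapped so that it maps a parameter vector to a negative log-likelihood, is what a bound-constrained optimizer would evaluate repeatedly.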

Previous work using coalescent HMMs has shown that the estimation of the demographic parameters is challenging. In CoalHMM, for example, the parameter estimates are highly biased [21], especially for the ancestral Ne and the recombination rate. It is possible to obtain close to unbiased estimates, but this requires a costly simulation procedure [28]. The source of the bias seems to be the restrictive state space of CoalHMM [21], which includes the topology of the genealogy but no information on when the coalescents happened within each branch of the tree (Fig 1C).

TRAILS overcomes this issue by extending the state space to include coalescent events that can happen in discretized time intervals. To demonstrate that TRAILS can perform unbiased parameter estimation, we generated twenty 10-Mb four-way alignments using msprime [29] by choosing a demographic model similar to the human-chimp-gorilla-orangutan speciation tree (see Methods, Simulations for details). The simulated sequence alignments were analyzed using TRAILS, estimating the times, ancestral Ne values and recombination rate depicted in Fig 2A. Parameters were estimated for nAB = nABC = 1 and nAB = nABC = 5, where nAB is the number of intervals between speciation events and nABC is the number of intervals deep in time, in the common ancestor of all three species.

Fig 2. Increasing the complexity of TRAILS reduces the bias of the estimated parameters.

(A) Diagram (not to scale) of the demographic model with all the optimized parameters in blue for the non-ultrametric case. In an ultrametric model, t1 would correspond to the time from present to the shallowest speciation event, where t1 = tA = tB = tC − t2. (B) Relative error of parameter values estimated from 20 simulated msprime genomes for nAB = nABC = 1 (in pink) and nAB = nABC = 5 (in green). Each independent run corresponds to a dot, vertical lines are median values, and horizontal lines correspond to interquartile ranges. Units are normalized as (estimated − true)/true to ease comparison across parameters.

In the model where nAB = nABC = 1, which is equivalent to the original CoalHMM model, parameter estimates deviate from their true values in the simulations, especially for t2, NAB and ρ (Fig 2B, in pink). Using a larger number of intervals (nAB = nABC = 5) improves the accuracy of the parameter estimation (Fig 2B, in green). An exception to this is the recombination rate, which is still underestimated (albeit less so), possibly due to recombination events that produce small changes in the coalescent tree (e.g., not changing the topology and only moving the coalescent times by a few generations) and are thus missed from the sequence data. Generally, however, the simulation results demonstrate that the source of the bias in the parameter estimation in CoalHMM can be alleviated with a more flexible model that includes coalescent times at discretized time intervals.

Posterior decoding of simulated data

Posterior decoding using the parameters estimated from the alignment can be performed using the transition and emission probabilities computed by TRAILS for a specific demographic model. In contrast to other coalescent-based HMMs, the resulting posterior probabilities are, however, hard to visualize, since each hidden state will have its own topology, and first (or more recent) and second (or more ancient) coalescent time intervals (Fig 1D and Fig K in S1 Text). To overcome this, we summarize the posterior probabilities by grouping states that share certain features. For example, the posterior probabilities of all states that share the same topology can be summed. Similarly, the posteriors of all states with the same first or second coalescent times can also be summed.
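The grouping described above amounts to marginalizing the per-site posterior over state labels. The sketch below assumes a hypothetical representation in which each hidden state is labelled by a (topology, first interval, second interval) tuple and the posterior is stored as one list of probabilities per site; neither the layout nor the names come from the TRAILS code.

```python
from collections import defaultdict


def marginalize_posterior(posterior, state_labels, feature):
    """Sum per-site posteriors over all states that share one feature.

    posterior    : posterior[site][state], each row summing to 1
    state_labels : one (topology, first_interval, second_interval) per state
    feature      : 0 = topology, 1 = first interval, 2 = second interval
    """
    summarized = []
    for site_probs in posterior:
        grouped = defaultdict(float)
        for prob, label in zip(site_probs, state_labels):
            grouped[label[feature]] += prob
        summarized.append(dict(grouped))
    return summarized


# Three hypothetical states and a single site.
labels = [("V0", 1, 1), ("V0", 1, 2), ("V2", 1, 2)]
site_posteriors = [[0.2, 0.3, 0.5]]
print(marginalize_posterior(site_posteriors, labels, feature=0))
print(marginalize_posterior(site_posteriors, labels, feature=2))
```

Because every state carries exactly one value of each feature, each summarized row still sums to one.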

In order to have a ground truth for comparison, the posterior decoding was performed on 100 kb of an alignment simulated using msprime, with a demographic model identical to that used in Fig 2B (see Methods for a full description of the model). The resulting posteriors can capture the true topology and the second coalescent time quite accurately (Fig 3A and 3B, respectively), while the first coalescent time (Fig 3C) is harder to estimate. Additionally, V1 segments (Fig 3A) are easily misclassified as V0 segments, since V0 and V1 only differ in branch lengths but not in topology (see Fig 3 and Fig I in S1 Text), and, thus, the emitted site patterns for V0 and V1 are similar.

Fig 3. Posterior probability for the topology (A), second coalescent event (B) and first coalescent event (C) of a 100 kb msprime simulation.

“First” and “second” refer to the order in which coalescent events happen, backwards in time. The true empirical topology and coalescent times are plotted as green lines.

Posterior decoding from simulated data with selection

To showcase how the posterior decoding could be useful to infer deviations from the standard coalescent, we simulated a 200 kb alignment using SLiM [30] containing a single positively selected variant (2Nes = 175) in position 100 kb of the simulated alignment that arises at the first interval backwards in time, where the first two species merge in the speciation tree (interval S0 in Fig 4A). The site is strongly positively selected, but it lies well within the possible range of values for selection coefficients recorded in humans, with the lactase gene having up to 2Nes = 1000 in some human populations [31]. The demographic parameters were the same as those used for Figs 2 and 3.

Fig 4. The posterior probability can be used to detect deviations from the neutral expectation.

(A) Posterior probability of the second coalescent event for a simulated 200 kb region containing a positively selected variant at position 100 kb that arises in interval S0, represented by a triangle. The true simulated coalescent times are plotted as green horizontal lines. (B) Mean posterior probability for each second coalescent interval (purple), and the empirical true proportion of sites for each interval (green) for 20 simulated replicates with a selective sweep, using the same model as in (A). The theoretical neutral expectation is plotted as a black dashed line, and time intervals are adjusted so that all intervals have equal probability of observing a coalescent event. Continuous vertical lines represent mean values of the simulations. (C) Same as in (B), but for a neutrally evolving region, using the same model as in Fig 3.

Both the true empirical values and the posterior decoding show that there is an overrepresentation of second coalescent events happening in interval S0 (Fig 4A), which is qualitatively different from the neutral case (Fig 3B). The positively selected variant confers a big advantage and is fixed rapidly in the population, with an expected fixation time of $\frac{4}{s}\left(\ln(2N_e s) + \gamma - \frac{1}{2N_e s}\right) \approx 2622$ generations, or 65,550 years, for Ne = 10,000 and a generation time of g = 25 years, where γ ≈ 0.577 is Euler’s constant [32]. In contrast, for a neutrally evolving site, the expected fixation time is 4Ne = 40,000 generations or 1 million years [33]. The effect of such strong selective force is that whatever polymorphism existed at the selected locus is quickly purged from the population, and, with it, linked neutral variants are hitchhiking along, causing a selective sweep. As a result, there is an excess of coalescent events happening in interval S0, which can be discerned from the posterior probabilities.
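The fixation times quoted above can be checked numerically. The sketch below evaluates the approximation (4/s)(ln(2Nes) + γ − 1/(2Nes)) for 2Nes = 175 and Ne = 10,000, and compares it with the neutral expectation of 4Ne generations; the function name is ours and serves only as a worked example of the arithmetic.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant


def sweep_fixation_time(ne: float, two_ne_s: float) -> float:
    """Expected fixation time (generations) of a strongly selected variant."""
    s = two_ne_s / (2 * ne)  # selection coefficient implied by 2*Ne*s
    return (4 / s) * (math.log(two_ne_s) + EULER_GAMMA - 1 / two_ne_s)


GENERATION_TIME = 25  # years per generation

selected = sweep_fixation_time(ne=10_000, two_ne_s=175)
neutral = 4 * 10_000

print(round(selected))                    # 2622
print(round(selected) * GENERATION_TIME)  # 65550
print(neutral * GENERATION_TIME)          # 1000000
```

The selected variant therefore fixes roughly fifteen times faster than a neutral one under these parameters.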

The signal observed in the posterior decoding can be summarized by computing the mean posterior probability per time interval and comparing it to the theoretical neutral expectation. There is a clear excess of coalescent events estimated to happen at the interval where the beneficial mutation arises (S0 in Fig 4B), although there is also an excess of coalescent events inferred by the posterior in nearby intervals (S1 and S2), where coalescents are misclassified due to close proximity in time. In any case, the pattern observed for the selective sweep in Fig 4B is in stark contrast with the neutral case shown in Fig 4C, where the posterior falls within the expected values. This demonstrates that deviations from neutrality can be inferred using the posterior decoding of TRAILS, and one could devise a windowed genome-wide scan for selection by summarizing the posterior as proposed in Fig 4B and 4C.

Parameter estimation from an HCGO alignment

ILS happens pervasively on the branches of the tree of life, spanning taxonomically diverse groups such as marsupials [34], birds [35, 36], fishes [37], plants [38], and mammals [39, 40], including primates [28, 41, 42]. For example, there is around 32% of ILS in the human-chimp ancestor, with 16% of the genome grouping human and gorilla, and another 16% grouping chimp and gorilla [28]. These estimates were obtained using CoalHMM [21], together with estimates for ancestral Ne and split times, which were debiased using simulations. Here, we apply TRAILS to a 50 Mb human-chimp-gorilla alignment from chromosome 1 with orangutan as an outgroup to infer population genetics parameters and to gain information about the coalescent times and the topology through posterior decoding.

Using MafFilter [43], the alignment was first preprocessed to extract the species of interest (human, chimp, gorilla and orangutan), to merge consecutive alignment blocks, and to filter out small blocks (see Methods for further details). Using biopython [44], 50 Mb were extracted from chr1, namely the region from 25 Mb to 75 Mb. This region was used as the input for TRAILS, choosing the parameter values estimated in Rivas-González et al. [28] as starting values for the optimized parameters, setting nAB = nABC = 3, and using the L-BFGS-B algorithm for model fitting [45, 46], although other bound-constrained methods can also be used. To get more accurate parameter estimates, the optimized parameters were used as starting parameters for a second TRAILS run, where nAB = 3 and nABC = 5.

The resulting estimates are displayed in Fig 5A. Assuming a mutation rate of μ = 1.25 × 10⁻⁸ per site per generation and a generation time of g = 25 years [47, 48], the speciation time estimates are in good agreement with previously inferred values. Using the human tip branch length, we estimate the time until the HC split at 5.51 million years ago (95% CI: [5.43, 5.54], ∼4–7 MYA from literature [21, 28, 49–51]), the HCG split at 10.40 MYA (95% CI: [10.27, 10.40], ∼8–12 MYA from literature [21, 28, 50, 51]), and the HCGO split at 18.55 MYA (95% CI: [18.37, 18.73], ∼10–20 MYA from literature [28, 41, 50]). Moreover, ancestral Ne inferred for the HC ancestor (167,400, 95% CI: [165,548, 170,361]) and for the HCG ancestor (101,290, 95% CI: [100,467, 101,492]) are consistent with previous estimates using CoalHMM (177,368 and 106,702, respectively [28]). Using the estimates for t2 (in generations) and NAB, we get a probability of ILS equal to

$$\mathrm{ILS} = \frac{2}{3}\exp\left(-\frac{t_2}{2N_{AB}}\right) = \frac{2}{3}\exp\left(-\frac{195{,}000}{2 \times 167{,}400}\right) = 37\%,$$

so our parameter estimates suggest the ((human, gorilla), chimp) topology in 18.5% of the genome and the ((chimp, gorilla), human) topology in 18.5% of the genome. Finally, the recombination rate was estimated to be ρ = 1.19 × 10⁻⁸ per site per generation, which matches the rate estimated for present-day humans [52].
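The ILS proportion follows directly from the two estimates. As a worked example, the sketch below evaluates (2/3)·exp(−t2/(2NAB)) with t2 = 195,000 generations and NAB = 167,400; the function name is hypothetical and the code only restates the arithmetic.

```python
import math


def ils_probability(t2_generations: float, n_ab: float) -> float:
    """Probability that a site follows one of the two ILS topologies.

    exp(-t2 / (2 * N_AB)) is the probability that the two lineages fail to
    coalesce between the speciation events; two of the three equally likely
    pairings in the deeper ancestor then disagree with the species tree.
    """
    return (2 / 3) * math.exp(-t2_generations / (2 * n_ab))


total_ils = ils_probability(195_000, 167_400)
print(f"Total ILS: {total_ils:.0%}")  # Total ILS: 37%
print(f"Per alternative topology: {total_ils / 2:.3f}")
```

Each of the two alternative topologies receives half of the total, which gives the roughly 18.5% quoted for each.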

Fig 5. TRAILS output for 50 Mb of chromosome 1 of the HCGO alignment.

(A) Estimates for the speciation times (green) and ancestral Ne (purple) of the speciation process, optimized using TRAILS and assuming a mutation rate of μ = 1.25 × 10⁻⁸ per site per generation. To convert time from generations to millions of years, a generation time of g = 25 years per generation was used. (B) Genome-wide variation of ILS, and first and second coalescent times. (C) Posterior decoding of the topology, and first and second coalescent events for a zoomed-in region in chromosome 1. As in Fig 3, both V0 and V1 correspond to the species topology (((H,C),G),O);, V2 corresponds to (((H,G),C),O);, and V3 to (((C,G),H),O);. The LDLRAD1 gene is plotted on top, where exons are represented as boxes, coding regions as filled boxes, and introns as horizontal lines.

TRAILS allows for the independent estimation of each individual branch length, which is useful for non-ultrametric trees. Fig 5A shows that the branch leading to chimps is longer than that leading to humans by around 5.9%, and the gorilla branch is longer than both the human (12.6%) and the chimp (9.1%) branches (calculated from the second speciation event to present). This deviation from the molecular clock is well supported by previous studies [53], and is likely because of different branches accumulating a different number of mutations per year, either due to an acceleration or deceleration of the mutational process, changes in the average time of reproduction, or a combination of these [54–56].

In summary, we have demonstrated that TRAILS is able to infer demographic parameters that are in agreement with estimates from the literature. More importantly, it does so without the need for any post-processing or corrections, avoiding the use of fossil calibrations [50] or debiasing procedures [21, 28].

Posterior decoding of the HCGO alignment

Posterior decoding was then performed using the optimized parameters and setting nAB = nABC = 5. In order to get an understanding of the genome-wide variation of ILS and coalescent times, the resulting posterior probabilities were summarized in 100 kb windows along the chr1 region in three different ways. First, the mean posterior probability was calculated for each of the four possible topologies, by first summing the posteriors of all hidden states sharing the same topology for each site, and then averaging over all sites on the 100 kb window. The resulting probabilities were then used to calculate a proxy for ILS, by summing the probabilities of observing the ILS topologies (V2 and V3 in Fig 3). Second, using a similar procedure, the mean posterior probability was calculated for each of the six possible intervals for the first coalescent event, by first summing the posteriors of all hidden states sharing the same first coalescent interval for each site, and then averaging over all sites in the 100 kb window. As a proxy for the first coalescent time, integers from 1 to 6 were assigned to each interval in chronological order backwards in time, and a weighted mean of those integers was computed, where the weights were the mean posterior probabilities per window. Third, the same quantity as for the first coalescent was computed for the second coalescent.
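The three window summaries can be sketched as follows. This is an illustrative reimplementation rather than the code used for Fig 5B: it assumes the per-site posteriors have already been summed by topology (ordered V0, V1, V2, V3) or by coalescent interval (ordered from most recent to deepest), and all function names are hypothetical.

```python
def window_mean_posterior(per_site, window_size):
    """Average per-site probability vectors over non-overlapping windows.

    per_site[site][k] is the summed posterior of category k at that site.
    The last window may be shorter than window_size.
    """
    means = []
    for start in range(0, len(per_site), window_size):
        block = per_site[start:start + window_size]
        n_categories = len(block[0])
        means.append(
            [sum(row[k] for row in block) / len(block) for k in range(n_categories)]
        )
    return means


def ils_proxy(topology_means):
    """Sum of the mean posteriors of the ILS topologies V2 and V3."""
    return topology_means[2] + topology_means[3]


def coalescent_time_proxy(interval_means):
    """Posterior-weighted mean of interval ranks 1..n (1 = most recent)."""
    weighted = sum(rank * p for rank, p in enumerate(interval_means, start=1))
    return weighted / sum(interval_means)


# Three sites, windows of two sites, categories ordered V0, V1, V2, V3.
topology_posteriors = [
    [0.5, 0.1, 0.2, 0.2],
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
]
for window in window_mean_posterior(topology_posteriors, window_size=2):
    print(ils_proxy(window))
```

For the analysis described here, `window_size` would be 100,000 sites, and the same windowing would be applied to the interval posteriors before calling `coalescent_time_proxy`.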

After filtering outliers smaller than the 1st percentile and larger than the 99th percentile, these proxies were plotted as heatmaps (Fig 5B). The first coalescent and the ILS proportion show a very strong correlation (ρ = 0.979), likely reflecting that V2 and V3 can only happen in the common ancestor of all three species, so when ILS is present, coalescent times are generally deeper (and vice versa). This signal is also captured, although more weakly, by the correlation between the second coalescent and the ILS proportion (ρ = 0.483). This can be explained by knowing that, conditional on ILS, the second coalescent follows a convolution of exponentials of rates 3 and 1 [57], while, conditional on V0 (i.e., conditional on the first coalescent happening between speciation events), the second coalescent simply follows an exponential of rate 1. Thus, if more ILS is present in a certain window, then, on average, the second coalescent will tend to happen deeper in time.
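The difference in expected depth can be made explicit with a short derivation. It assumes that time is measured in coalescent units from the deeper speciation event backwards, and that $T_3 \sim \mathrm{Exp}(3)$ and $T_2 \sim \mathrm{Exp}(1)$ denote the independent waiting times while three and two lineages remain; the symbols $T_3$, $T_2$ and $T^{(2)}$ are introduced here for illustration only.

```latex
% Conditional on ILS: three lineages enter the common ancestor,
% so the second coalescent is the sum of two waiting times.
f_{T_3 + T_2}(t) = \frac{3 \cdot 1}{3 - 1}\left(e^{-t} - e^{-3t}\right)
                 = \frac{3}{2}\left(e^{-t} - e^{-3t}\right), \qquad t \ge 0,

\mathbb{E}\left[T^{(2)} \mid \mathrm{ILS}\right] = \frac{1}{3} + 1 = \frac{4}{3}.

% Conditional on V0: only two lineages enter the common ancestor.
\mathbb{E}\left[T^{(2)} \mid V_0\right] = 1.
```

Under these assumptions the second coalescent is, on average, one third of a coalescent unit deeper in ILS regions, which is in line with the positive but weaker correlation reported above.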

Fig 5B also shows how, at the 100 kb level, the genome displays spatial covariation in the amount of ILS and the time to coalescence that exceeds stochastic effects of a neutral coalescent process. This is in line with previous results [28], where ILS proportions are affected by genomic features such as gene density, recombination rate, and the effects of linked selection.

A zoomed-in region of around 41 kb is shown in Fig 5C, which shows a long fragment of the ((chimp, gorilla), human) topology (V3). This fragment is unusually long, spanning 6,800 bp, and it is highly implausible following the demographic model inferred by TRAILS. Thanks to the posterior decoding of the coalescent times performed by TRAILS, we can observe that, for this fragment, the first coalescent event backwards in time happens close to the second speciation time (in interval F5), while the second coalescent event happens in the deepest time interval (S4).

One explanation for such a long V3 fragment is that it might be influenced by selection, which would maintain the alternative topology uninterrupted for a long period of time. Another explanation could be that this fragment is introgressed, especially given that the first coalescent event is shallow and that the fragment is long. For comparison, another region in chromosome 1, which also shows an excess of V3 topology, has a much more variable distribution of coalescent times, and it is more fragmented (Fig Q in S1 Text). Such detailed information about the timing of coalescent events is only possible thanks to the time discretization of TRAILS, and these details would have been missed, for example, in the posterior decoding of CoalHMM (recall Fig 1C).

The V3 fragment in Fig 5C overlaps with the last exon of the LDLRAD1 gene, which codes for a lipoprotein receptor (UniProt: Q5T700). LDLRAD1 does not show signals of positive selection in hominids (based on dN/dS values from Rivas-González et al. [28]). Additionally, this gene is not particularly constrained in primates, as measured by PhastCons [58] and PhyloP [59], and it is not enriched for repeat elements, as retrieved from the UCSC Genome Browser [60]. While we were unable to point out a specific cause for the pattern observed for LDLRAD1, Fig 5C showcases how TRAILS can be used to infer the topology and coalescent times of protein-coding genes at the base-pair level across millions of years of evolution. Comparing the posterior decoding with genomic covariates can reveal selective processes affecting the sorting of lineages [28] or solve cases of phenotypic hemiplasy [34].

Discussion

Coalescent-based approaches for analyzing genomic data are essential tools for understanding the ancestral history of species. Here, we have introduced TRAILS, an HMM that models the topology and the two coalescent events for gene genealogies within a phylogeny of three species. TRAILS can accurately infer population genetic parameters (ancestral Ne, speciation times and recombination rate). From the posterior decoding, the three-species ARG can be inferred at the base-pair level, providing insight into the ancestral history of the species at high resolution. Deviations from neutrality can be detected by summarizing the posterior decoding in windows and running genomic scans to find excess of coalescents happening at certain time intervals, such as proposed in Fig 4. Moreover, more coarse-grained summaries of the posterior decoding spanning several kilobases could be used to infer the genome-wide variation of ILS or coalescent times, potentially revealing correlations with other genomic features such as variation in the recombination rate or selection [28, 61].

As demonstrated here, the posterior decoding from TRAILS is a powerful way to infer details of the ARG in the context of speciation, together with departures from the neutral expectation. Recurrent selective sweeps that have happened during the speciation process are hypothesized to be drivers of speciation, and to greatly influence the genealogical landscape of present-day genomes. For example, the human X chromosome contains long haplotypes shared across all non-African populations [62], spanning large genomic regions that are both lacking Neanderthal introgression [63], and showing very low rates of ILS in the human-chimp ancestor [64]. This suggests that the X chromosome has a unique evolutionary history which is greatly affected by gene flow (or lack thereof), and that these low-diversity regions might be related to genetic incompatibilities that arose during the speciation of ancestral hominids. TRAILS can help locate these ancient sweeps and infer when they occurred, potentially illuminating when and how genomes were affected by selection during the speciation process.

Using posterior decoding, regions that show unusually high levels of an alternative topology with very shallow coalescents can also be detected, which could indicate ancient introgression or hybridization events happening between ancestral branches of the species tree. Such ancient introgression events have been reported to be pervasive among some branches in the primate species tree [50], although they can be difficult to distinguish from ILS [65] unless explicitly modelled. TRAILS could be extended to model introgression more directly by including additional hidden states representing introgressed genomic fragments. These would have exceedingly short coalescent times compared to the deep coalescent ILS states [28], and TRAILS provides the mathematical framework to distinguish between these two cases.

TRAILS could also be extended to accommodate variation in Ne along individual ancestral branches in the species tree, conceptually very similar to what is done in PSMC analyses from a single extant genome [8]. Modelling variation in Ne can elucidate how speciation events might have happened. For example, population sizes that remain roughly constant through the speciation event might indicate a cleaner split, while an increased ancestral Ne just prior to the estimated time of speciation (here equated with the total cessation of gene flow) might point to a period of elevated population structure and a prolonged species separation with migration [27]. Modelling changes in the demography around speciation events might also help us detect and characterize instances of complex speciation, as proposed, for example, by Patterson et al. [49].

The current implementation of TRAILS for calculating the transition probability matrix of the HMM is restricted to three species and a relatively small number of hidden states (see Fig P and section 9 in S1 Text for a discussion on the running times). With more efficient algorithms, future extensions of TRAILS could be devised to analyze more than three species, thus allowing for the inference of the speciation tree and the multi-species ARG for more taxa. Based on the parameters estimated by TRAILS for the HCGO alignment (Fig 5A), the proportion of ILS between humans (or chimps), gorillas and orangutans would be around 13%. While this violates one of the assumptions of TRAILS, namely that ILS between the outgroup and the rest of the analyzed species should be negligible, it also showcases the need for models that are able to accommodate more species (see subsection 2.4 in S1 Text). This could help us resolve more complex patterns of ILS, which include phenomena such as anomaly zones [36, 66, 67].

Methods

The transition probabilities between the hidden states of TRAILS can be calculated from a series of interconnected continuous-time Markov chains (CTMCs) that model the coalescent with recombination of two contiguous nucleotides for one, two or three sequences. The CTMCs are parameterized by the ancestral Ne, speciation times and recombination rate. The transitions for TRAILS are subsequently calculated by conditioning the CTMCs on the topology and coalescent times of the gene trees at those two sites, binning coalescent events into discretized time intervals along the speciation process. Additionally, the emission probabilities for each hidden state are calculated from a CTMC of the mutational process by choosing a certain substitution model. In this section, we provide a summary of the model, and the full explanation can be consulted in S1 Text.

Continuous-time Markov chains for the ancestral process

The coalescent with recombination between two sites can be approximated as a continuous-time Markov chain (CTMC). For one sequence, the left and the right sites can be either linked or unlinked, so there are only two possible states for the CTMC. Two linked sites become unlinked when a recombination event happens between them, which happens with a rate of ρ1. On the other hand, the unlinked left and the right sites become linked when a coalescent event happens between them, with a rate of γ1. These two transitions can be gathered in a transition rate matrix

$$Q_1 = \begin{pmatrix} -\gamma_1 & \gamma_1 \\ \rho_1 & -\rho_1 \end{pmatrix}. \tag{1}$$

From this transition rate matrix, we can calculate the probability matrix PA as exp(tQ1), which gives the probability of the sites being unlinked or linked at a certain time t given that the chain starts in the unlinked state (first row) or the linked state (second row).
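The matrix exponential above can be evaluated numerically. The following is a minimal sketch, not the TRAILS implementation: the rate values are hypothetical and chosen only for illustration, and the state order (unlinked first, linked second) follows Eq (1).

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical rates, chosen only for illustration (not estimates from the paper).
gamma_1 = 1.0   # coalescence rate: unlinked -> linked
rho_1 = 0.5     # recombination rate: linked -> unlinked

# Rate matrix of Eq (1); state 0 = unlinked, state 1 = linked.
Q1 = np.array([[-gamma_1, gamma_1],
               [rho_1, -rho_1]])

def transition_probabilities(t):
    """Return exp(t * Q1), the 2x2 matrix of state probabilities after time t."""
    return expm(t * Q1)

P = transition_probabilities(0.3)

# A chain that starts fully linked has the starting vector (0, 1).
start = np.array([0.0, 1.0])
state_probs = start @ P  # probabilities of (unlinked, linked) at time t
```

Each row of `P` sums to 1, and `state_probs` is simply the second row of `P`, because the chain starts in the linked state.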

When two sequences are involved, the state-space of the CTMC becomes more complex. Apart from the coalescent and the recombination events described above, sites can also coalesce irreversibly backwards in time with rate γ2, which happens when two left (or two right) sites from two different sequences find common ancestry. The resulting rate matrix (Fig E in S1 Text) for the coalescent with recombination with two sequences then corresponds to a CTMC with 15 states (Fig D in S1 Text), which was originally described by Simonsen and Churchill [68]. Due to these irreversible coalescent events, the rate matrix has a block-like structure, and it contains sets of states in which sequences can freely recombine and coalesce until an irreversible coalescent event occurs. Ultimately, both the left and the right sites will have irreversibly coalesced, reaching one of two absorbing states. Note that the matrix is quite sparse, since most of the transitions are not allowed.

Following a similar reasoning, the coalescent with recombination for three sequences can also be modelled as a CTMC. In this case, both the left and the right site will eventually undergo two irreversible coalescent events, which can potentially happen in any order between the three sequences. This creates 203 possible states (Fig G in S1 Text), the transitions of which can also be gathered in a rate matrix (Fig H in S1 Text). This matrix also has a block-like structure, and, given enough time, states will transition into one of the two absorbing states.

If all three sequences belonged to the same species, a three-sequence CTMC would be sufficient to model the coalescent with recombination. However, the sequences belong to three different species, so the speciation process has to be overlaid on top. Consequently, the coalescent with recombination along the speciation tree is modelled as a series of interconnected CTMCs.

Because sequences are sampled in present time, the left and the right sites are fully linked at time 0, meaning that the starting probability vector for the one-sequence CTMCs is (0, 1). Backwards in time, each of the sequences will remain isolated for a certain period of time in which the two sites can recombine and coalesce freely. The sequences for species A and B will remain isolated until the first speciation event at time tA and tB, respectively. Then, the final probabilities of the one-sequence CTMCs for A and B are merged to create the initial probabilities for the two-sequence CTMC. After a certain time, where the two sequences are allowed to coalesce and recombine, the final probabilities for the two-sequence CTMC and the final probabilities for the one-sequence CTMC of species C will be merged, thus creating the starting probabilities for the three-sequence CTMC. Finally, given enough time, all sequences for both the right and the left site will eventually coalesce into one of the two absorbing states of the last CTMC.

Transition probabilities of the HMM

The hidden states of TRAILS are genealogies which include a topology and two coalescent events that can happen within discretized time intervals. The breakpoints of the time intervals can thus be used to transform the CTMCs into a discrete-time Markov chain (DTMC). First, the joint probability of observing the genealogies at the left and the right loci can be computed by careful bookkeeping of the appropriate paths within the CTMC, defined by the corresponding genealogies and the discretized time intervals. The transition probability matrix of the DTMC (and the HMM) can then be obtained upon dividing the joint probability by the discretized marginals. A detailed description of these derivations is given in S1 Text.
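The final step, dividing the joint probabilities by the marginals, can be illustrated with a toy example. This is only a sketch under stated assumptions: the 3-state joint matrix below is made up, whereas in TRAILS the joint probabilities come from the path bookkeeping described in S1 Text.

```python
import numpy as np

# Hypothetical joint probabilities for 3 hidden states: entry (i, j) is the
# probability that the left site is in state i and the right site in state j.
joint = np.array([[0.30, 0.04, 0.01],
                  [0.04, 0.35, 0.02],
                  [0.01, 0.02, 0.21]])

# Marginal probability of each hidden state at the left site.
marginal = joint.sum(axis=1)

# Conditioning on the left state gives the row-stochastic transition matrix.
transition = joint / marginal[:, None]
```

Because this toy joint matrix is symmetric, the marginals are also the stationary distribution of the resulting transition matrix.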

Emission probabilities of the HMM

For each hidden state, the emission probabilities are calculated using the Jukes-Cantor mutational model [69]. Instead of calculating the emitted nucleotides for the three species only, TRAILS also includes the nucleotides emitted by an outgroup, which provides essential information about the ancestral state in each site. This additional species must have a sufficient divergence with the rest of the species such that ILS can be neglected between them.
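As a building block for such emission probabilities, the Jukes-Cantor substitution probabilities along a single branch have a simple closed form. The sketch below shows only that building block, with the branch length measured in expected substitutions per site; the function name is hypothetical, and how the branches of a genealogy are combined into an emission probability is described in S1 Text and not reproduced here.

```python
import numpy as np

def jukes_cantor_matrix(branch_length):
    """4x4 substitution probabilities under the Jukes-Cantor model.

    branch_length is the expected number of substitutions per site, which is
    the natural unit when the mutation rate is fixed to 1.
    """
    decay = np.exp(-4.0 * branch_length / 3.0)
    p_same = 0.25 + 0.75 * decay   # nucleotide unchanged at the end of the branch
    p_diff = 0.25 - 0.25 * decay   # changed to one specific other nucleotide
    P = np.full((4, 4), p_diff)
    np.fill_diagonal(P, p_same)
    return P

P_short = jukes_cantor_matrix(0.01)
```

For a branch of length 0 the matrix is the identity, and for very long branches every entry approaches 1/4.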

Parameterization

The transition and emission probabilities are parameterized by the speciation times (tA, tB, tC, t2, tupper), the effective population sizes (NAB, NABC), and the recombination rate (Fig 2A). Implicitly, TRAILS is also parameterized by the mutation rate, but this cannot be jointly inferred with the rest of the parameters because the parameter values can be scaled by any factor and still produce the same coalescent model [57, 70]. Instead, the mutation rate in the model is fixed to 1, and all other parameters are rescaled appropriately. The resulting units for the speciation times are number of generations multiplied by the mutation rate, and, similarly, the effective population sizes are number of individuals times the mutation rate. Accordingly, the recombination rate is divided by the mutation rate, so the optimized parameter is the ratio between the recombination and the mutation rate. After estimating the parameters, parameter values with more interpretable units can be obtained by choosing an appropriate mutation rate.
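The rescaling can be undone once a mutation rate is chosen. The following minimal sketch assumes a mutation rate of 1.5 × 10^-8 per site per generation and uses the simulation values from the Simulations section as an example; the dictionary keys and the function name are hypothetical and not part of the TRAILS package.

```python
def to_natural_units(scaled, mutation_rate):
    """Convert mutation-scaled parameters to more interpretable units.

    Times and population sizes in `scaled` are multiplied by the mutation
    rate, and the recombination rate is divided by it.
    """
    natural = {}
    for name, value in scaled.items():
        if name == "rho_over_mu":
            natural["rho"] = value * mutation_rate
        else:
            natural[name] = value / mutation_rate
    return natural

mu = 1.5e-8  # assumed mutation rate per site per generation
scaled = {"t_1": 200_000 * mu, "N_AB": 80_000 * mu, "rho_over_mu": 0.5e-8 / mu}
natural = to_natural_units(scaled, mu)
```

Here `natural` recovers 200,000 generations, 80,000 individuals and a recombination rate of 0.5 × 10^-8, up to floating-point rounding.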

TRAILS allows for two different parameterizations, namely the ultrametric model and the non-ultrametric model. In the ultrametric model, all sequences are sampled at time 0, so the molecular clock is assumed (tA = tB = tC − t2 = t1). Instead, in the non-ultrametric model, each sequence (A, B and C) is allowed to be sampled at a different time. This is useful to model deviations from the molecular clock, for example, when the number of generations for each of the species from present time until the speciation event is different, or the mutation rate varies between the species.

Simulations

Parameter estimation

Simulations to validate the model were performed in msprime [29]. The underlying demographic model follows a speciation tree with four species, namely A, B, C and D. The time from present to the first speciation event was set to t1 = 200,000 generations. The (haploid) ancestral Ne for the time between speciation events was set to NAB = 80,000. In order to keep an ILS proportion of 32%, the time between the first and the second speciation events was set to $t_2 = -N_{AB}\log\left(\tfrac{3}{2} \times 0.32\right) = 25{,}501$ generations. The (haploid) ancestral Ne earlier than the second speciation event was set to NABC = 70,000, and the time between the second speciation event and the speciation event with the outgroup was set to t3 = 1,000,000 generations. The recombination rate was set to $\rho = 0.5 \times 10^{-8}$ per site per generation. The tree was kept ultrametric in number of generations, meaning that all species were sampled at generation 0, so tA = tB = t1, tC = t1 + t2, and tD = t1 + t2 + t3. Mutations were then added on top of the simulated genealogies according to the Jukes-Cantor model [69] with a mutation rate of $\mu = 1.5 \times 10^{-8}$ per generation per site.

In order to investigate how the number of intervals in the AB-ancestor (nAB) and the ABC-ancestor (nABC) affect the parameter estimation, twenty 10-Mb alignments were simulated using msprime, and then TRAILS was run to estimate the demographic parameters for nAB = nABC = 1 and for nAB = nABC = 5, using the bound-constrained Nelder-Mead algorithm. The starting parameters of the optimization were randomly drawn from a normal distribution centered on the true value and with a standard deviation of the true value divided by 5. Convergence was achieved at around 150 iterations, with a runtime of around 10 hours for nAB = nABC = 5 per 10-Mb region (see section 9 in S1 Text for further details on the runtime of the model).

Posterior probability

The demographic model described above was also used to generate a 100 kb alignment to perform posterior decoding with the true parameters fixed. In TRAILS, the default way of dividing up the coalescent space into discretized time intervals is by taking quantiles of a truncated exponential of rate 1 (measured in units of NAB) for the time between speciation events, while the time previous to the earliest speciation event is divided following the quantiles of an exponential of rate 1 (measured in units of NABC). This scheme is appropriate for computing the posterior decoding of the first coalescent event when it happens between speciation events, since it is expected to follow exactly a truncated exponential of rate 1 according to the standard coalescent. However, if the first coalescent happens deep in time, then the coalescent times will follow an exponential with rate 3, since there are 3 lineages present. This means that most coalescent events happen very fast, and there would be an overrepresentation of coalescents in the first interval if the default cutpoint scheme is used.

TRAILS is, however, not restricted to a specific discretization, and it can compute the transition and emission probabilities, and, thus, perform posterior decoding, for user-specified intervals. For the first coalescent in deep time, posterior decoding was performed using cutpoints from the quantiles of an exponential with rate 3, with nABC = 7 and the true parameters from the msprime simulation.
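The cutpoint schemes described above follow from inverting the exponential distribution function. The sketch below is an illustration rather than the TRAILS code: the function names are hypothetical, times are assumed to be in coalescent units (generations divided by the relevant Ne), and the value of t2 is taken from the simulation setup.

```python
import numpy as np

def exponential_cutpoints(n_intervals, rate=1.0):
    """Interior cutpoints splitting an exponential into equal-probability bins."""
    q = np.arange(1, n_intervals) / n_intervals
    return -np.log(1.0 - q) / rate

def truncated_exponential_cutpoints(n_intervals, upper, rate=1.0):
    """Same, for an exponential truncated to the interval [0, upper]."""
    q = np.arange(1, n_intervals) / n_intervals
    mass = 1.0 - np.exp(-rate * upper)  # probability below the truncation point
    return -np.log(1.0 - q * mass) / rate

# Time between speciation events in units of N_AB (simulation values from the text).
t_2 = 25_501 / 80_000
cuts_ab = truncated_exponential_cutpoints(5, upper=t_2)   # n_AB = 5
cuts_abc_default = exponential_cutpoints(7, rate=1.0)     # default scheme
cuts_abc_deep = exponential_cutpoints(7, rate=3.0)        # first coalescent in deep time
```

With rate 3 the cutpoints are exactly one third of the default ones, which spreads the fast first coalescent events evenly across the intervals instead of concentrating them in the first one.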

Moreover, the second coalescent event will follow a mixture of an exponential with a rate of 1 (for the V0 states) and a convolution of two exponentials with rates 3 and 1 (for the deep coalescent states) [57], which will happen with probability 1 − exp(−t2/NAB) and exp(−t2/NAB), respectively. Here, t2 is the time between speciation events in number of generations, and NAB is the effective population size. The second coalescent time can thus be represented as a phase-type distribution [71], with sub-intensity matrix S and initial probability vector π, such that

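$$S = \begin{pmatrix} -1 & 0 & 0 \\ 0 & -3 & 3 \\ 0 & 0 & -1 \end{pmatrix}$$

and

$$\pi = \left(1 - \exp\left(-\frac{t_2}{N_{AB}}\right),\ \exp\left(-\frac{t_2}{N_{AB}}\right),\ 0\right).$$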

Therefore, posterior decoding for the second coalescent was performed using the quantiles of this phase-type distribution using PhaseTypeR [72], with nAB = 5, nABC = 7, and the true parameters used to generate the msprime simulation.
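The analysis in the paper used PhaseTypeR, an R package. As a rough Python stand-in, the quantiles of this phase-type distribution can be obtained by inverting its distribution function F(t) = 1 − π exp(St) 1 numerically. The sketch below assumes times in coalescent units, uses the simulation values for t2 and NAB, and picks an illustrative number of intervals; the function names are hypothetical.

```python
import numpy as np
from scipy.linalg import expm
from scipy.optimize import brentq

# Sub-intensity matrix and initial vector from the text; t_2 / N_AB uses the
# simulation values (25,501 generations and N_AB = 80,000).
p_deep = np.exp(-25_501 / 80_000)
S = np.array([[-1.0, 0.0, 0.0],
              [0.0, -3.0, 3.0],
              [0.0, 0.0, -1.0]])
pi = np.array([1.0 - p_deep, p_deep, 0.0])

def phase_type_cdf(t):
    """F(t) = 1 - pi * exp(S t) * 1 for a phase-type distribution."""
    return 1.0 - pi @ expm(S * t) @ np.ones(3)

def phase_type_quantile(q, upper=100.0):
    """Invert the CDF numerically by root finding on [0, upper]."""
    return brentq(lambda t: phase_type_cdf(t) - q, 0.0, upper)

n_intervals = 7  # illustrative choice
cutpoints = [phase_type_quantile(k / n_intervals) for k in range(1, n_intervals)]
```

The resulting `cutpoints` split the second coalescent time into intervals of equal prior probability.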

Selection

Using the same demographic model, a 200-kb alignment was simulated using SLiM [30], assuming a single positively selected variant in the middle of the region, with population-scaled selection parameter 2Nes = 175. Since SLiM is a forward simulator and runs much slower than backward simulators such as msprime, all the demographic parameters of the model were rescaled by a factor of 200 in order to increase computational speed. Posterior decoding was performed on the resulting simulated alignment using the same discretization scheme as described above, and the resulting posterior probabilities are plotted in Fig 4A.

To showcase how the posterior of TRAILS can be used as a test to detect deviations from neutrality, SLiM was used to generate twenty 200-kb alignments with and without a selected variant, and TRAILS was run afterward to calculate the posterior probability for the second coalescent time in 7 discretized intervals. The signal of the posterior decoding was summarized as the mean posterior probability for each discretized time interval, plotted in Fig 4B and 4C.
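The summary statistic described here is a per-interval average over sites. The sketch below illustrates it on a randomly generated stand-in for a posterior decoding; the array layout (intervals in rows, sites in columns) is an assumption, and real values would come from TRAILS.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a posterior decoding: 7 time intervals x 1,000 sites, with the
# probabilities at each site summing to 1. Real values would come from TRAILS.
n_intervals, n_sites = 7, 1_000
posterior = rng.dirichlet(np.ones(n_intervals), size=n_sites).T

# Mean posterior probability per interval across the whole region.
mean_posterior = posterior.mean(axis=1)
```

Comparing `mean_posterior` between regions with and without a selected variant gives one value per time interval for each region.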

Real data

The chromosome 1 multiz alignment of 30 mammalian species (27 primates) was downloaded from the UCSC Genome Browser database in MAF format. Using MafFilter [43], the species of interest were filtered (human, chimp, gorilla and orangutan), syntenic blocks separated by 200 nucleotides or less were merged using human as a reference, and blocks smaller than 2,000 bp were filtered out. The resulting filtered MAF was used as input for TRAILS, using the parameters estimated in Rivas-González et al. [28] as starting values. The optimization was performed with nAB = nABC = 3 using the bound-constrained L-BFGS-B algorithm as implemented in scipy [45, 46, 73]. To get a more accurate parameter estimation, the optimized estimates were used as starting values for a second TRAILS run where nABC = 5, optimized using a bound-constrained Nelder-Mead algorithm [74, 75], which showed better convergence for already-optimized TRAILS runs.
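The two-stage optimization can be sketched with SciPy's `minimize`. This is an illustration under stated assumptions, not the TRAILS fitting code: the objective is a toy stand-in for the negative log-likelihood, and the bounds and starting values are hypothetical. In the actual analysis the second stage also uses a finer discretization (nABC = 5), which a single toy objective cannot reproduce.

```python
import numpy as np
from scipy.optimize import minimize

def toy_objective(x):
    """Stand-in for the negative log-likelihood; its minimum is at x = 1."""
    return float(np.sum((x - 1.0) ** 2))

bounds = [(0.1, 10.0)] * 4          # hypothetical bounds on rescaled parameters
start = np.array([2.0, 0.5, 3.0, 0.8])

# Stage 1: gradient-based, bound-constrained search.
coarse = minimize(toy_objective, start, method="L-BFGS-B", bounds=bounds)

# Stage 2: derivative-free refinement started from the stage-1 optimum.
# Passing bounds to Nelder-Mead requires SciPy 1.7 or later.
refined = minimize(toy_objective, coarse.x, method="Nelder-Mead", bounds=bounds)
```

Starting the derivative-free stage from an already good estimate keeps its simplex small, which is one plausible reason for the better convergence reported above.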

Confidence intervals for the estimated parameters were computed using parametric bootstrapping. Twenty replicates of 50-Mb regions were simulated from the model fitted with the estimated parameters. Afterward, TRAILS was run on the simulated regions to get optimized parameters. For each parameter, a normal distribution was fitted to the 20 replicate estimates, and the 95% confidence intervals were calculated from the fitted normal (Fig R and Table B in S1 Text).
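The last step of this procedure, fitting a normal distribution to the replicate estimates and reading off its central 95% interval, can be sketched as follows. The replicate values below are randomly generated placeholders, not results from the paper; in practice each value would be a TRAILS estimate from one simulated region.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Stand-in for one parameter re-estimated on 20 simulated replicates.
replicate_estimates = rng.normal(loc=80_000, scale=2_000, size=20)

# Fit a normal distribution (maximum likelihood) and take its central 95% interval.
mean, sd = norm.fit(replicate_estimates)
ci_low, ci_high = norm.interval(0.95, loc=mean, scale=sd)
```

The interval equals the fitted mean plus or minus about 1.96 fitted standard deviations.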

Supporting information

S1 Text. Supplementary notes, including Figs A to S, and Tables A and B.

S1 Text contains a detailed description of the theoretical framework and implementation of TRAILS, together with supplementary analyses.

(PDF)


Acknowledgments

We gratefully acknowledge Julien Dutheil and Nick Patterson for useful discussions on the implementation of the model. We also thank GenomeDK for providing the computational resources for performing the analyses.

Data Availability

The python package for TRAILS can be downloaded and installed from pip (https://pypi.org/project/trails-rivasiker/), and the source code can be browsed at https://github.com/rivasiker/trails. The code for reproducing the figures in the manuscript can be found at https://github.com/rivasiker/trails_paper.

Funding Statement

This work was supported by the Novo Nordisk Foundation (NNF18OC0031004 to MHS) and the Independent Research Fund Denmark, Natural Sciences (6108-00385 to MHS). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Griffiths RC. The Two-Locus Ancestral Graph. Lecture Notes-Monograph Series. 1991;18:100–117. doi: 10.1214/lnms/1215459289
  • 2. Griffiths RC, Marjoram P. An ancestral recombination graph. In: Donnelly P, Tavaré S, editors. Progress in Population Genetics and Human Evolution (IMA Volumes in Mathematics and its Applications, vol. 87). New York: Springer-Verlag; 1997. p. 257–270.
  • 3. Hubisz M, Siepel A. In: Dutheil JY, editor. Inference of Ancestral Recombination Graphs Using ARGweaver. New York, NY: Springer US; 2020. p. 231–266.
  • 4. Wiuf C, Hein J. Recombination as a point process along sequences. Theoretical Population Biology. 1999;55(3):248–259. doi: 10.1006/tpbi.1998.1403
  • 5. McVean GA, Cardin NJ. Approximating the coalescent with recombination. Philosophical Transactions of the Royal Society B: Biological Sciences. 2005;360(1459):1387–1393. doi: 10.1098/rstb.2005.1673
  • 6. Marjoram P, Wall JD. Fast “coalescent” simulation. BMC Genetics. 2006;7:1–9. doi: 10.1186/1471-2156-7-16
  • 7. Wilton PR, Carmi S, Hobolth A. The SMC′ is a highly accurate approximation to the ancestral recombination graph. Genetics. 2015;200(1):343–355. doi: 10.1534/genetics.114.173898
  • 8. Li H, Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475(7357):493–496. doi: 10.1038/nature10231
  • 9. Schiffels S, Durbin R. Inferring human population size and separation history from multiple genome sequences. Nature Genetics. 2014;46(8):919–925. doi: 10.1038/ng.3015
  • 10. Malaspinas AS, Westaway MC, Muller C, Sousa VC, Lao O, Alves I, et al. A genomic history of Aboriginal Australia. Nature. 2016;538(7624):207–214. doi: 10.1038/nature18299
  • 11. Palamara PF, Terhorst J, Song YS, Price AL. High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability. Nature Genetics. 2018;50(9):1311–1317. doi: 10.1038/s41588-018-0177-x
  • 12. Terhorst J, Kamm JA, Song YS. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nature Genetics. 2017;49(2):303–309. doi: 10.1038/ng.3748
  • 13. Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. Genome-wide inference of ancestral recombination graphs. PLoS Genetics. 2014;10(5):e1004342. doi: 10.1371/journal.pgen.1004342
  • 14. Speidel L, Forest M, Shi S, Myers SR. A method for genome-wide genealogy estimation for thousands of samples. Nature Genetics. 2019;51(9):1321–1329. doi: 10.1038/s41588-019-0484-x
  • 15. Kelleher J, Wong Y, Wohns AW, Fadil C, Albers PK, McVean G. Inferring whole-genome histories in large population datasets. Nature Genetics. 2019;51(9):1330–1338. doi: 10.1038/s41588-019-0483-y
  • 16. Wohns AW, Wong Y, Jeffery B, Akbari A, Mallick S, Pinhasi R, et al. A unified genealogy of modern and ancient genomes. Science. 2022;375(6583):eabi8264. doi: 10.1126/science.abi8264
  • 17. Zhang BC, Biddanda A, Gunnarsson ÁF, Cooper F, Palamara PF. Biobank-scale inference of ancestral recombination graphs enables genealogical analysis of complex traits. Nature Genetics. 2023; p. 1–9. doi: 10.1038/s41588-023-01379-x
  • 18. Brandt DYC, Wei X, Deng Y, Vaughn AH, Nielsen R. Evaluation of methods for estimating coalescence times using ancestral recombination graphs. Genetics. 2022;221(1):iyac044. doi: 10.1093/genetics/iyac044
  • 19. Mailund T, Dutheil JY, Hobolth A, Lunter G, Schierup MH. Estimating divergence time and ancestral effective population size of Bornean and Sumatran orangutan subspecies using a coalescent hidden Markov model. PLoS Genetics. 2011;7(3):e1001319. doi: 10.1371/journal.pgen.1001319
  • 20. Hobolth A, Christensen OF, Mailund T, Schierup MH. Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genetics. 2007;3(2):e7. doi: 10.1371/journal.pgen.0030007
  • 21. Dutheil JY, Ganapathy G, Hobolth A, Mailund T, Uyenoyama MK, Schierup MH. Ancestral population genomics: the coalescent hidden Markov model approach. Genetics. 2009;183(1):259–274. doi: 10.1534/genetics.109.103010
  • 22. Degnan JH, Rosenberg NA. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends in Ecology & Evolution. 2009;24(6):332–340. doi: 10.1016/j.tree.2009.01.009
  • 23. Rannala B, Leache A, Edwards S, Yang Z. The multispecies coalescent model and species tree inference. In: Scornavacca C, Delsuc F, Galtier N, editors. Phylogenetics in the Genomic Era. Self Published; 2020. p. 3.3:1–3.3:21. Available from: https://inria.hal.science/PGE/hal-02535622.
  • 24. Mirarab S, Nakhleh L, Warnow T. Multispecies coalescent: theory and applications in phylogenetics. Annual Review of Ecology, Evolution, and Systematics. 2021;52:247–268. doi: 10.1146/annurev-ecolsys-012121-095340
  • 25. O’hUigin C, Satta Y, Takahata N, Klein J. Contribution of homoplasy and of ancestral polymorphism to the evolution of genes in anthropoid primates. Molecular Biology and Evolution. 2002;19(9):1501–1513. doi: 10.1093/oxfordjournals.molbev.a004213
  • 26. Wake DB, Wake MH, Specht CD. Homoplasy: from detecting pattern to determining process and mechanism of evolution. Science. 2011;331(6020):1032–1035. doi: 10.1126/science.1188545
  • 27. Mailund T, Halager AE, Westergaard M, Dutheil JY, Munch K, Andersen LN, et al. A new isolation with migration model along complete genomes infers very different divergence processes among closely related great ape species. PLoS Genetics. 2012;8(12):e1003125. doi: 10.1371/journal.pgen.1003125
  • 28. Rivas-González I, Rousselle M, Li F, Zhou L, Dutheil JY, Munch K, et al. Pervasive incomplete lineage sorting illuminates speciation and selection in primates. Science. 2023;380(6648):eabn4409. doi: 10.1126/science.abn4409
  • 29. Baumdicker F, Bisschop G, Goldstein D, Gower G, Ragsdale AP, Tsambos G, et al. Efficient ancestry and mutation simulation with msprime 1.0. Genetics. 2022;220(3):iyab229. doi: 10.1093/genetics/iyab229
  • 30. Haller BC, Messer PW. SLiM 4: Multispecies eco-evolutionary modeling. The American Naturalist. 2023;201(5):E000–E000. doi: 10.1086/723601
  • 31. Ségurel L, Bon C. On the evolution of lactase persistence in humans. Annual Review of Genomics and Human Genetics. 2017;18:297–319. doi: 10.1146/annurev-genom-091416-035340
  • 32. Hermisson J, Pennings PS. Soft sweeps: molecular population genetics of adaptation from standing genetic variation. Genetics. 2005;169(4):2335–2352. doi: 10.1534/genetics.104.036947
  • 33. Otto SP, Whitlock MC. Fixation Probabilities and Times. Encyclopedia of Life Sciences. 2013; p. 1–5.
  • 34. Feng S, Bai M, Rivas-González I, Li C, Liu S, Tong Y, et al. Incomplete lineage sorting and phenotypic evolution in marsupials. Cell. 2022;185(10):1646–1660. doi: 10.1016/j.cell.2022.03.034
  • 35. Suh A, Smeds L, Ellegren H. The dynamics of incomplete lineage sorting across the ancient adaptive radiation of neoavian birds. PLoS Biology. 2015;13(8):e1002224. doi: 10.1371/journal.pbio.1002224
  • 36. Cloutier A, Sackton TB, Grayson P, Clamp M, Baker AJ, Edwards SV. Whole-genome analyses resolve the phylogeny of flightless birds (Palaeognathae) in the presence of an empirical anomaly zone. Systematic Biology. 2019;68(6):937–955. doi: 10.1093/sysbio/syz019
  • 37. Alda F, Tagliacollo VA, Bernt MJ, Waltz BT, Ludt WB, Faircloth BC, et al. Resolving deep nodes in an ancient radiation of neotropical fishes in the presence of conflicting signals from incomplete lineage sorting. Systematic Biology. 2019;68(4):573–593. doi: 10.1093/sysbio/syy085
  • 38. Zhou Y, Duvaux L, Ren G, Zhang L, Savolainen O, Liu J. Importance of incomplete lineage sorting and introgression in the origin of shared genetic variation between two closely related pines with overlapping distributions. Heredity. 2017;118(3):211–220. doi: 10.1038/hdy.2016.72
  • 39. Wang K, Lenstra JA, Liu L, Hu Q, Ma T, Qiu Q, et al. Incomplete lineage sorting rather than hybridization explains the inconsistent phylogeny of the wisent. Communications Biology. 2018;1(1):169. doi: 10.1038/s42003-018-0176-6
  • 40. Scornavacca C, Galtier N. Incomplete lineage sorting in mammalian phylogenomics. Systematic Biology. 2017;66(1):112–120.
  • 41. Hobolth A, Dutheil JY, Hawks J, Schierup MH, Mailund T. Incomplete lineage sorting patterns among human, chimpanzee, and orangutan suggest recent orangutan speciation and widespread selection. Genome Research. 2011;21(3):349–356. doi: 10.1101/gr.114751.110
  • 42. Mailund T, Munch K, Schierup MH. Lineage sorting in apes. Annual Review of Genetics. 2014;48:519–535. doi: 10.1146/annurev-genet-120213-092532
  • 43. Dutheil JY, Gaillard S, Stukenbrock EH. MafFilter: a highly flexible and extensible multiple genome alignment files processor. BMC Genomics. 2014;15:1–10. doi: 10.1186/1471-2164-15-53
  • 44. Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–1423. doi: 10.1093/bioinformatics/btp163
  • 45. Byrd RH, Lu P, Nocedal J, Zhu C. A limited memory algorithm for bound constrained optimization. SIAM Journal on scientific computing. 1995;16(5):1190–1208. doi: 10.1137/0916069
  • 46. Zhu C, Byrd RH, Lu P, Nocedal J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS). 1997;23(4):550–560. doi: 10.1145/279232.279236
  • 47. Langergraber KE, Prüfer K, Rowney C, Boesch C, Crockford C, Fawcett K, et al. Generation times in wild chimpanzees and gorillas suggest earlier divergence times in great ape and human evolution. Proceedings of the National Academy of Sciences. 2012;109(39):15716–15721. doi: 10.1073/pnas.1211740109
  • 48. Wang RJ, Al-Saffar SI, Rogers J, Hahn MW. Human generation times across the past 250,000 years. Science Advances. 2023;9(1):eabm7047. doi: 10.1126/sciadv.abm7047
  • 49. Patterson N, Richter DJ, Gnerre S, Lander ES, Reich D. Genetic evidence for complex speciation of humans and chimpanzees. Nature. 2006;441(7097):1103–1108. doi: 10.1038/nature04789
  • 50. Vanderpool D, Minh BQ, Lanfear R, Hughes D, Murali S, Harris RA, et al. Primate phylogenomics uncovers multiple rapid radiations and ancient interspecific introgression. PLoS Biology. 2020;18(12):e3000954. doi: 10.1371/journal.pbio.3000954
  • 51. Scally A, Dutheil JY, Hillier LW, Jordan GE, Goodhead I, Herrero J, et al. Insights into hominid evolution from the gorilla genome sequence. Nature. 2012;483(7388):169–175. doi: 10.1038/nature10842
  • 52. Halldorsson BV, Palsson G, Stefansson OA, Jonsson H, Hardarson MT, Eggertsson HP, et al. Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science. 2019;363(6425):eaau1043. doi: 10.1126/science.aau1043
  • 53. Moorjani P, Amorim CEG, Arndt PF, Przeworski M. Variation in the molecular clock of primates. Proceedings of the National Academy of Sciences. 2016;113(38):10607–10612. doi: 10.1073/pnas.1600374113
  • 54. Besenbacher S, Hvilsom C, Marques-Bonet T, Mailund T, Schierup MH. Direct estimation of mutations in great apes reconciles phylogenetic dating. Nature Ecology & Evolution. 2019;3(2):286–292. doi: 10.1038/s41559-018-0778-x
  • 55. Thomas GW, Wang RJ, Puri A, Harris RA, Raveendran M, Hughes DS, et al. Reproductive longevity predicts mutation rates in primates. Current Biology. 2018;28(19):3193–3197. doi: 10.1016/j.cub.2018.08.050
  • 56. Bromham L. The genome as a life-history character: why rate of molecular evolution varies between mammal species. Philosophical Transactions of the Royal Society B: Biological Sciences. 2011;366(1577):2503–2513. doi: 10.1098/rstb.2011.0014
  • 57. Wakeley J. Coalescent Theory: An Introduction. 1st ed. Roberts & Company Publishers; 2008.
  • 58. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research. 2005;15(8):1034–1050. doi: 10.1101/gr.3715005
  • 59. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Research. 2010;20(1):110–121. doi: 10.1101/gr.097857.109
  • 60. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. The human genome browser at UCSC. Genome Research. 2002;12(6):996–1006. doi: 10.1101/gr.229102
  • 61. Pease JB, Hahn MW. More accurate phylogenies inferred from low-recombination regions in the presence of incomplete lineage sorting. Evolution. 2013;67(8):2376–2384. doi: 10.1111/evo.12118
  • 62. Skov L, Macia MC, Lucotte EA, Cavassim MIA, Castellano D, Schierup MH, et al. Extraordinary selection on the human X chromosome associated with archaic admixture. Cell Genomics. 2023;3(3). doi: 10.1016/j.xgen.2023.100274
  • 63. Sankararaman S, Mallick S, Dannemann M, Prüfer K, Kelso J, Pääbo S, et al. The genomic landscape of Neanderthal ancestry in present-day humans. Nature. 2014;507(7492):354–357. doi: 10.1038/nature12961
  • 64. Dutheil JY, Munch K, Nam K, Mailund T, Schierup MH. Strong selective sweeps on the X chromosome in the human-chimpanzee ancestor explain its low divergence. PLoS Genetics. 2015;11(8):e1005451. doi: 10.1371/journal.pgen.1005451
  • 65. Hibbins MS, Hahn MW. Phylogenomic approaches to detecting and characterizing introgression. Genetics. 2022;220(2):iyab173. doi: 10.1093/genetics/iyab173 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Degnan JH, Rosenberg NA. Discordance of species trees with their most likely gene trees. PLoS Genetics. 2006;2(5):e68. doi: 10.1371/journal.pgen.0020068 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67. Mendes FK, Hahn MW. Why concatenation fails near the anomaly zone. Systematic Biology. 2018;67(1):158–169. doi: 10.1093/sysbio/syx063 [DOI] [PubMed] [Google Scholar]
  • 68. Simonsen KL, Churchill GA. A Markov chain model of coalescence with recombination. Theoretical Population Biology. 1997;52(1):43–59. doi: 10.1006/tpbi.1997.1307 [DOI] [PubMed] [Google Scholar]
  • 69. Jukes TH, Cantor CR, et al. Evolution of protein molecules. Mammalian Protein Metabolism. 1969;3:21–132. doi: 10.1016/B978-1-4832-3211-9.50009-7 [DOI] [Google Scholar]
  • 70. Hein J, Schierup M, Wiuf C. Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford University Press, USA; 2004. [Google Scholar]
  • 71. Hobolth A, Siri-Jegousse A, Bladt M. Phase-type distributions in population genetics. Theoretical Population Biology. 2019;127:16–32. doi: 10.1016/j.tpb.2019.02.001 [DOI] [PubMed] [Google Scholar]
  • 72. Rivas-González I, Andersen LN, Hobolth A. PhaseTypeR: an R package for phase-type distributions in population genetics. Journal of Open Source Software. 2023;8(82):5054. doi: 10.21105/joss.05054 [DOI] [Google Scholar]
  • 73. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585(7825):357–362. doi: 10.1038/s41586-020-2649-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74. Nelder JA, Mead R. A simplex method for function minimization. The Computer Journal. 1965;7(4):308–313. doi: 10.1093/comjnl/7.4.308 [DOI] [Google Scholar]
  • 75. Gao F, Han L. Implementing the Nelder-Mead simplex algorithm with adaptive parameters. Computational Optimization and Applications. 2012;51(1):259–277. doi: 10.1007/s10589-010-9329-3 [DOI] [Google Scholar]

Decision Letter 0

Pier Francesco Palamara, Xiaofeng Zhu

26 Sep 2023

Dear Dr Rivas-González,

Thank you very much for submitting your Research Article entitled 'TRAILS: tree reconstruction of ancestry using incomplete lineage sorting' to PLOS Genetics.

 

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some concerns that we ask you address in a revised manuscript.

We therefore ask you to modify the manuscript according to the review recommendations. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Pier Francesco Palamara

Guest Editor

PLOS Genetics

Xiaofeng Zhu

Section Editor

PLOS Genetics

The reviewers agree that this is a valuable and well-presented approach for parameter inference in multi-species alignment data and provide some suggestions that could improve the manuscript. They agree on the need to provide additional details on the computational costs of TRAILS compared to previous approaches such as Coal-HMM and a discussion on the scalability to additional lineages or time intervals. Additional comments include suggestions for improving the presentation of results, as well as clarifying the effects of deviations from underlying assumptions on the absence of ILS with the outgroup and the role of admixture/introgression/selection.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Review of "TRAILS: tree reconstruction of ancestry using incomplete lineage sorting" by Iker Rivas-González, Mikkel H. Schierup, John Wakeley, and Asger Hobolth

In this manuscript, the authors present a novel Coalescent Hidden Markov Model framework TRAILS that is geared towards estimating parameters (speciation times & ancestral population sizes) describing the speciation history of closely related species. In particular, the authors present an application to the human-chimp-gorilla scenario, with orangutan as an outgroup. The method takes as input one genomic sequence for each extant species, including the outgroup. The framework implements the marginal genealogical tree relating the extant lineages at each locus as the hidden state space, where the observed nucleotides are the emissions, and the transition probabilities capture the correlation of genealogies at neighboring loci due to chromosomal linkage and recombination. The method can thus take advantage of incomplete-lineage-sorting (ILS) between the extant species to estimate the parameters.
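To make the HMM structure summarized above concrete, here is a minimal sketch of the scaled forward algorithm, which yields the likelihood that a parameter search would optimize. It is not the TRAILS implementation: the two hidden states, the binary "match/mismatch" emissions, and all numbers in the toy tables are hypothetical placeholders for the genealogy states and nucleotide columns described in the manuscript.

```python
import math
from itertools import product

def forward_loglik(init, trans, emit, obs):
    """Log-likelihood of obs under an HMM, via the scaled forward algorithm.

    init[k]     : probability of starting in hidden state k
    trans[j][k] : probability of moving from state j to state k between sites
    emit[k][o]  : probability of observing symbol o in hidden state k
    """
    n = len(init)
    alpha = [init[k] * emit[k][obs[0]] for k in range(n)]
    loglik = 0.0
    for o in obs[1:]:
        # Rescale to avoid underflow on long alignments; keep the log of the scale.
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
        # Propagate one site to the right, then weight by the emission.
        alpha = [sum(alpha[j] * trans[j][k] for j in range(n)) * emit[k][o]
                 for k in range(n)]
    return loglik + math.log(sum(alpha))

def brute_force_lik(init, trans, emit, obs):
    """Same likelihood by summing over every hidden path (tiny inputs only)."""
    total = 0.0
    for path in product(range(len(init)), repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for i in range(1, len(obs)):
            p *= trans[path[i - 1]][path[i]] * emit[path[i]][obs[i]]
        total += p
    return total

# Hypothetical toy model: state 0 = "shallow" genealogy, state 1 = "deep" genealogy;
# symbol 0 = sequences match at the site, symbol 1 = they differ.
INIT = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.2, 0.8]]
EMIT = [[0.99, 0.01], [0.90, 0.10]]
OBS = [0, 0, 1, 0, 1, 1, 0]
```

With dense transition tables, the work per site is quadratic in the number of hidden states and the total is linear in the number of sites, which is why the number of time intervals matters for runtime.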

With the novel method, the authors improve upon the method CoalHMM that had previously been presented for similar applications. The new method allows the coalescent events of ancestral lineages between speciation events to be classified into a finer number of discrete intervals, rather than just a single interval. The authors demonstrate that this improves estimation accuracy in simulated data. Applying their method to data from Chromosome 1, they were able to estimate parameter for the speciation times that are in agreement with the literature. In addition to parameter estimation, the authors demonstrate that their method can be used to obtain and inspect the posterior distribution of the marginal genealogies, which they show is useful to investigate ILS along the genome, identify potentially introgressed genomic regions, and identify candidates for adaptive genetic variation.

The manuscript is well written, presenting the novel method and its application clearly and accessibly. The method is well documented and provided as a software package for convenient use. It is thus a valuable addition to the toolkit of methods to characterize speciation events and related phenomena like ILS and introgression, as well as non-neutral dynamics. I do appreciate the detailed supplement and diagrams therein that make the presentation accessible to non-experts. In addition to some minor comments, I think it would be useful to add some additional simulations and analysis to further support the results and expand on some of the applications.

Particular points:

- p.6, Figure 2, Panel B: I do think that displaying the results here as bars that start at 0 is not helpful and unnecessary. It squeezes the whiskers that show the distribution around the mean, making it hard to compare some of them. My suggestion would be to just show a line for the mean of the estimates and whiskers, and the limits of the y-axis chosen to allow better comparison. Perhaps even points for each estimated value from a replicate. Related to this: This plot is based on 5 replicates. While general trends are definitely exposed by this, I do think a higher number of replicates would be better to reduce the random noise.

- p.8, Figure 4: This Figure demonstrates that TRAILS infers low coalescence times around a genetic variant under selection. I think it would be very helpful to supplement this with replicates of simulated 200kb neutral regions, and tally the distribution of the maximum posterior for those. While panel B) shows the theoretical expected value, and deviation from it, it is unclear how much the statistics vary around the mean under neutrality. Knowing this variance is necessary in empirical applications to assess significance of candidate regions. Thus, I think providing some sense of the variability would better exhibit the potential of the method for this application. Please mention here already that the interval boundaries are chosen such that the expected distribution is uniform.

- p.9, l.191-201: In this paragraph, the estimates of the split times are provided as ranges, perhaps confidence intervals. It is unclear how these ranges are computed, since the preceding details describe how to obtain point estimates. Please provide details on whether these ranges are confidence intervals and how they are obtained. Through bootstrapping of the data? Using curvature of the likelihood surface?

- p. 15, l. 357-262: While the details of the method are presented in the supplement, I think it would be good to provide a few more details about the computation of the transition probabilities in this section of the main text. Perhaps add a sentence like: "We can discretize the CTMC into a DTMC by evaluating it at the boundaries of the discretization intervals. This DTMC can then be used to compute discretized joint probabilities of the genealogies (hidden states) at the left and the right locus by considering the corresponding paths of the DTMC. The transition probabilities can then be obtained upon dividing by the discretized marginals."

- The runtime of the method is not mentioned for any of the analyses presented in the paper. To allow researchers interested in applying the method to judge the resources necessary to perform analyses, I think it is necessary to provide more details here. Please provide details on the runtime (and parallel architecture used) of analysing the simulated replicates for Figure 2B), estimating the parameters from the 50Mb HCGO-alignment presented in Figure 5A), and the posterior decoding presented in Figure 5B).

- In the supplement, the state spaces and transition rates for the CTMC with 1,2, and 3 lineages are presented. However, were these obtained by manually enumerating all of them, or is there some structure underlying these that was used by the authors to enumerate them? The reason for this question is that if these are enumerated by hand, extending the method to 4 or more lineages will be very unwieldy, whereas if some structure of the problem can be used, extensions might be less cumbersome. If the authors have some insight into the structure of the problem, and perhaps some more general formulas, please present these.

- The authors do present elegant approaches in the supplement to compute correct probabilities for the CTMC in cases where multiple coalescent events among the ingroup happen in the last "infinite" interval. However, it appears that t_upper is used as an upper bound for coalescent events among the ingroup when computing the emission probabilities. Is this correct? Are these transition and emission probabilities then combined in the HMM? While I do not think that this will majorly affect results if t_upper is large enough, I think this inconsistency should be highlighted (if it does exist).
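The discretization suggested in the point on transition probabilities above (evaluate the CTMC at interval boundaries, build a joint table for the left and right locus, divide by the marginals) can be illustrated with a deliberately small sketch. Everything here is an assumption made for illustration: the two-state chain is a single-locus coalescent for two lineages rather than the TRAILS state space, the cut points are arbitrary, and `toy_joint` is a hypothetical stand-in for the path-based joint probabilities described in the supplement.

```python
import math

def mat_mul(a, b):
    """Plain-Python matrix product for small square matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_exp(q, t, squarings=10, terms=20):
    """exp(t*q) by scaling and squaring with a truncated Taylor series."""
    n = len(q)
    scaled = [[q[i][j] * t / 2 ** squarings for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Toy CTMC: state 0 = two lineages, state 1 = coalesced (rate 1, absorbing).
Q = [[-1.0, 1.0], [0.0, 0.0]]
BOUNDARIES = [0.0, 0.5, 1.5]  # hypothetical cut points; the last interval is open-ended

def interval_marginals(q, cuts):
    """Probability that coalescence falls in each interval [cuts[k], cuts[k+1])."""
    not_coalesced = [mat_exp(q, t)[0][0] for t in cuts] + [0.0]
    return [not_coalesced[k] - not_coalesced[k + 1] for k in range(len(cuts))]

def toy_joint(marginals, redraw=0.1):
    """Hypothetical joint table: the right locus copies the left one unless redrawn."""
    n = len(marginals)
    return [[(1 - redraw) * marginals[l] * (l == r)
             + redraw * marginals[l] * marginals[r]
             for r in range(n)] for l in range(n)]

def transitions_from_joint(joint):
    """Divide each row of the joint table by its marginal to get P(right | left)."""
    return [[x / sum(row) for x in row] for row in joint]
```

In this toy construction the discretized marginals are stationary under the resulting transition matrix, which is a convenient sanity check for any joint table built this way.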

Minor points:

- p.5, l.112: a bound-constrained search algorithm that optimizes the likelihood function by evaluating it directly. [I think it would be good to state that no gradients or EM are computed.]

- p.5, l.132: Please clarify this statement. Why do these coalescent events cause underestimation?

- p.7, Figure 3, panel A: Please emphasize (perhaps in the caption) that 'first' and 'second' coalescence event refer to the order of events, thus it is possible that different extant lineages are coalescing at this 'first' or 'second' event at different loci.

- p.8, l.169: The signal observed in the posterior decoding can be summarized by comparing the proportion of sites with the maximum posterior probability in certain time intervals to the theoretical expectation.

- p.9, l.186: ..., choosing the parameter values estimated in Rivas-González as starting values ... [it is unclear what "this branch" refers to.]

- p.9, l.188: The supplement states that other algorithms are possible, so state this here too.

- p.9, l.192: Please provide a reference for the value g=25 years.

- p.10, l.225: Wouldn't it be more appropriate to have a time in years represent each interval and then take the weighted mean of that?

- p.16, l.404: It is stated here that the Nelder-Mead algorithm is used for optimization. Previously, it was stated that the L-BFGS-B algorithm is used. Please clarify.

- p.16, l.409: ... posterior decoding with the true parameters fixed.

Supplement:

- p.2, l.55: ... and sit in different lineages.

- p.4, Figure S3: Add to caption: "Grey indicates the diagonal entries, which are computed as the negative of the sum of the off-diagonal entries in the corresponding row."

- p.4, l.94: Please provide more details how the probabilities are mixed.

- p.6, l.112: ... point t using \pi_{ABC}' = \pi_{ABC} exp(tQ_{ABC}).

- p.5, l.115: Remove one period.

- p.8, l.125: ... two topologies is known as incomplete lineage sorting ...

- p.9, l.141: Why is the number given by the Bell number series? Please provide an explanation or a citation.

- p.11, l.181: ... two-sequence CTMC, and, later, that lineage coalesces with ...

- p.12, l.187: Additionally, if the first coalescent event does not happen between ...

- p.16, l.286: Does this need to be F(t) = e^{tQ}?

- p.16, l.293 (and following equations): I believe the order of the matrix exponentials has to be reversed? Each next step has to be multiplied from the right. Thus, e^{rQ} should be the leftmost exponential, followed by (s-r), followed by (t-s). Similar with most other equations in the following sections. I might also be wrong about this.

- p.17, l.325: ... we need to calculate infinite integrals of ...

- p.19, l.343: The states of the DTMC describe the marginal genealogical histories of the sequences. However, these states cannot be observed directly.

- p.21, l.378: Should the rate of the exponential be the inverse of the ancestral population size instead of 1?

- p.23, l.417: Pr(a_0) is not defined. Is it the stationary distribution of the mutation matrix?
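On the ordering question raised above for p.16, l.293, a short worked identity may help. It assumes the row-vector convention, in which a distribution is propagated by multiplying with the matrix exponential from the right, and time points 0 ≤ r ≤ s ≤ t. The per-interval matrices Q_1, Q_2, Q_3 are hypothetical labels introduced here only for illustration.

```latex
\begin{align*}
\pi(t) &= \pi(0)\, e^{rQ}\, e^{(s-r)Q}\, e^{(t-s)Q} = \pi(0)\, e^{tQ}
  && \text{(one rate matrix: the factors commute)} \\
\pi(t) &= \pi(0)\, e^{rQ_1}\, e^{(s-r)Q_2}\, e^{(t-s)Q_3}
  && \text{(different rate matrices: earliest interval leftmost)}
\end{align*}
```

With a single rate matrix the order of the factors is immaterial, but once the rate matrix changes between intervals the factors generally do not commute, and each later step has to be multiplied from the right.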

Reviewer #2: This is a nice paper and I enjoyed reading and thinking about it. The authors extend a previous approach to ancestral demographic inference from a multi-species genome sequence alignment, by introducing a more sophisticated representation of the coalescent process.

I don't have too many comments or suggestions to make, as it's a fairly self-contained methodological study and the manuscript motivates and describes the methods and approach well. I'm persuaded that this is a useful approach, and a potentially powerful framework for tackling problems in this area. The results on the great ape alignment provide a helpful demonstration of how it might be used.

There were just a few things I think the authors might address in a bit more detail, two of which relate to assumptions of the model.

Firstly, the model explicitly assumes that the outgroup is sufficiently remote that there is zero ILS between it and the A, B and C lineages involved in the focal speciation events. But at the same time it assumes that e.g. mutation and recombination rates have remained unchanged on all these lineages. In reality neither assumption might hold. But the ILS assumption seems particularly relevant. For example in the case of the HCGO divergence there will be about 13% ILS between HC, G and O using the parameters estimated in the paper. How does this impact the performance of the method or inferences drawn from it? Can it be mitigated in filtering the input alignment blocks? (I didn't see a discussion of this in the Methods.)
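For orientation on the magnitude quoted above, the standard three-lineage coalescent result gives the expected fraction of discordant gene trees as (2/3)·exp(-T), where T is the internal branch length in units of 2N generations. The sketch below only encodes that textbook relation; the function names are hypothetical, and the reviewer's 13% figure may rest on a slightly different definition of ILS than the discordant fraction used here.

```python
import math

def discordant_fraction(branch_generations, n_e):
    """Expected fraction of gene trees discordant with the species tree.

    branch_generations : length of the internal branch, in generations
    n_e                : diploid effective size of the ancestral population
    """
    t = branch_generations / (2.0 * n_e)  # branch length in coalescent units
    return (2.0 / 3.0) * math.exp(-t)

def internal_branch_for_fraction(fraction):
    """Internal branch length (coalescent units) implied by a discordant fraction."""
    if not 0.0 < fraction <= 2.0 / 3.0:
        raise ValueError("fraction must lie in (0, 2/3]")
    return -math.log(1.5 * fraction)
```

For example, `internal_branch_for_fraction(0.13)` returns the internal branch length, in units of 2N generations, that would produce 13% discordance under this definition.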

Secondly, TRAILS fits a 'clean split' speciation model in which there is no admixture between branches after their divergence. The authors discuss how the method might respond when there is potential departure from this in the data, in the context of the long V3 fragment in Fig. 5. One question which arises is whether there are more systematic approaches to detect such signals. For example, can one identify or quantify unexpectedly long fragments based e.g. on their posterior odds under the HMM? For having identified them, one could then look at the numbers of V3 and V2 topologies. Genome-wide asymmetry in these classes could be diagnostic of admixture or introgression, whereas the effects of selection if widespread might be expected to be symmetric. Did (or might) the authors investigate this?

Thirdly, I think it is fair to say that the method is considerably more complex in terms of its underlying machinery than previous approaches such as CoalHMM. It would be good if the authors could comment on how this influences performance and scaling considerations, e.g. what are typical run times, memory requirements etc for the cases presented, particularly as one adds time intervals?

I noticed a couple of typos, in the equation on p. 9 and the preceding text, N_{ABC} should surely be N_{AB}.

I liked the Supplement a lot; it provides a clear discussion which addressed most of the questions I had about the method and its implementation. I do think it will still be difficult to follow for anyone new to the ideas involved, but that's perhaps unavoidable. I have a couple of minor suggestions.

In discussing the basic CTMC, I wonder is it worth noting that there are two aspects of coalescence involved - one being the merging of homologous sequences at a particular locus (e.g. in going from \omega_{00} to \omega_{30} or \omega_{03}), and the other being the linking of two separate loci (e.g. in going from state 1 to another state in \omega_{00}). It's a minor thing but I think many readers will be more familiar with the first than the second. You might also consider reproducing the state diagram for the CTMC in the single-sequence case, to illustrate the process.

The other suggestion is to change the red/blue colours in Fig S5, as there is potential confusion with other figures in which the same colours distinguish sequences/species.

Aylwyn Scally

Reviewer #3: The ms by Rivas-Gonzalez et al. reports a novel, powerful extension of the Coal-HMM framework published by some of the authors some years ago. The ms also advertises TRAILS, the newest release in a series of software tools that infer ancestral population sizes, speciation times and recombination rates from a genomic alignment of 3 species, plus an outgroup. The study assesses the power and the limitations of the method (and the related software) with great care, using simulations and a human-chimp-gorilla+orang-utan alignment.

I would first like to thank the authors for the care they took to explain the method in a very clear and comprehensive supplementary material. With only basic knowledge of HMM and coalescent theory, it is quite easy to follow, enlightening and enjoyable. It helped me a lot in getting a better grasp of what was at stake and greatly improved my comprehension of Coal-HMM techniques. Thank you.

More generally, the ms is scientifically sound, easy to read and quite convincing. I have only a list of comments and suggestions that may help to produce an even better/clearer article.

1) My first and most important suggestion is to change your strategy for figures 2-5 of the main text. As they are currently, they are quite difficult to read and even more difficult to understand. I suspect that you were tempted to provide as much info as you could, but in the end, a casual reader such as my poor self can suffer from the impression of being overwhelmed by the generosity of the figures. I shall now detail a few more concrete suggestions, figure by figure.

- Figure 2. Panel A, what is t_upper? Wouldn't a t_3 ranging from AB to ABC be more intuitive? Or maybe there is something I don't get (probably). On panel B, I am not sure whether having both n_{AB}=3 and n_{AB}=3 is really helping. Reducing the number of subplots can only make the figure clearer. In the current format, it is too crowded and fonts are too small for the readers. Furthermore, why don't you use whisker-plots instead of bar-plots?

- Figure 3. I wonder whether the names "V0" to "V3" are a better choice than self-explanatory newick strings such as "((A,B),C)"? Furthermore, the posterior probabilities within the heatmaps are not well contrasted, so it is not visually convincing that the HMM does a good job (with the exception of the topology). Did you try using log(prob) for the color code, or fewer categories? Furthermore, the tree with tiny slices denoted by Ss and Fs is way too small to be read and again not very helpful. Please make it more straightforward.

- Figure 4. Same remark for the colors of heatmaps. It is not obvious how the Posterior max is computed. Finally, theoretical expectations means theoretical NEUTRAL expectations, correct?

- Figure 5. Panel B has remained opaque to me (I gave up, even as a reviewer). Here again, recalling what "V0" and the others are implies going back and forth between Figure 5 and Figure 2. I again believe newick strings are a better choice than V[0-3]. As in the other figures, font sizes and plots are too small.

2) The underestimation of rho (figure 2 and l131) is intriguing. As the convergence of the other estimates is good, or even very good, this is mind-boggling. The authors suggest that it stems from rapid coalescence after recombination that results in undetectable recombination events. But really what intuitively matters is the occurrence of mutations in the time lag between both types of events. In this case, tuning the \mu to \rho ratio would change the strength of the bias, lowering it as it increases. More generally, as it is the only poor estimate, I recommend exploring different strategies to overcome the bias or at least better characterizing the issue and discussing it more.

3) The authors provide an estimate of the parameters from a single region of chromosome 1. Having a few regions from the same and from different chromosomes and comparing the results is certainly a good move. I somehow have a vague memory that one of the chromosomes had a different pattern of ILS, but I may be totally wrong (this is an old memory).

4) Reports of CPU time and memory consumption are lacking, especially a discussion of how they differ from previous, simpler versions. In general, it is interesting to know how much CPU resource (e.g. carbon) we spend for how much precision we gain. What did we gain for what cost? About CPU time, the complexity is likely linear in alignment size, but how does it scale with the number of time categories in AB and in ABC?

:: A collection of minor remarks ::

l163 : this approximation is not very good. A better one is "2ln(2Ns + c)/s" where c is the Euler constant. At least better use 2Ns than 2N.

l199: insert "new" --> "our NEW estimates"

l269-l278: can we see the xy-plots of the correlated variables (probably in the supp)?

- figure 5C and l257-291. Any temptation to compute some kind of standard neutrality test (e.g. dn/ds on the coding, branch length inflation, or SFS-based using publicly available polymorphisms for this locus)?

- l 321 & l56 (supp): please state that the coalescent occurs between the two ancestors that were carrying the left and right ancestral materials. I did get it, but at first I was disturbed.

- l368 : "ILS can be neglected" instead of "no ILS".

- l429: the second "forward" should be replaced by "backward"

- l112 (supp) \pi_{AB} should be \pi_{ABC}

- l115 (supp) ".." -> "."

- p9-10 (supp). I guess the matrices were recoded in sparse format, which really speeds up calculation and reduces memory.

- l286 (supp). I am not sure but I think it should be exp(tQ) and not exp(tA)

- l473. This "e" is different from the (2,1) vector described earlier l265, no?

- l481. It is not obvious to me how it works as B is a matrix and e a vector (from what I can understand).

********** 

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

********** 

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Aylwyn Scally

Reviewer #3: No

Decision Letter 1

Pier Francesco Palamara, Xiaofeng Zhu

22 Jan 2024

Dear Dr Rivas-González,

We are pleased to inform you that your manuscript entitled "TRAILS: tree reconstruction of ancestry using incomplete lineage sorting" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Pier Francesco Palamara

Guest Editor

PLOS Genetics

Xiaofeng Zhu

Section Editor

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

All major comments have been addressed. Reviewer 1 has a few minor suggestions that the authors may want to consider.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Second review of "TRAILS: tree reconstruction of ancestry using incomplete lineage sorting" by Iker Rivas-González, Mikkel H. Schierup, John Wakeley, and Asger Hobolth

The authors have addressed my major concerns. I do believe that the additional simulations to assess the variability of the parameter estimates (and revised Figure 2) and the posterior decoding allow the reader to better assess the method and how it performs in applications. The added discussion of the runtimes further helps. I do have two minor suggestions, which, I think, can be left to the authors to potentially address. I don't think that addressing them is necessary for publication.

- It could be good to refer to the supplementary section "Beyond three species" in the "Discussion" in the main text as a potential extension of the method.

- I believe that the coalescence rates, recombination rates, and mutation rates in the computations presented in the supplement are all scaled by N_e? If so, it might be worthwhile mentioning this, as it also affects the boundaries of the discretization intervals and the interpretation of the inferred times (see comment and response on t_upper from Reviewer #3).

Reviewer #2: These edits look good to me and I'm satisfied that the authors have addressed the points I raised.

Reviewer #3: I congratulate the authors on having edited their manuscript to make it even clearer and more sound. I have no further comments.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Aylwyn Scally

Reviewer #3: No

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-23-00699R1

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

Pier Francesco Palamara, Xiaofeng Zhu

5 Feb 2024

PGENETICS-D-23-00699R1

TRAILS: tree reconstruction of ancestry using incomplete lineage sorting

Dear Dr Rivas-González,

We are pleased to inform you that your manuscript entitled "TRAILS: tree reconstruction of ancestry using incomplete lineage sorting" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Judit Kozma

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Text. Supplementary notes, including Figs A to S, and Tables A and B.

    S1 Text contains a detailed description of the theoretical framework and implementation of TRAILS, together with supplementary analyses.

    (PDF)

    pgen.1010836.s001.pdf (2.9MB, pdf)
    Attachment

    Submitted filename: TRAILS_answer_to_reviewers.pdf

    pgen.1010836.s002.pdf (378.7KB, pdf)

    Data Availability Statement

    The Python package for TRAILS can be downloaded from PyPI and installed with pip (https://pypi.org/project/trails-rivasiker/), and the source code can be browsed at https://github.com/rivasiker/trails. The code for reproducing the figures in the manuscript can be found at https://github.com/rivasiker/trails_paper.


    Articles from PLOS Genetics are provided here courtesy of PLOS
