BMC Genomic Data. 2024 Jan 2;25:4. doi: 10.1186/s12863-023-01185-8

Dating ancient splits in phylogenetic trees, with application to the human-Neanderthal split

Keren Levinstein Hallak, Saharon Rosset
PMCID: PMC10759710  PMID: 38166646

Abstract

Background

We tackle the problem of estimating species TMRCAs (Time to Most Recent Common Ancestor), given a genome sequence from each species and a large known phylogenetic tree with a known structure (typically from one of the species). The number of transitions at each site from the first sequence to the other is assumed to be Poisson distributed, and only the parity of the number of transitions is observed. The detailed phylogenetic tree contains information about the transition rates in each site. We use this formulation to develop and analyze multiple estimators of the species’ TMRCA. To test our methods, we use mtDNA substitution statistics from the well-established Phylotree as a baseline for data simulation such that the substitution rate per site mimics the real-world observed rates.

Results

We evaluate our methods using simulated data and compare them to the Bayesian optimizing software BEAST2, showing that our proposed estimators are accurate for a wide range of TMRCAs and significantly outperform BEAST2. We then apply the proposed estimators on Neanderthal, Denisovan, and Chimpanzee mtDNA genomes to better estimate their TMRCA with modern humans and find that their TMRCA is substantially later, compared to values cited recently in the literature.

Conclusions

Our methods utilize the transition statistics from the entire known human mtDNA phylogenetic tree (Phylotree), eliminating the requirement to reconstruct a tree encompassing the specific sequences of interest. Moreover, they demonstrate notable improvement in both running speed and accuracy compared to BEAST2, particularly for earlier TMRCAs like the human-Chimpanzee split. Our results date the human – Neanderthal TMRCA to be 408,000 years ago, considerably later than values cited in other recent studies.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12863-023-01185-8.

Keywords: Divergence times, Time to most recent common ancestor (TMRCA), Mitochondrial DNA (mtDNA), BEAST2, Transition rates, Poisson parity, Ancient humans, Neanderthal, Denisovan, Chimpanzee

Background

Dating species divergence has been studied extensively for the last few decades using approaches based on genetics, archaeological findings, and radiocarbon dating [1, 2]. Finding accurate timing is crucial in analyzing morphological and molecular changes in the DNA, in demographic research, and in dating key fossils. One approach for estimating the divergence times is based on the molecular clock hypothesis [3, 4] which states that the rate of evolutionary change of any specified protein is approximately constant over time and different lineages. Subsequently, statistical inference can be applied to a given phylogenetic tree to infer the dating of each node up to calibration.

Our work focuses on this estimation problem and proposes new statistical methods to date the TMRCA in a coalescent tree of two species given a detailed phylogenetic tree for one of the species with the same transition rates per site. Our work does not detect introgression events, and in cases of introgression [5] should be used alongside methods for introgression detection (e.g. [6]). We note that dating the TMRCA in a coalescent tree is different than finding the population tree divergence time. A discussion regarding the differences between the two is available in [7, 8]. Specifically, the discordance of nuclear and mtDNA histories [9] suggests the coalescent tree and population tree of humans may have a different topology.

We formulate the problem of dating the TMRCA by modeling the number of transitions (A↔G, C↔T) in each site using a Poisson process with a different rate per site; sites containing transversions are neglected due to their sparsity (we nevertheless include sparse transversions in the simulations and show that our methods are robust to their occurrence). The phylogenetic tree is used for estimating the transition rates per site. Hence, when considering two representative sequences, one from each population, our problem reduces to a binary sequence where the parity of the number of transitions of each site is the relevant statistic from which we can infer the time difference between them.

We can roughly divide the approaches to solving this problem into two. The frequentist approach seeks to maximize the likelihood of the observed data. Most notable is the PAML [10] package of programs for phylogenetic analyses of DNA and the MEGA software [11]. Alternatively, the Bayesian approach considers a prior of all the problem’s parameters and maximizes the posterior distribution of the observations. Leading representatives of the Bayesian approach are BEAST2 [12] and MrBayes [13], which are publicly available programs for Bayesian inference and model choice across a wide range of phylogenetic and evolutionary models.

In this work, we developed several distinct estimators from frequentist and Bayesian approaches to find the TMRCA directly. The proposed estimators differ in their assumptions on the generated data, the approximations they make, and their numerical stability. We explain each estimator in detail and discuss its properties.

A critical difference between our proposed solutions and existing methods is that we seek to estimate only one specific problem parameter. At the same time, software packages such as BEAST2 and PAML optimize over a broad set of unknown parameters averaging the error on all of them (the tree structure, the timing of every node, the per-site substitution rates, etc.). Subsequently, the resources they require for finding a locally optimal instantiation of the tree and dating all its nodes can be very high in terms of memory and computational complexity. Consequently, the number of sequences they can consider simultaneously is highly limited. Thus, unlike previous solutions, we utilize transition statistics from all available sequences, in the form of a previously built phylogenetic tree.

We develop a novel approach to simulate realistic data to test our proposed solutions. To do so, we employ Phylotree [14, 15] – a complete, highly detailed, constantly updated reconstruction of the human mitochondrial DNA phylogenetic tree. In our work we assume all substitutions are specified by Phylotree (for an elaborate discussion regarding the correctness of this assumption see [16]). We sample transitions with statistics similar to those of Phylotree and use them to simulate a sequence at a predefined distance from Phylotree’s root.

We then empirically test the different estimators on simulated data and compare our results to the BEAST2 software. Our proposed estimators are calculated substantially faster while utilizing the transition statistics from all available sequences (Phylotree considers 24,275 sequences), unlike BEAST2, which can consider only dozens of sequences due to its complexity. Comparing with the ground truth, we show that BEAST2 slightly overestimates the TMRCA, while our estimates provide more accurate results. For larger TMRCAs such as the human-Chimpanzee split, BEAST2 also has a larger variance compared to our methods. Finally, we use our estimators to date the TMRCA (given in kya – kilo-years ago) of modern humans with Neanderthals, Denisovans and Chimpanzee based on their mtDNA. Surprisingly, the TMRCAs we find (human-Neanderthals 408 kya, human-Denisovans 841 kya, human-Chimpanzee 5,010 kya) are considerably later than those accepted today.

Methods

Estimation methods

First, we describe an idealized reduced mathematical formulation for estimating TMRCAs and our proposed solutions. In Estimating ancient TMRCAs using a large modern phylogeny section, we describe the reduction process in greater detail.

Consider the following scenario: we have a set of $n$ Poisson rates, denoted $\{\lambda_i\}_{i=1}^{n}$, where $n \in \mathbb{N}$. Let $X$ be a vector of length $n$ such that each element $X_i$ is independently distributed as $\mathrm{Pois}(\lambda_i)$. Similarly, let $Y$ be a vector of length $n$ such that each element $Y_i$ is independently distributed as $\mathrm{Pois}(\lambda_i \cdot p)$ for a fixed unknown $p$. We denote by $Z$ the coordinate-wise parity of $Y$, meaning that $Z_i = 1$ if $Y_i$ is odd and $Z_i = 0$ otherwise. Our goal is to estimate $p$ given $X$ and $Z$.
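To make the setup concrete, the following minimal Python sketch draws one synthetic pair (X, Z). The function names `poisson_draw` and `simulate_pair`, the seed, and the example rates are illustrative assumptions and are not taken from the original study.

```python
import random

def poisson_draw(rate, rng):
    """Draw from Pois(rate) by counting unit-rate exponential arrivals before time `rate`."""
    count, t = 0, rng.expovariate(1.0)
    while t < rate:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_pair(lams, p, seed=0):
    """Return (X, Z): X_i ~ Pois(lam_i) and Z_i = parity of Y_i ~ Pois(lam_i * p)."""
    rng = random.Random(seed)
    X = [poisson_draw(lam, rng) for lam in lams]
    Y = [poisson_draw(lam * p, rng) for lam in lams]  # Y itself is never observed
    Z = [y % 2 for y in Y]
    return X, Z

# Hypothetical example rates; only X and Z would be available to an estimator.
X, Z = simulate_pair([0.1, 2.0, 11.87], p=0.05)
```

The sampler runs in time proportional to the rate, which is adequate for a sketch but would be replaced by a library routine for large simulations.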

Remark 1

Note that the number of unknown Poisson rate parameters $\{\lambda_i\}_{i=1}^{n}$ in the problem grows with the number of observations $\{(X_i, Z_i)\}_{i=1}^{n}$. However, our focus is solely on estimating $p$, so additional observations do provide more information.

Remark 2

The larger the value of p·λi, the less information on p is provided in Zi as it approaches a Bernoulli distribution with a probability of 0.5. On the other hand, the smaller λi is, the harder it will be to infer λi from Xi. As a result, the problem of estimating p should be easier in settings where λi is high and p is low.

Preliminaries

First, we derive the distribution of $Z_i$. All proofs are provided in the Supplementary material (Section 1).

Lemma 1

Let $Y \sim \mathrm{Pois}(\Lambda)$ and let $Z$ be the parity of $Y$. Then $Z \sim \mathrm{Ber}\left(\frac{1}{2}\left(1-e^{-2\Lambda}\right)\right)$.
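The lemma can be checked directly from the Poisson probability mass function. The short derivation below is only a sketch (the full proof is in the Supplementary material) and uses nothing beyond the series expansion of the hyperbolic sine.

```latex
\begin{aligned}
P(Z = 1) &= \sum_{k \ge 0} P(Y = 2k+1)
          = e^{-\Lambda} \sum_{k \ge 0} \frac{\Lambda^{2k+1}}{(2k+1)!}
          = e^{-\Lambda} \sinh(\Lambda) \\
         &= e^{-\Lambda} \, \frac{e^{\Lambda} - e^{-\Lambda}}{2}
          = \tfrac{1}{2}\left(1 - e^{-2\Lambda}\right).
\end{aligned}
```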

We use this result to calculate the likelihood and log-likelihood of p and λ given X and Z. The likelihood is given by:

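$$L(X,Z;p,\lambda)=\prod_{i=1}^{n}\frac{e^{-\lambda_i}\lambda_i^{X_i}}{X_i!}\cdot\frac{1}{2}\left(1+(-1)^{Z_i}e^{-2\lambda_i p}\right), \tag{1}$$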

and the log-likelihood is:

$$l(X,Z;p,\lambda)=\sum_{i=1}^{n}\left[-\lambda_i+X_i\log\lambda_i+\log\left(1+(-1)^{Z_i}e^{-2\lambda_i p}\right)\right]+\mathrm{Const}. \tag{2}$$

This result follows immediately from the independence of each coordinate.
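As an illustration, the log-likelihood of Equation 2 can be evaluated as in the sketch below. The function name `log_likelihood` is hypothetical, the additive constant is dropped because it does not depend on p or λ, and the code assumes all rates are strictly positive and p > 0.

```python
import math

def log_likelihood(X, Z, p, lams):
    """Equation 2 without the additive constant; requires lam > 0 and p > 0."""
    total = 0.0
    for x, z, lam in zip(X, Z, lams):
        sign = -1.0 if z else 1.0  # (-1)^Z_i
        total += -lam + x * math.log(lam) + math.log(1.0 + sign * math.exp(-2.0 * lam * p))
    return total
```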

Cramer-Rao bound

We begin our analysis by computing the Cramer-Rao bound (CRB; [17, 18]). In Comparative study on raw simulations section, we compare the CRB to the error of the estimators.

Theorem 1

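Denote the Fisher information matrix for the estimation problem above by $I \in \mathbb{R}^{(n+1)\times(n+1)}$, where the first $n$ indices correspond to $\{\lambda_i\}_{i=1}^{n}$ and the last index $(n+1)$ corresponds to $p$. For clarity, denote $I_{p,p} \triangleq I_{n+1,n+1}$, $I_{i,p} \triangleq I_{i,n+1}$, and $I_{p,i} \triangleq I_{n+1,i}$. Then:

$$\forall i \neq j,\ 1 \le i,j \le n:\quad I_{i,j}=0,\qquad I_{i,i}=\frac{1}{\lambda_i}+\frac{4p^2}{e^{4\lambda_i p}-1},\qquad I_{i,p}=I_{p,i}=\frac{4p\lambda_i}{e^{4\lambda_i p}-1},\qquad I_{p,p}=4\sum_{i=1}^{n}\frac{\lambda_i^2}{e^{4\lambda_i p}-1}. \tag{3}$$

Consequently, an unbiased estimator $\hat{p}$ satisfies:

$$E\left[(p-\hat{p})^2\right] \ge \left(4\sum_{i=1}^{n}\frac{\lambda_i^2}{e^{4\lambda_i p}-1+4p^2\lambda_i}\right)^{-1}. \tag{4}$$

If $\lambda_i = \lambda$ for all $i = 1,\dots,n$, we can further simplify the expression:

$$E\left[(p-\hat{p})^2\right] \ge \frac{e^{4\lambda p}-1+4p^2\lambda}{4n\lambda^2}. \tag{5}$$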

The CRB, despite its known looseness in many problems, provides insights into the sensitivity of the error to each parameter. This expression supports our previous observation that the error of an unbiased estimator increases exponentially with $\min_i\{\lambda_i \cdot p\}$. However, for constant $\lambda_i \cdot p$, the error improves for higher values of $\lambda_i$. We now proceed to describe and analyze several estimators for $p$.
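The bound of Theorem 1 is cheap to evaluate numerically. The sketch below codes Equations 4 and 5; the function names `crb` and `crb_equal_rates` are hypothetical, and the code assumes positive rates and p > 0 (very large values of λ·p would overflow the exponential).

```python
import math

def crb(lams, p):
    """Cramer-Rao lower bound on E[(p - p_hat)^2] for an unbiased estimator (Equation 4)."""
    info = 4.0 * sum(
        lam ** 2 / (math.expm1(4.0 * lam * p) + 4.0 * p ** 2 * lam) for lam in lams
    )
    return 1.0 / info

def crb_equal_rates(lam, p, n):
    """Closed form of the bound when all n rates equal lam (Equation 5)."""
    return (math.expm1(4.0 * lam * p) + 4.0 * p ** 2 * lam) / (4.0 * n * lam ** 2)
```

With identical rates the two functions agree, which gives a quick consistency check between the two equations.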

Method 1 - maximum likelihood estimator

Proposition 1

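Following Equation 1, the maximum likelihood estimators $\hat{p}, \hat{\lambda}_i$ satisfy:

$$\sum_{i=1}^{n}\hat{\lambda}_i=\sum_{i=1}^{n}X_i,\qquad X_i=\hat{\lambda}_i+\frac{2\hat{p}\hat{\lambda}_i}{(-1)^{Z_i}e^{2\hat{\lambda}_i\hat{p}}+1},\qquad \sum_{i=1}^{n}\frac{\hat{\lambda}_i}{(-1)^{Z_i}e^{2\hat{\lambda}_i\hat{p}}+1}=0. \tag{6}$$

Proposition 1 provides $n$ separable equations for maximum likelihood estimation (MLE). Our first estimator sweeps over values of $\hat{p}$ (a grid search over a relevant range) and then, for each $i = 1,\dots,n$, finds the optimal $\hat{\lambda}_i$ numerically. The solution is then selected by choosing the pair $(\hat{p}, \{\hat{\lambda}_i\}_{i=1}^{n})$ that maximizes the log-likelihood calculated using Equation 2.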

The obtained MLE equations are solvable, yet, finding the MLE still requires solving n numerical equations, which might be time-consuming. More importantly, MLE estimation is statistically problematic when the number of parameters is of the same order as the number of observations [19]. Subsequently, we propose alternative methods that may yield better practical results.
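The sweep over p can be sketched as follows. Note one deliberate simplification: instead of solving the per-site equation for each rate numerically, this sketch maximises each site's log-likelihood term over a coarse grid of candidate rates. The function names and both grids are illustrative assumptions, not the authors' implementation.

```python
import math

def profile_loglik(X, Z, p, lam_grid):
    """For fixed p, maximise every site's log-likelihood term over a grid of lambda values."""
    total = 0.0
    for x, z in zip(X, Z):
        sign = -1.0 if z else 1.0  # (-1)^Z_i
        total += max(
            -lam + x * math.log(lam) + math.log(1.0 + sign * math.exp(-2.0 * lam * p))
            for lam in lam_grid
        )
    return total

def mle_grid(X, Z, p_grid, lam_grid):
    """Return the value on p_grid with the highest profile log-likelihood."""
    return max(p_grid, key=lambda p: profile_loglik(X, Z, p, lam_grid))
```

All grid values must be strictly positive. The cost is proportional to the product of the two grid sizes and the number of sites, so a root finder per site would be preferable for full-length sequences.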

Method 2 - λi-conditional estimation

We propose a simple estimate of λ based solely on Xi, followed by an estimate of p as if λ is known, considering only Z. This method is expected to perform well when λi values are large, as in these cases, Xi conveys more information about λi than Zi. This approach enables us to avoid estimating both λ and p simultaneously, leading to a simpler numerical solution.

When $p \le 1$, we can mimic the distribution of $Y_i$ as a sub-sample of $X_i$, i.e. we assume that $Y_i \mid X_i \sim \mathrm{Bin}(n = X_i, p)$. We then find the maximum likelihood estimate of $p$:

Proposition 2

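If $Y_i \mid X_i \sim \mathrm{Bin}(X_i, p)$, then:

  1. $Y_i \sim \mathrm{Pois}(\lambda_i \cdot p)$, which justifies this approach.

  2. $Z_i \mid X_i \sim \mathrm{Ber}\left(\frac{1}{2}\left(1-(1-2p)^{X_i}\right)\right)$, so we can compute the likelihood of $p$ without considering $\lambda_i$.

  3. The maximum likelihood estimate of $p$ given $\sum_{i=1}^{n} Z_i$ satisfies:
    $$\sum_{i=1}^{n}(1-2\hat{p})^{X_i} = n - 2\sum_{i=1}^{n} Z_i \tag{7}$$

Remark 3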

We use the maximum likelihood estimate of $p$ given $\sum_{i=1}^{n} Z_i$ by applying Le Cam’s theorem [20]. This eliminates the need for a heuristic treatment of the pathological case $X_i = 0, Z_i = 1$.
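Equation 7 has a single unknown and can be solved by bisection, as in the sketch below. The function name `method2_estimate` is hypothetical, and restricting the search to the interval [0, 0.5], where the left-hand side is monotone in p, is an assumption of this sketch.

```python
def method2_estimate(X, Z, tol=1e-12):
    """Solve sum_i (1 - 2p)^X_i = n - 2 * sum_i Z_i for p in [0, 0.5] by bisection."""
    n = len(X)
    target = n - 2 * sum(Z)

    def lhs(p):
        # Non-increasing in p on [0, 0.5]; sites with X_i = 0 always contribute 1.
        return sum((1.0 - 2.0 * p) ** x for x in X)

    if target < lhs(0.5):
        raise ValueError("no root in [0, 0.5]: too many odd sites for these counts")
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lhs(mid) > target:
            lo = mid  # left-hand side still too large, so p must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, if every site has X_i = 1, the equation reduces to n(1 - 2p) = n - 2ΣZ_i, so the estimate is simply the fraction of odd sites.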

Method 3 - Gamma distributed Poisson rates

The Bayesian statistics approach incorporates prior assumptions about the parameters. A common prior for the rate parameters λ is the Gamma distribution, which is used in popular Bayesian divergence time estimation programs such as MCMCtree [10], BEAST2 [12], and MrBayes [13]. Specifically, we assume $\lambda_i \sim \Gamma(\alpha, \beta)$, and for $p$ we use a uniform prior over the positive real line.

Proposition 3

Let $\lambda_i \sim \Gamma(\alpha, \beta)$. Then the maximum a posteriori estimator of $p$ satisfies:

$$\frac{\partial l}{\partial p} \propto \sum_{i=1}^{n}\frac{X_i+\alpha}{(-1)^{Z_i}\left(1+\frac{2p}{\beta+1}\right)^{X_i+\alpha}+1}=0 \tag{8}$$

Subsequently, given estimated values for α and β, we can numerically find an estimator for p that satisfies Equation 8. Unfortunately, the derivative with respect to α does not have a closed-form expression, nor is it possible to remove the dependence on Z and p. Hence, we suggest using Negative-Binomial regression [21] to estimate α and β given X.
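Once α and β are available, the root of Equation 8 can be located by bisection. The sketch below assumes α and β are supplied by the caller (for instance from a negative-binomial fit, which is not shown), that β is the rate parameter of the Gamma prior, and that the root lies in (0, p_max]; the function name `method3_estimate` and the default `p_max` are hypothetical.

```python
def method3_estimate(X, Z, alpha, beta, p_max=1.0, tol=1e-12):
    """Find the root of the Equation 8 sum in (0, p_max] by bisection, for given alpha, beta."""
    def score(p):
        base = 1.0 + 2.0 * p / (beta + 1.0)
        return sum(
            (x + alpha) / ((-1.0 if z else 1.0) * base ** (x + alpha) + 1.0)
            for x, z in zip(X, Z)
        )

    if not any(Z):
        return 0.0  # no odd sites: the sum is positive everywhere, so the optimum is p = 0
    lo, hi = 0.0, p_max  # the sum tends to -infinity as p -> 0+ when some Z_i = 1
    if score(hi) <= 0.0:
        raise ValueError("no sign change on (0, p_max]; try a larger p_max")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The sum is never evaluated at p = 0 itself, where the terms with Z_i = 1 are singular. Very large counts combined with a large `p_max` can overflow the power, so `p_max` should be kept to a plausible range.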

Estimating ancient TMRCAs using a large modern phylogeny

In this section, we apply the methods described in Estimation methods section to estimate the non-calibrated TMRCAs of humans and their closest relatives by comparing mitochondrial DNA (mtDNA) sequences. Our approach makes the following assumptions:

  1. Molecular clock assumption - the rate of accumulation of transitions (base changes) over time and across different lineages is constant, as first proposed by Zuckerkandl and Pauling [3] and widely used since.

  2. Poisson distribution - The number of transitions along the human and human’s closest relatives mtDNA lineages follows a Poisson distribution with site-dependent rate parameter λi per time unit (implying that the rate of accumulation of transitions is independent of the time since the last transition).

  3. No transversions - We only consider sites with no transversions and assume a constant transition rate per site ($\lambda_{i,A \to G} = \lambda_{i,G \to A}$, or $\lambda_{i,T \to C} = \lambda_{i,C \to T}$).

  4. Independence of sites - The number of transitions at each site is independent of those at other sites.

  5. Phylogenetic tree - The phylogenetic tree presented in the Phylotree database includes all transitions and transversions that occurred along the described lineages.

As the Phylotree database is based on tens of thousands of sequences, the branches in the tree correspond to relatively short time intervals, making multiple mutations per site unlikely in each branch [22]. However, when considering the mtDNA sequence of other species, the branches in the tree correspond to much longer time intervals, meaning that many underlying transitions are unobserved. For instance, when comparing two human sequences that differ in a specific site, Phylotree can determine whether the trajectory between the sequences was A→G, A→G→A→G, or A→T→G. However, when comparing sequences of ancient species, an elaborate phylogenetic tree like Phylotree is not available, making it impossible to discriminate between these different trajectories.

We use the following notation:

  1. Let $X_{mtDNA}$ denote the number of transitions observed at each site along the human mtDNA phylogenetic tree as described by Phylotree. Each coordinate corresponds to a different site out of the 16,569 sites. The number of transitions at site $i$, $X_{mtDNA,i}$, follows a Poisson distribution with parameter $\lambda_i$.

  2. Let $Y$ denote the number of transitions between two examined sequences (e.g. a modern human and a Neanderthal). We normalize the length of the tree edges so that the sum of all Phylotree’s edges is one. The estimated parameter $p$ relates to the edge distance between the two examined sequences. Subsequently, $Y_i$ follows a Poisson distribution with parameter $\lambda_i \cdot p$.

  3. Let $Z$ denote the parity of $Y$.

Using $X$ and $Z$, we can estimate $p$ using the methods in Estimation methods section. The TMRCA is given by $\frac{1}{2}\left(T_{\text{sequence 1}} + T_{\text{sequence 2}} + p\right)$, where $T_{\text{sequence 1}}$ and $T_{\text{sequence 2}}$ are the estimated times of the examined sequences, measured in (uncalibrated) units of Phylotree’s total tree length. For clarity, we summarize the process described in this section in Fig. 1.

Fig. 1.


We use a large, comprehensive phylogeny, such as Phylotree, and assume its tree topology and branch lengths are known. We also assume that the phylogeny is detailed enough so that it describes all the substitutions that occurred between its sequences. From this detailed phylogeny, we extract a list of the number of transitions and transversions that occurred in each site along the phylogeny to get $X_{mtDNA}$ (composed of the number of transitions that occurred at each site) and a list of phylogeny transversion sites - sites in which at least one transversion occurred along the phylogeny. We aim to estimate the distance between two sequences that are not necessarily part of the tree and may be much more distant than branch tree lengths. To do so, we extract a binary vector $Z$ that states for each site whether the sequences are identical ($z_i = 0$, marked black) or different ($z_i = 1$, marked orange). We check in which sites a transversion must have occurred between the two sequences (marked blue) and remove these sites from $X_{mtDNA}$ and $Z$, thus shortening these vectors. We do the same for the phylogeny transversion sites.
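The filtering described in the caption can be sketched in a few lines of Python. The sketch assumes the two sequences are already aligned to the same coordinates, written in upper case, and indexed from zero; skipping gaps and ambiguous bases is an additional assumption, and the names `classify_site` and `build_inputs` are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_site(a, b):
    """Return 'same', 'transition', 'transversion', or 'skip' for one aligned site."""
    if a not in PURINES | PYRIMIDINES or b not in PURINES | PYRIMIDINES:
        return "skip"  # gaps or ambiguous bases (this handling is an assumption)
    if a == b:
        return "same"
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"

def build_inputs(seq1, seq2, x_mtdna, phylogeny_tv_sites):
    """Return (X, Z) restricted to sites without transversions in the tree or between the sequences."""
    X, Z = [], []
    for i, (a, b) in enumerate(zip(seq1, seq2)):
        kind = classify_site(a, b)
        if kind in ("skip", "transversion") or i in phylogeny_tv_sites:
            continue  # drop the site from both vectors
        X.append(x_mtdna[i])
        Z.append(1 if kind == "transition" else 0)
    return X, Z
```

Here `x_mtdna` holds the per-site transition counts from the phylogeny and `phylogeny_tv_sites` is the set of site indices with at least one transversion along the phylogeny.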

Calibration

Our methods output p, which is the ratio of two values:

  1. The sum of the edges between the two examined sequences and their most recent common ancestor (MRCA).

  2. The total sum of Phylotree’s edges.

Similarly to BEAST2, to calibrate p to years, we use the per-site per-year substitution rate for the coding region given in [23], $\mu = 1.57 \times 10^{-8}$. We then calculate the total sum of Phylotree’s edges in years by dividing the average number of substitutions in the coding region per site (1.4) by μ.
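The calibration arithmetic can be written out as a short sketch. The constants are the two values quoted in the text; the function names and the default sample ages of zero (appropriate only for two present-day sequences) are assumptions.

```python
MU = 1.57e-8          # substitutions per site per year (coding region), as quoted in the text
SUBS_PER_SITE = 1.4   # average substitutions per coding-region site along Phylotree

def calibrate(p):
    """Convert p (a fraction of Phylotree's total edge length) to years."""
    total_tree_years = SUBS_PER_SITE / MU
    return p * total_tree_years

def tmrca_years(p, t_seq1_years=0.0, t_seq2_years=0.0):
    """TMRCA = (T_seq1 + T_seq2 + calibrated p) / 2, with sample ages in years before present."""
    return 0.5 * (t_seq1_years + t_seq2_years + calibrate(p))

# A distance of 11.21% of the total tree length is written as p = 0.1121.
print(round(tmrca_years(0.1121) / 1000))  # TMRCA in kya
```

With p = 0.1121 this gives roughly 4,998 kya, in line with the human-Chimpanzee value reported for Method 2; the small gap is presumably due to the rounding of p.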

Results

Comparative study on raw simulations

To compare the performance of the three estimation methods described in Estimation methods section, we conducted experiments using simulated data. The Poisson rates λ were generated to reflect the substitution rates observed in mtDNA data using either a Categorical or a Gamma distribution. The parameters for the Gamma distribution (α=0.23,β=0.164) were estimated directly from the data, while the parameters for the Categorical distribution were chosen such that both distributions have the same mean and variance. One of the Categorical values (ϵ=0.1) corresponds to the rate of low activity sites in the mtDNA data. The other value (a=11.87) and the probabilities (0.11, 0.89) were chosen accordingly. To test the robustness of our methods, we have also conducted simulations using the celebrated K2P [24] and TN93 [25] substitution models, with rate matrix parameters and site scaling extracted from Phylotree’s data. Further simulation details appear in the Supplementary material, Section 2.1. The comparison results are shown in Fig. 2 with the Cramer-Rao bound for reference. To provide a qualitative comparison, we performed a one-sided paired Wilcoxon signed rank test on every pair of models, correcting for multiple comparisons using the Bonferroni correction. Our results show that Method 2 has the lowest squared error while Method 1 has the highest squared error, for all distributions and substitution models. It is noteworthy that although Method 3 assumes a Gamma distribution, it still performs well even when a model mismatch exists.

Fig. 2.


Box-plot of the log squared estimation errors of the three proposed methods for selected values of p, expressed as a percentage of the total length of Phylotree’s edges (outliers are marked with ). The simulations were run 10,000 times for each value of p. The CRB is shown in black for reference and the circles represent the log of the mean values which are comparable to the CRB. The experiments were conducted for two different distributions of λ and two different substitution models: (Top left) Categorical distribution with two values: ϵ=0.1 with probability η=0.11 and a=11.87 with probability 1-η. (Top right) Gamma distribution with parameters α and β. (Bottom left) Site-scaled K2P substitution model. (Bottom right) Site-scaled TN93 substitution model.

Phylogenetic tree simulations

We validated our methods by testing their performance in a more realistic scenario of simulating a phylogenetic tree. Our methods take as input the observed transitions along Phylotree (XmtDNA) using all of its 24,275 sequences and a binary vector Z denoting the differences between two sequences, which we aim to estimate the distance between. We compared our methods to the well-known BEAST2 software [12], which, similarly to other well-established methods (such as MCMCtree [10], MrBayes [13], etc.) considers sequences along with their phylogenetic tree to produce time estimations. The software BEAST2 performs Bayesian analysis using MCMC to average over the space of possible trees. However, it is limited in its computational capacity, so it cannot handle a large number of sequences like those in Phylotree. For this reason, we used a limited set of diverse sequences, including mtDNA genomes of 53 humans [26], the revised Cambridge Reference Sequence (rCRS) [27], the root of the human phylogenetic mtDNA tree, termed Reconstructed Sapiens Reference Sequence (RSRS) [28], and 10 ancient modern humans [23]. More details about the parameters used by BEAST2 are available in the Supplementary material, Section 2.3. To evaluate our methods, we added a simulated sequence with a predefined distance from the RSRS.

Our aim is to generate a vector λ that produces a vector X that has a similar distribution to $X_{mtDNA}$. The human mtDNA tree has 16,569 sites, of which 15,629 have no transversions. The MLE of $\lambda_i$ at each site is the observed number of transitions, $X_{mtDNA,i}$. However, simulating λ as $X_{mtDNA}$ leads to an undercount of transitions because 10,411 sites (67% of the total number of sites considered) had no transitions along the tree and their Poisson rate is taken to be zero. To mitigate this issue, the rates for these sites were chosen to be ϵ, the value that minimizes the Kolmogorov-Smirnov statistic [29, 30] (details are provided in the Supplementary material, Section 2.2).

The results are presented in Fig. 3. Similarly to Fig. 2, Method 1 has a larger error than Methods 2 and 3 for values of p within the simulated region, and the gap widens with increasing p. Methods 2 and 3 provide the best results for the entire range of p. Compared to Methods 2 and 3, BEAST2 is less accurate and has a larger variance for higher values of p. Additionally, BEAST2 has a much longer running time (roughly 1 hour) compared to our methods (less than a second). The BEAST2 simulation presented here was conducted using a fixed tree topology. The results for a simulation without a fixed tree topology (running time 1.5 hours) are presented in Supplementary Fig. 2. Furthermore, to test the effect of additional sequences, we conducted simulations with an additional 50 human sequences for selected values of p and a fixed tree topology (running time 3 hours). Supplementary Fig. 3 presents a comparison of estimation errors for various BEAST2 estimators compared to Method 2. Finally, to examine the effect of removing transversion sites, we conducted simulations with different ti/tv ratios, showing that decreasing the ti/tv ratio does not result in bias (Supplementary Fig. 4).

Fig. 3.


Comparison of our methods with BEAST2 estimator using simulated data. The right plot shows a zoom-in view of the left plot, focusing on values of p between 0 and 20%. Each point in the plot represents the average of 5 runs, while the shaded regions indicate the range of estimations obtained. We note that in order to compare to our methods we only present here point estimates for BEAST2 (posterior means) and not the HPD of the full posterior distribution

Real data results

As the final step of our experiments, we apply our methods to real-world data to determine the TMRCA of the modern human and Neanderthal, Denisovan, and chimpanzee mtDNA genomes. A schema of our estimation process is provided in Fig. 4. Table 1 displays the uncalibrated distances between modern humans and each sequence, compared to the estimates from BEAST2. The presented TMRCA represents an average of the TMRCA obtained from 55 modern human mtDNA sequences of diverse origins [26]. Table 2 presents the TMRCA in kya (kilo-years ago) of the modern human and each sequence. We note that if some of the modern human sequences were very close to one another, a weighted average would be more appropriate, considering the proximity of sequences and giving sequences closer to one another a smaller weight. As the sequences in our case are from diverse origins, a uniform average is a good approximation. An alternative strategy to calculating the average TMRCA with all human mtDNA sequences is to use the RSRS instead. Moreover, we can use the MRCA of Neanderthals instead of using all Neanderthal sequences and the same for any other population with a confidently known MRCA, resulting in one estimate instead of multiple pairwise estimates. This strategy involves combining different possibly noisy TMRCAs (one for each population and one for the distance between the populations).

Fig. 4.


We used the following schema to obtain estimates for the time distance between real-world sequences: We first extracted $X_{mtDNA}$ from Phylotree. Then, we determined the binary vector $Z$ that denotes the differences between the RSRS and the sequence in consideration. We removed from both $X_{mtDNA}$ and $Z$ the phylogeny transversion sites and sequences transversion sites as explained in Fig. 1. We applied our estimation methods using $X_{mtDNA}$ and $Z$ as input to get uncalibrated p and calibrated p as explained in Calibration section. Finally, the TMRCA of the examined sequence and humans is given by $\mathrm{TMRCA} = \frac{1}{2}\left(T_{RSRS} + T_{Sequence} + p_{calibrated}\right)$.

Table 1.

Uncalibrated distances between modern humans and selected hominins

Sample              BEAST2          Method 1        Method 2        Method 3
Altai               0.98 (±0.08)    0.8 (±0.08)     0.79 (±0.08)    0.79 (±0.08)
Denisova15          0.99 (±0.08)    0.81 (±0.08)    0.8 (±0.08)     0.8 (±0.08)
HST                 0.99 (±0.07)    0.78 (±0.08)    0.78 (±0.08)    0.77 (±0.08)
Mezmaiskaya1        1.03 (±0.08)    0.86 (±0.09)    0.85 (±0.09)    0.85 (±0.09)
Chagyrskaya08       1.04 (±0.08)    0.84 (±0.09)    0.83 (±0.09)    0.83 (±0.08)
ElSidron1253        1.06 (±0.08)    0.83 (±0.08)    0.82 (±0.08)    0.82 (±0.08)
Vindija33.17        1.08 (±0.08)    0.86 (±0.09)    0.85 (±0.09)    0.85 (±0.09)
Feldhofer1          1.09 (±0.08)    0.85 (±0.09)    0.84 (±0.08)    0.83 (±0.08)
GoyetQ56-1          1.09 (±0.08)    0.88 (±0.09)    0.88 (±0.09)    0.87 (±0.09)
GoyetQ57-2          1.09 (±0.08)    0.84 (±0.09)    0.83 (±0.08)    0.83 (±0.08)
Les Cottes Z4-1514  1.09 (±0.08)    0.91 (±0.09)    0.91 (±0.09)    0.9 (±0.09)
Mezmaiskaya2        1.09 (±0.08)    0.84 (±0.09)    0.83 (±0.09)    0.83 (±0.08)
Vindija33.16        1.09 (±0.08)    0.87 (±0.09)    0.86 (±0.09)    0.86 (±0.09)
Vindija33.25        1.09 (±0.08)    0.85 (±0.09)    0.84 (±0.09)    0.83 (±0.09)
GoyetQ305-7         1.09 (±0.08)    0.89 (±0.09)    0.89 (±0.09)    0.88 (±0.09)
GoyetQ374a-1        1.09 (±0.08)    0.89 (±0.09)    0.89 (±0.09)    0.88 (±0.09)
Spy 94a             1.09 (±0.08)    0.88 (±0.09)    0.88 (±0.09)    0.87 (±0.09)
Sima de los Huesos  1.79 (±0.11)    1.42 (±0.12)    1.39 (±0.11)    1.39 (±0.11)
Denisova2           2.02 (±0.12)    1.68 (±0.13)    1.65 (±0.12)    1.64 (±0.12)
Denisova8           2.06 (±0.12)    1.69 (±0.13)    1.66 (±0.13)    1.65 (±0.12)
Denisova4           2.18 (±0.12)    1.83 (±0.14)    1.79 (±0.13)    1.78 (±0.13)
Denisova3           2.19 (±0.12)    1.82 (±0.13)    1.78 (±0.13)    1.77 (±0.13)
Chimpanzee          16.33 (±1.22)   12.75 (±0.68)   11.21 (±0.53)   11.21 (±0.53)

Uncalibrated distances expressed as a percentage of the total length of Phylotree’s edges, as determined by our methods compared with BEAST2. The values correspond to p, and indicate the estimation’s location in Fig. 3. In the parentheses, we provide the standard deviation for each estimator, obtained from bootstrapping 100 site samples for every modern human – ancient sequence pair in the dataset. Note that the BEAST2 values presented here were de-calibrated as described in Calibration section
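A site-level bootstrap of this kind can be sketched as follows. The sketch reads "100 site samples" as 100 resamples of the sites with replacement, which is an interpretation; `bootstrap_sd` and its `estimator` argument (any function mapping X and Z to an estimate of p) are hypothetical names.

```python
import random
import statistics

def bootstrap_sd(X, Z, estimator, n_boot=100, seed=0):
    """Standard deviation of estimator(X, Z) over site-level bootstrap resamples."""
    rng = random.Random(seed)
    n = len(X)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample sites with replacement
        estimates.append(estimator([X[i] for i in idx], [Z[i] for i in idx]))
    return statistics.stdev(estimates)
```

Resampling X and Z with the same indices keeps each site's count paired with its parity, which the estimators rely on.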

Table 2.

Estimated TMRCAs of modern human and selected hominins

Sample                  BEAST2               Method 1             Method 2             Method 3
Altai                   -                    425.75 (±39.9)       423.51 (±39.47)      421.09 (±39.02)
Denisova15              -                    427.61 (±39.38)      425.13 (±38.91)      422.75 (±38.54)
HST                     -                    414.19 (±40.78)      411.68 (±40.34)      409.72 (±39.95)
Mezmaiskaya1            -                    430.56 (±41.13)      428.13 (±40.6)       425.64 (±40.28)
Chagyrskaya08           -                    416.76 (±39.93)      414.23 (±39.36)      411.74 (±39.07)
ElSidron1253            -                    403.37 (±38.1)       400.94 (±37.58)      398.43 (±37.23)
Vindija33.17            -                    410.96 (±39.74)      408.44 (±39.19)      406.05 (±38.86)
Feldhofer1              -                    399.36 (±38.45)      396.93 (±37.91)      394.37 (±37.59)
GoyetQ56-1              -                    416.06 (±39.76)      413.6 (±39.11)       411.1 (±38.86)
GoyetQ57-2              -                    394.71 (±38.28)      392.28 (±37.78)      389.72 (±37.43)
Les Cottes Z4-1514      -                    428.91 (±41.04)      426.3 (±40.42)       423.95 (±40.07)
Mezmaiskaya2            -                    395.99 (±38.64)      393.58 (±38.11)      391.02 (±37.74)
Vindija33.16            -                    409.84 (±39.4)       407.47 (±38.81)      404.79 (±38.53)
Vindija33.25            -                    399.91 (±39.27)      397.48 (±38.74)      394.92 (±38.4)
GoyetQ305-7             -                    418.16 (±40.35)      415.45 (±39.69)      413.3 (±39.4)
GoyetQ374a-1            -                    418.16 (±39.72)      415.45 (±39.07)      413.3 (±38.8)
Spy 94a                 -                    415.03 (±40.47)      412.57 (±39.83)      410.07 (±39.55)
Humans-Neandertals      507.13 (±33.86)      410.46 (±37.12)      408.02 (±36.78)      405.67 (±36.29)
Sima de los Huesos      -                    850.67 (±66.75)      840.25 (±65.49)      837.38 (±65.06)
Denisova2               -                    866.15 (±60.08)      851.42 (±57.98)      847.86 (±57.63)
Denisova8               -                    850.67 (±61.93)      835.27 (±59.65)      832.22 (±59.4)
Denisova4               -                    860.43 (±62.55)      843.23 (±60.36)      838.71 (±59.91)
Denisova3               -                    853.07 (±60.73)      836.1 (±58.52)       831.41 (±58.13)
Humans-Denisovans-Sima  1,017.98 (±53.56)    856.2 (±54.75)       841.26 (±52.89)      837.52 (±52.29)
Humans-Chimpanzee       7,292.72 (±545.48)   5,693.51 (±302.59)   5,009.78 (±235.05)   5,005.39 (±237.13)

The table displays the estimated TMRCAs (in kya) between modern humans and selected hominins, as determined by our methods and compared with BEAST2. The standard deviation, which arises from a combination of the standard deviation of our methods and the sample dating, is given in parentheses. It’s important to note that BEAST2 calculates the TMRCA for all sequences in the same clade as a single estimate, while our methods estimate the TMRCA for each sample individually by taking the average of estimations derived from comparing the sample with every modern human sequence in the dataset

The estimates from real-world sequences presented in Table 1 are consistent with those obtained for the simulated dataset in Phylogenetic tree simulations section. For low values of p, our three methods all produce similar estimates, while BEAST2 produces a slightly higher estimate. For the human-Chimpanzee uncalibrated distance, which is relatively high, Method 1 provides a higher estimate than Methods 2 and 3, and BEAST2 provides a substantially higher estimate. The results in Table 2 show that the TMRCA estimates from our methods are substantially smaller than those obtained from BEAST2 for human-Neanderthals and human-Denisovans. For example, BEAST2 estimated the human – Sima de los Huesos – Denisovans TMRCA as 1,018 kya, while our best-performing method (Method 2) estimated it as 841 kya. This TMRCA is estimated as 540–1,410 kya in [31]. Similarly, BEAST2 estimated the human – Neanderthal TMRCA as 507 kya, while our methods estimated it as 408 kya. Earlier literature places this time closer to our estimate (400 kya [32–34]), while recent literature provides a much earlier estimate (800 kya [35]). Finally, BEAST2 estimates the human-Chimpanzee TMRCA as 7,293 kya, whereas our estimate is 5,010 kya; both are close to the literature range of 5–8 million years ago [36–39].

Conclusions

We investigated an estimation problem arising in statistical genetics when estimating the TMRCA of species. The problem’s formulation, estimating Poisson rates from parity samples, leads to multiple estimators with varying assumptions. We calculated the Cramér-Rao bound (CRB) for this estimation problem and compared our methods against the commonly used BEAST2 in different empirical settings, including a simple sampling scheme (Comparative study on raw simulations section), a more elaborate generative scheme based on real-world mtDNA data (Phylogenetic tree simulations section), and the calculation of the TMRCA of modern humans and other hominins using their mtDNA genomes (Real data results section).
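To make the parity formulation concrete, the following toy sketch may help. It is not one of the estimators developed in this paper; it is a generic grid-search maximum-likelihood illustration under simplifying assumptions. It relies on the standard identity that a Poisson variable with mean μ is odd with probability (1 − e^(−2μ))/2. The per-site rates, the time scale `true_t`, the grid, and all function names are hypothetical and chosen only for illustration.

```python
import math
import random


def odd_probability(rate, t):
    """P(N is odd) for N ~ Poisson(rate * t): (1 - exp(-2 * rate * t)) / 2."""
    return -0.5 * math.expm1(-2.0 * rate * t)


def log_likelihood(t, rates, parities):
    """Bernoulli log-likelihood of the observed parities at time scale t."""
    total = 0.0
    for rate, is_odd in zip(rates, parities):
        p = odd_probability(rate, t)
        total += math.log(p) if is_odd else math.log1p(-p)
    return total


def estimate_time(rates, parities, t_grid):
    """Grid-search maximum-likelihood estimate of t (toy illustration only)."""
    return max(t_grid, key=lambda t: log_likelihood(t, rates, parities))


def simulate_parities(rates, t, rng):
    """Draw one parity per site; only parity is observed, so sample it directly."""
    return [rng.random() < odd_probability(rate, t) for rate in rates]


rng = random.Random(0)
# Hypothetical per-site rates in arbitrary units (assumed known in this sketch).
rates = [rng.uniform(0.01, 0.5) for _ in range(1000)]
true_t = 1.0
parities = simulate_parities(rates, true_t, rng)
t_grid = [0.02 * k for k in range(1, 201)]  # candidate values 0.02, 0.04, ..., 4.00
t_hat = estimate_time(rates, parities, t_grid)
print(f"true t = {true_t}, estimated t = {t_hat:.2f}")
```

Because the probability of an odd count saturates at 1/2 as the expected number of transitions grows, fast sites carry little information about large t, which is one intuition for why estimation becomes harder for earlier splits.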

Our results indicate that our proposed methods are significantly faster and more accurate than BEAST2, especially for earlier TMRCAs such as the human-Chimpanzee split. Our methods utilize the transition statistics from the entire known human mtDNA phylogenetic tree (Phylotree) without the need to reconstruct a tree containing the sequences of interest. Our results date the human – Neanderthal TMRCA to 408,000 years ago, considerably later than the value obtained by BEAST2 (507,000 years ago) and other values cited in the literature.

Supplementary Information

Additional file 1. (589.5KB, pdf)

Acknowledgements

Not applicable.

Authors’ contributions

S.R. supervised research, K.L.H. performed the experiments and analyzed the data, S.R. and K.L.H. conceived and designed the experiments, performed the statistical analysis, and wrote the paper.

Funding

This work was partially supported by the Israeli Science Foundation grant 2180/20 and by the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University.

Availability of data and materials

The code used in this work is available at: https://github.com/Kerenlh/DivergenceTimes.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Dos Reis M, Donoghue PC, Yang Z. Bayesian molecular clock dating of species divergences in the genomics era. Nat Rev Genet. 2016;17(2):71–80. doi: 10.1038/nrg.2015.8.
  • 2. Taylor RE. Radiocarbon dating in archaeology. Encyclopedia of Global Archaeology. Cham: Springer International Publishing; 2020. p. 9050–9060.
  • 3. Zuckerkandl E, Pauling L, Kasha M, Pullman B. Horizons in biochemistry. Horizons in biochemistry. 1962;97–166.
  • 4. Zuckerkandl E, Pauling L. In Evolving Genes and Proteins, ed. by V. Bryson & HJ Vogel. New York: Academic Press; 1965.
  • 5. Posth C, Wißing C, Kitagawa K, Pagani L, van Holstein L, Racimo F, et al. Deeply divergent archaic mitochondrial genome provides lower time boundary for African gene flow into Neanderthals. Nat Commun. 2017;8(1):1–9. doi: 10.1038/ncomms16046.
  • 6. Pickrell J, Pritchard J. Inference of population splits and mixtures from genome-wide allele frequency data. Nat Precedings. 2012;1-1.
  • 7. Pettengill JB. The time to most recent common ancestor does not (usually) approximate the date of divergence. PLoS ONE. 2015;10(8):e0128407. doi: 10.1371/journal.pone.0128407.
  • 8. Sağlam İK, Baumsteiger J, Miller MR. Failure to differentiate between divergence of species and their genes can result in over-estimation of mutation rates in recently diverged species. Proc R Soc B Biol Sci. 2017;284(1860):20170021. doi: 10.1098/rspb.2017.0021.
  • 9. Reich D, Green RE, Kircher M, Krause J, Patterson N, Durand EY, et al. Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature. 2010;468(7327):1053–1060. doi: 10.1038/nature09710.
  • 10. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24(8):1586–1591. doi: 10.1093/molbev/msm088.
  • 11. Kumar S, Tamura K, Nei M. MEGA: molecular evolutionary genetics analysis software for microcomputers. Bioinformatics. 1994;10(2):189–191. doi: 10.1093/bioinformatics/10.2.189.
  • 12. Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu CH, Xie D, et al. BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput Biol. 2014;10(4):e1003537. doi: 10.1371/journal.pcbi.1003537.
  • 13. Ronquist F, Teslenko M, Van Der Mark P, Ayres DL, Darling A, Höhna S, et al. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol. 2012;61(3):539–542. doi: 10.1093/sysbio/sys029.
  • 14. Van Oven M, Kayser M. Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum Mutat. 2009;30(2):E386–E394. doi: 10.1002/humu.20921.
  • 15. Van Oven M. PhyloTree Build 17: Growing the human mitochondrial DNA tree. Forensic Sci Int Genet Suppl Ser. 2015;5:e392–e394. doi: 10.1016/j.fsigss.2015.09.155.
  • 16. Levinstein-Hallak K, Tzur S, Rosset S. Big data analysis of human mitochondrial DNA substitution models: a regression approach. BMC Genomics. 2018;19(1):1–13. doi: 10.1186/s12864-018-5123-x.
  • 17. Cramer H. Mathematical methods of statistics. Princeton: Princeton Univ. Press; 1946.
  • 18. Rao CR. Information and the accuracy attainable in the estimation of statistical parameters. Reson J Sci Educ. 1945;20:78–90.
  • 19. Cox DR, Reid N. Parameter orthogonality and approximate conditional inference. J R Stat Soc Ser B (Methodol). 1987;49(1):1–18.
  • 20. Le Cam L. An approximation theorem for the Poisson binomial distribution. Pac J Math. 1960;10(4):1181–1197. doi: 10.2140/pjm.1960.10.1181.
  • 21. Hilbe JM. Negative binomial regression. Cambridge: Cambridge University Press; 2011.
  • 22. Soares P, Ermini L, Thomson N, Mormina M, Rito T, Röhl A, et al. Correcting for purifying selection: an improved human mitochondrial molecular clock. Am J Hum Genet. 2009;84(6):740–759. doi: 10.1016/j.ajhg.2009.05.001.
  • 23. Fu Q, Mittnik A, Johnson PL, Bos K, Lari M, Bollongino R, et al. A revised timescale for human evolution based on ancient mitochondrial genomes. Curr Biol. 2013;23(7):553–559. doi: 10.1016/j.cub.2013.02.044.
  • 24. Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980;16:111–120. doi: 10.1007/BF01731581.
  • 25. Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993;10(3):512–526. doi: 10.1093/oxfordjournals.molbev.a040023.
  • 26. Ingman M, Kaessmann H, Pääbo S, Gyllensten U. Mitochondrial genome variation and the origin of modern humans. Nature. 2000;408(6813):708–713. doi: 10.1038/35047064.
  • 27. Andrews RM, Kubacka I, Chinnery PF, Lightowlers RN, Turnbull DM, Howell N. Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat Genet. 1999;23(2):147. doi: 10.1038/13779.
  • 28. Behar DM, Van Oven M, Rosset S, Metspalu M, Loogväli EL, Silva NM, et al. A “Copernican” reassessment of the human mitochondrial DNA tree from its root. Am J Hum Genet. 2012;90(4):675–84.
  • 29. Kolmogorov A. Sulla determinazione empirica di una legge di distribuzione. Inst Ital Attuari, Giorn. 1933;4:83–91.
  • 30. Smirnov N. Table for estimating the goodness of fit of empirical distributions. Ann Math Stat. 1948;19(2):279–281. doi: 10.1214/aoms/1177730256.
  • 31. Meyer M, Fu Q, Aximu-Petri A, Glocke I, Nickel B, Arsuaga JL, et al. A mitochondrial genome sequence of a hominin from Sima de los Huesos. Nature. 2014;505(7483):403–406. doi: 10.1038/nature12788.
  • 32. Noonan JP, Coop G, Kudaravalli S, Smith D, Krause J, Alessi J, et al. Sequencing and analysis of Neanderthal genomic DNA. Science. 2006;314(5802):1113–1118. doi: 10.1126/science.1131412.
  • 33. Endicott P, Ho SY, Stringer C. Using genetic evidence to evaluate four palaeoanthropological hypotheses for the timing of Neanderthal and modern human origins. J Hum Evol. 2010;59(1):87–95. doi: 10.1016/j.jhevol.2010.04.005.
  • 34. Rieux A, Eriksson A, Li M, Sobkowiak B, Weinert LA, Warmuth V, et al. Improved calibration of the human mitochondrial clock using ancient genomes. Mol Biol Evol. 2014;31(10):2780–2792. doi: 10.1093/molbev/msu222.
  • 35. Gómez-Robles A. Dental evolutionary rates and its implications for the Neanderthal–modern human divergence. Sci Adv. 2019;5(5):eaaw1268.
  • 36. Kumar S, Filipski A, Swarna V, Walker A, Hedges SB. Placing confidence limits on the molecular age of the human-chimpanzee divergence. Proc Natl Acad Sci. 2005;102(52):18842–18847. doi: 10.1073/pnas.0509585102.
  • 37. Langergraber KE, Prüfer K, Rowney C, Boesch C, Crockford C, Fawcett K, et al. Generation times in wild chimpanzees and gorillas suggest earlier divergence times in great ape and human evolution. Proc Natl Acad Sci. 2012;109(39):15716–15721. doi: 10.1073/pnas.1211740109.
  • 38. Amster G, Sella G. Life history effects on the molecular clock of autosomes and sex chromosomes. Proc Natl Acad Sci. 2016;113(6):1588–1593. doi: 10.1073/pnas.1515798113.
  • 39. Stone AC, Battistuzzi FU, Kubatko LS, Perry GH Jr, Trudeau E, Lin H, et al. More reliable estimates of divergence times in Pan using complete mtDNA sequences and accounting for population structure. Phil Trans R Soc B Biol Sci. 2010;365(1556):3277–3288. doi: 10.1098/rstb.2010.0096.


Articles from BMC Genomic Data are provided here courtesy of BMC
