Kumar et al. 10.1073/pnas.0509585102.
Fig. 5. Multifactor bootstrap-resampling (MBR) approach used for estimating 95% confidence intervals. (a) Distribution of the ratio of ape-OWM to human-chimpanzee divergence times obtained from an analysis of third codon position data using nested (open bars) and nonnested (filled bars) multilevel resampling regimes. The two regimes differ in what happens after genes are resampled with replacement in the MBR procedure: sites are resampled only once for each gene in the nonnested case and multiple times in the nested case. (To accentuate the difference between the two methods, we did not use supergene resampling here.) Therefore, when 1,000 gene-resampled data sets are generated in an MBR procedure and sites within genes are resampled 500 times, nested resampling produces 500,000 final data sets, whereas nonnested resampling produces 1,000 final data sets in which site resampling is carried out only once. The difference between these two distributions is not significant (95% C.I. for the nested case is 3.63-5.97 Ma, whereas for the nonnested case it is 3.61-5.97 Ma). (Corresponding ratios are indicated in the figure by triangular marks.) We chose the nonnested methodology (with fewer resamplings) because it takes much less computer time.
Supporting Text
Data Collection.
To maximize the number of usable genes and to exploit a close primate calibration species to estimate human-chimpanzee divergence time, we generated orthologous sequence sets for human (Homo sapiens), chimpanzee (Pan troglodytes), macaque (Macaca mulatta), and mouse (Mus musculus). We began with all available macaque proteins, because macaque has the poorest representation in genetic sequence databases compared with the other three species. Of the 1,050 macaque protein sequences available from the EOL project, 663 unique sequences were retained after removing multiple sequences of the same gene. Protein BLAST searches (E ≤ 10⁻¹⁰) of the GenBank database (www.ncbi.nih.gov) and the Ensembl (www.ensembl.org) chimpanzee database were conducted using the macaque protein sequences; the protein sequences obtained were automatically aligned using CLUSTALW with default settings (1). Orthologous sequences were identified using protein phylogenies inferred by the neighbor-joining method (2, 3) with maximum likelihood distances estimated under the JTT model (4) in MEGA3 (3). For each set of orthologous sequences, all sites containing gaps and missing data were removed. Then all proteins with <40 aa were excluded to avoid using very short sequences. Genomic sequences were preferred whenever available, because of their higher sequence accuracy (5). This resulted in 167 four-species protein alignments.
The Multifactor Bootstrap-Resampling (MBR) Procedure.
Because of the difficulty of analytically modeling all sources of variation in divergence time estimation, we used a bootstrap approach that we refer to as MBR. It differs from the approach used, for example, in ref. 6, in which a source of variation is modeled in a Bayesian analysis and sampled during the MCMC analysis. In the MBR process, the genes are first resampled with replacement to account for the variance introduced by the use of a limited number of genes. This step is skipped in the case of homogeneous data (e.g., genomic neutral sites from different genes), a case we refer to as supergene resampling. However, it is an important step if amino acid sequences are analyzed, because evolutionary rates for individual proteins are strongly influenced by natural selection. In the second step, sites are resampled with replacement in the alignment of sequences, either for each gene individually or for the concatenation of all genes (supergene). This accounts for the variance introduced by estimating evolutionary divergence from a finite number of sites. In the third step, a lineage for time estimation is randomly selected (e.g., chimpanzee or human) to account for uncertainty stemming from rate variation between lineages in local clock analyses. Because we have no reason to prefer one lineage to another, this widens the confidence interval (C.I.) to encompass both. Although this step incorporates rate variation in building the C.I., it does not explicitly model rate variation, and it is unnecessary in Bayesian analysis, in which rate variation among lineages is explicitly modeled. The fourth step incorporates calibration time uncertainty by drawing the calibration time from a specified probability distribution. If multiple calibration points are used, this step should be carried out for each calibration point before proceeding further. This step is omitted if we do not incorporate the uncertainty in the calibration time estimate.
The above four steps yield one bootstrap-replicate data set with resampled sequences and one (or more) calibration time(s). The divergence time for this bootstrap replicate can be estimated by ML-distance and Bayesian (or any other) methods in exactly the same way as for the original data set. It is important to note that a Bayesian analysis (e.g., in MULTIDIVTIME) can provide credibility intervals around the estimates, but these are not used in the MBR process, because using them would result in a questionable mixing of frequentist and Bayesian confidence measures. Therefore, MBR requires only a point estimate of divergence time from each replicate. This process is repeated 1,000 times, and the 2.5th and 97.5th percentiles of the resulting distribution define the 95% C.I. (7). It is also important to note that the variance contributed by the lack of identifiability (e.g., the inability to perfectly separate rates and times when given a set of perfectly estimated branch lengths, as in ref. 8) cannot be accounted for by the MBR procedure, because it will have the same effect on the Bayesian inference in all of the bootstrap replicates.
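The four steps and the percentile interval can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the study: the site encoding, the function names (`mbr_replicate`, `mbr_interval`, `toy_genes`), the normal calibration distribution with a 1-Ma standard deviation, and the uncorrected distance ratio used as a stand-in for the ML-distance or Bayesian point estimate are all assumptions made for the example.

```python
import random


def mbr_replicate(genes, rng, calib_mean=23.8, calib_sd=1.0, supergene=False):
    """Return one bootstrap point estimate of the human-chimpanzee time (Ma).

    genes: list of genes; each gene is a list of sites; each site is a tuple
    (hc, hm, cm) of 0/1 flags marking a human-chimpanzee, human-macaque, or
    chimpanzee-macaque difference (a hypothetical toy encoding).
    """
    # Step 1: resample genes with replacement (skipped for supergene data,
    # where all sites are pooled into a single concatenated gene).
    if supergene:
        sampled = [[site for gene in genes for site in gene]]
    else:
        sampled = [rng.choice(genes) for _ in genes]
    # Step 2: resample sites with replacement, once per sampled gene.
    sites = [rng.choice(gene) for gene in sampled for _ in gene]
    # Step 3: randomly pick the lineage compared with the calibration species.
    lineage = rng.choice((1, 2))  # 1 = human-macaque, 2 = chimpanzee-macaque
    # Step 4: draw the calibration time (normal distribution is an assumption).
    calib = rng.gauss(calib_mean, calib_sd)
    # Stand-in estimator: under a local clock, t = T * d_hc / d_lineage.
    d_hc = sum(s[0] for s in sites) / len(sites)
    d_out = sum(s[lineage] for s in sites) / len(sites)
    if d_out == 0:
        raise ValueError("no differences to the calibration species")
    return calib * d_hc / d_out


def mbr_interval(genes, replicates=1000, seed=0, **kwargs):
    """Approximate 95% C.I. from the 2.5th and 97.5th percentiles."""
    rng = random.Random(seed)
    times = sorted(mbr_replicate(genes, rng, **kwargs) for _ in range(replicates))
    return times[int(0.025 * replicates)], times[int(0.975 * replicates) - 1]


def toy_genes(n_genes=20, n_sites=200, seed=1):
    """Simulated difference flags, used only to exercise the sketch."""
    rng = random.Random(seed)
    return [
        [
            (int(rng.random() < 0.02), int(rng.random() < 0.10), int(rng.random() < 0.10))
            for _ in range(n_sites)
        ]
        for _ in range(n_genes)
    ]


if __name__ == "__main__":
    low, high = mbr_interval(toy_genes(), replicates=1000)
    print(f"95% C.I.: {low:.2f}-{high:.2f} Ma")
```

Any point estimator can replace the distance ratio, because the procedure needs only one divergence time per replicate.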
Because we used a multilevel bootstrapping of structured data (9), we examined whether the bootstrap replicates for each level should be drawn hierarchically (with many site resamplings from a single gene resampling) or linearly (one site resampling from a given gene resampling) and found that they produce equivalent results (Fig. 5). Therefore, we used the linear method, as outlined above, because it is computationally much less demanding. Although we have described the MBR approach for a simple case of four taxa, its implementation for a larger number of species, more calibration points, or different divergence time estimation methods is straightforward.
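The difference between the hierarchical (nested) and linear (nonnested) regimes is easiest to see as a count of final data sets. The following is a minimal sketch under assumed names (`resample`, `multilevel_datasets`) and with genes represented simply as lists of sites; it is not the code used for Fig. 5.

```python
import random


def resample(items, rng):
    """Draw len(items) elements from items with replacement."""
    return [rng.choice(items) for _ in items]


def multilevel_datasets(genes, gene_reps, site_reps, nested, seed=0):
    """Yield bootstrap data sets under nested or nonnested resampling.

    nested=True : each gene resampling is followed by site_reps site
                  resamplings (gene_reps * site_reps data sets in total).
    nested=False: each gene resampling is followed by a single site
                  resampling (gene_reps data sets in total).
    """
    rng = random.Random(seed)
    for _ in range(gene_reps):
        gene_sample = resample(genes, rng)
        for _ in range(site_reps if nested else 1):
            yield [resample(gene, rng) for gene in gene_sample]


if __name__ == "__main__":
    genes = [list("ACGTAC"), list("GGTCA"), list("TTAGC")]
    n_nested = sum(1 for _ in multilevel_datasets(genes, 10, 5, nested=True))
    n_linear = sum(1 for _ in multilevel_datasets(genes, 10, 5, nested=False))
    print(n_nested, n_linear)
```

With 10 gene resamplings and 5 site resamplings, the nested regime yields 50 data sets and the linear regime 10; with 1,000 and 500, the same arithmetic gives 500,000 versus 1,000.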
MBR C.I. Evaluations.
We conducted computer simulations to evaluate how accurately the MBR method generates an appropriate C.I. when used in conjunction with the ML-distance and Bayesian methods. The simulations covered a large number of genes but only the four species considered here. The model phylogeny used in the computer simulations is given in Fig. 1a and assumed t, T, and M to be 5, 25, and 90 Ma, respectively. The ape-OWM and primate-rodent times were chosen to reflect our current understanding of the divergence times for these species based on the molecular data (10-12); the human-chimpanzee divergence time was chosen to give a simple 5:1 ratio for T:t (see also ref. 13). In generating a full simulated data set to mimic the empirical set, we randomly sampled the given number of parameter sets from the collection of evolutionary parameter sets that were derived from the 167 sequence alignments assembled. To simulate deviations from a molecular clock, we assigned evolutionary rates to individual branches for a given parameter set under three scenarios: equal rate, random rate, and correlated rate among lineages. In the equal rate case, all of the branches in the tree were assigned the same expected rate. In the random rate case, the rate at each branch was randomly selected from a uniform distribution of rates so as to introduce ±40% random noise in evolutionary rate; this was done independently for each gene. In the correlated rate case, the assignment of lineage-specific rates uses a stationary lognormal distribution in which the rates vary from branch to branch as a random walk for a given gene, such that the parent and descendant branches are unlikely to differ radically in rate, but rates may drift up or down along the branches of any lineage (14, 15). In this case, the evolutionary rates in the descendants depend on the rates of the ancestors; hence, we refer to this as a correlated rate model. 
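The three rate scenarios can be sketched as follows for the four-species tree with t = 5, T = 25, and M = 90 Ma. This is a hedged illustration, not the simulation code: the branch names, the function names, the value `sigma=0.2`, and the specific form of the lognormal step (a multiplier with mean 1 applied to the parent branch rate) are assumptions chosen for the example.

```python
import math
import random

# Branches as (name, parent branch, duration in Myr); parents precede children.
# Durations follow from t = 5, T = 25, and M = 90 Ma.
BRANCHES = [
    ("mouse", None, 90.0),
    ("primate_stem", None, 65.0),       # root to ape-OWM node (90 - 25)
    ("macaque", "primate_stem", 25.0),
    ("ape_stem", "primate_stem", 20.0),  # ape-OWM node to human-chimpanzee node
    ("human", "ape_stem", 5.0),
    ("chimpanzee", "ape_stem", 5.0),
]


def assign_rates(base_rate, scenario, rng, noise=0.4, sigma=0.2):
    """Assign a rate to every branch of one gene under the given scenario."""
    rates = {}
    for name, parent, _ in BRANCHES:
        if scenario == "equal":
            # Same expected rate on all branches.
            rates[name] = base_rate
        elif scenario == "random":
            # Independent uniform noise of +/-40% around the base rate.
            rates[name] = base_rate * rng.uniform(1 - noise, 1 + noise)
        elif scenario == "correlated":
            # Lognormal random walk: each branch rate is its parent's rate
            # times a lognormal multiplier with mean 1 (assumed form).
            start = base_rate if parent is None else rates[parent]
            rates[name] = start * math.exp(rng.gauss(-sigma**2 / 2, sigma))
        else:
            raise ValueError(f"unknown scenario: {scenario}")
    return rates


def branch_lengths(rates):
    """Expected substitutions per site on each branch (rate x duration)."""
    return {name: rates[name] * duration for name, _, duration in BRANCHES}


if __name__ == "__main__":
    rng = random.Random(42)
    for scenario in ("equal", "random", "correlated"):
        print(scenario, branch_lengths(assign_rates(0.001, scenario, rng)))
```

The resulting branch lengths are the kind of input a sequence simulator takes along with the tree topology.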
All data were generated by using SEQGEN software (16, 17) and analyzed by using ML-distance and Bayesian analysis procedures in the same way as for the empirical data.
Use of Fossil Record to Date Ape-OWM Divergence.
The Oligocene-Miocene boundary at 23.8 Ma (18) serves as a minimum calibration time of the ape-OWM divergence because the earliest representatives of both lineages appear in the fossil record at approximately this time, or only a few million years later (19, 20). However, greater precision in radiometric dating is needed for the earliest fossils of cercopithecoids and hominoids. For example, one poorly known anthropoid fossil (21) from the Upper Oligocene of Kenya shows resemblances to some early Miocene hominoids and might yet prove to be an Oligocene hominoid. Nearly all of the early Miocene sites with definite hominoids were radiometrically dated in the 1960s by potassium-argon methods (22), and they should be redated by using the more accurate 40Ar-39Ar method (23).
Most geologic boundaries were established based on extinction events or other major changes observed in the geologic and paleontological records. The Oligocene-Miocene boundary, which also corresponds to the major division between the Paleogene and Neogene, is not an exception. It is known to have been associated with global climatic change, extinctions, and subsequent adaptive radiations (24, 25). For example, there is a peak in primate lineage (Family) originations in the early Miocene (26), although this may be an artifact of poor primate representation in the late Oligocene of Africa. At the time of the boundary in Africa, faunal change and extinctions were associated, in part, with the establishment of the land bridge between Afro-Arabia and southwest Asia (27, 28). Detailed geologic dating of the boundary is still under debate, with estimates ranging from 22.9 to 24.0 Ma (29, 30).
The third justification for using 23.8 Ma as a calibration point for the ape-OWM divergence is the close correspondence with the date obtained for that divergence, 23.3 ± 1.2 Ma (52 genes sampled), in an earlier molecular clock study (31). In that analysis, different data (protein sequences) and methods were used, including Mesozoic and Paleozoic calibration points.
Parameters Used for Bayesian Analysis.
The following parameters were used in the multicntrl.dat file for Bayesian analysis: numsamps, 10000; sampfreq, 100; burnin, 100000; brownmean, 0.04; brownsd, 0.04; minab, 1.0; newk, 0.1; othk, 0.5; thek, 0.5; bigtime, 110; nodata, 0; commonbrown, 0.
1. Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994) Nucleic Acids Res. 22, 4673-4680.
2. Saitou, N. & Nei, M. (1987) Mol. Biol. Evol. 4, 406-425.
3. Kumar, S., Tamura, K. & Nei, M. (2004) Brief Bioinform. 5, 150-163.
4. Jones, D. T., Taylor, W. R. & Thornton, J. M. (1992) Comput. Appl. Biosci. 8, 275-282.
5. Furey, T. S., Diekhans, M., Lu, Y., Graves, T. A., Oddy, L., Randall-Maher, J., Hillier, L. W., Wilson, R. K. & Haussler, D. (2004) Genome Res. 14, 2034-2040.
6. Weinstock, J., Willerslev, E., Sher, A., Tong, W., Ho, S. Y., Rubenstein, D., Storer, J., Burns, J., Martin, L., Bravi, C., et al. (2005) PLoS Biol. 3, e241.
7. Efron, B. & Tibshirani, R. (1993) An Introduction to the Bootstrap (Chapman & Hall, New York).
8. Thorne, J. L. & Kishino, H. (2005) in Statistical Methods in Molecular Evolution, ed. Nielsen, R. (Springer, New York), pp. 233-256.
9. Hox, J. J. (2002) Multilevel Analysis: Techniques and Applications (Lawrence Erlbaum Associates, Mahwah, NJ).
10. Springer, M. S., Murphy, W. J., Eizirik, E. & O'Brien, S. J. (2003) Proc. Natl. Acad. Sci. USA 100, 1056-1061.
11. Hasegawa, M., Thorne, J. L. & Kishino, H. (2003) Genes Genet. Syst. 78, 267-283.
12. Hedges, S. B. & Kumar, S. (2003) Trends Genet. 19, 200-206.
13. Stauffer, R. L., Walker, A., Ryder, O. A., Lyons-Weiler, M. & Hedges, S. B. (2001) J. Hered. 92, 469-474.
14. Kishino, H., Thorne, J. L. & Bruno, W. J. (2001) Mol. Biol. Evol. 18, 352-361.
15. Aris-Brosou, S. & Yang, Z. (2002) Syst. Biol. 51, 703-714.
16. Grassly, N. C., Adachi, J. & Rambaut, A. (1997) Comput. Appl. Biosci. 13, 559-560.
17. Rambaut, A. & Grassly, N. C. (1997) Comput. Appl. Biosci. 13, 235-238.
18. Remane, J., Cita, M. B., Dercourt, J., Bouysse, P., Repetto, F. & Faure-Muret, A. (2002) International Stratigraphic Chart (International Union of Geological Sciences, Paris).
19. Pickford, M. & Andrews, P. (1981) J. Hum. Evol. 10.
20. Andrews, P., Harrison, T., Martin, L. & Pickford, M. (1981) J. Hum. Evol. 10, 123-128.
21. Leakey, M. G., Ungar, P. S. & Walker, A. (1995) J. Hum. Evol. 28, 519-531.
22. Bishop, W. W., Miller, J. A. & Fitch, F. J. (1969) Am. J. Sci. 267, 669-699.
23. McDougall, I. & Harrison, T. M. (1999) Geochronology and Thermochronology by the 40Ar/39Ar Method (Oxford Univ. Press, New York).
24. Roberts, A. P., Wilson, G. S., Harwood, D. M. & Verosub, K. L. (2003) Palaeogeogr. Palaeocl. 198, 113-130.
25. Spencer-Cervato, C. (1999) Palaeontologia Electronica 2, Issue 2 (Oct. 22). Available at http://palaeo-electronica.org.
26. Benton, M. J. (1993) The Fossil Record 2 (Chapman & Hall, London).
27. Kappelman, J., Rasmussen, D. T., Sanders, W. J., Feseha, M., Bown, T., Copeland, P., Crabaugh, J., Fleagle, J., Glantz, M., Gordon, A., et al. (2003) Nature 426, 549-552.
28. Adams, C. G., Bayliss, D. D. & Whittaker, J. E. (1999) in Fossil Vertebrates of Arabia, eds. Whybrow, P. J. & Hill, A. (Yale Univ. Press, New Haven, CT), pp. 477-483.
29. Wilson, G. S., Lavelle, M., McIntosh, W. C., Roberts, A. P., Harwood, D. M., Watkins, D. K., Villa, G., Bohaty, S. M., Fielding, C. R., Florindo, F., et al. (2002) Geology 30, 1043-1046.
30. International Union of Geological Sciences (2004) International Stratigraphic Chart (International Commission on Stratigraphy, International Union of Geological Sciences, Paris).
31. Kumar, S. & Hedges, S. B. (1998) Nature 392, 917-920.