Significance
The Bayesian method is widely used to estimate species phylogenies using molecular sequence data. While it has long been noted to produce spuriously high posterior probabilities for trees or clades, the precise reasons for this overconfidence are unknown. Here we characterize the behavior of Bayesian model selection when the compared models are misspecified and demonstrate that when the models are nearly equally wrong, the method exhibits unpleasant polarized behaviors, supporting one model with high confidence while rejecting others. This provides an explanation for the empirical observation of spuriously high posterior probabilities in molecular phylogenetics.
Keywords: Bayesian inference, fair-coin paradox, model selection, posterior probability, star-tree paradox
Abstract
The Bayesian method is noted to produce spuriously high posterior probabilities for phylogenetic trees in analysis of large datasets, but the precise reasons for this overconfidence are unknown. In general, the performance of Bayesian selection of misspecified models is poorly understood, even though this is of great scientific interest since models are never true in real data analysis. Here we characterize the asymptotic behavior of Bayesian model selection and show that when the competing models are equally wrong, Bayesian model selection exhibits surprising and polarized behaviors in large datasets, supporting one model with full force while rejecting the others. If one model is slightly less wrong than the other, the less wrong model will eventually win when the amount of data increases, but the method may become overconfident before it becomes reliable. We suggest that this extreme behavior may be a major factor for the spuriously high posterior probabilities for evolutionary trees. The philosophical implications of our results to the application of Bayesian model selection to evaluate opposing scientific hypotheses are yet to be explored, as are the behaviors of non-Bayesian methods in similar situations.
The Bayesian method was introduced into molecular phylogenetics in the 1990s (1–3) and has since become one of the most popular methods for statistical analysis in the field, in particular, for estimation of species phylogenies (4–7). It has been noted that the method often produces very high posterior probabilities for trees or clades (nodes in the tree). In the first-ever Bayesian phylogenetic calculation, a biologically reasonable tree for five species of great apes was produced from a dataset of 11 mitochondrial tRNA genes (739 sites), but the posterior probability for that tree, at 0.9999, was uncomfortably high (1). In the past two decades, the Bayesian method has been used to analyze thousands of datasets, with the computation made possible through Markov chain Monte Carlo (MCMC) (4, 5). It has become a common practice to report posterior clade probabilities only if they are <100% (because most estimates are 100%). In some cases the high posterior probabilities are decidedly spurious. For example, conflicting trees may be inferred from the same data under different evolutionary models. Different trees may be inferred depending on the species sampled in the dataset (8) or on whether protein sequences or the encoding DNA sequences are analyzed (9). In such cases, the different trees cannot all be correct, even if the true tree is unknown. The concern is not so much that the inferred species relationships may be wrong but that they are supported by extremely high posterior probabilities.
In the star-tree paradox, large datasets were simulated using the star tree and then analyzed to calculate the posterior probabilities for the three binary trees (Fig. 1). Most biologists would want the posterior probabilities for the binary trees to converge to $\tfrac{1}{3}$ each when the amount of data increases (10–12). Instead they fluctuate among datasets according to a statistical distribution, sometimes producing strong support for a binary tree even though the data do not contain any information either for or against any binary tree (13–15).
Fig. 1.
(A) The three binary rooted trees and the star tree for three species. (B) The three binary unrooted trees and the star tree for four species. The branch length parameters are shown next to the branches, measured by the expected number of nucleotide changes per site. In the star-tree simulations, the star tree is used to generate data, which are analyzed to calculate the posterior probabilities for the three binary trees, with the star tree excluded.
Bayesian model selection is known to be consistent (16). When the data size $n \to \infty$, the true model “dominates,” with its posterior probability approaching 1. If several models are equally right, the model with fewer parameters dominates. However, this theory applies only if the true model is included in the comparison. Given that a model is a simplified representation of the physical world, the more common situation in real data analysis should be the comparison of models that are all wrong. Not many theoretical results appear to exist concerning Bayesian comparison of misspecified models (17).
Here we study the asymptotic behavior of Bayesian model selection in a general setting where multiple misspecified models are compared. We are interested in how the posterior probabilities for models behave when the data size increases. Do the dynamics depend on whether there are any free parameters in the models? If one model is less wrong than another (in a certain sense appropriately defined), will the less wrong model always win? We present the proofs and mathematical analyses in General Theory for Equally Wrong Models with No Free Parameters (d=0) and General Theory for Equally Right or Equally Wrong Models with Free Parameters (d>0). In the main text, we summarize our results and illustrate them using three canonical simple problems. Our analysis suggests that the problem exposed by the star-tree paradox is actually far more troubling than discussed previously (11–15).
Results
Problem Description.
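We consider independent and identically distributed (i.i.d.) models only. The data $x = \{x_1, \ldots, x_n\}$ are an i.i.d. sample from the true model $g(x)$. We consider two models, as the extension to more than two models is straightforward. Model $H_k$ has density $f_k(x \mid \theta_k)$, with $d_k$ free parameters ($\theta_k$), $k = 1, 2$. We are in particular interested in models of the same dimension, with $d_1 = d_2 = d$. In the Bayesian analysis, we assign a uniform prior for the two models ($\tfrac{1}{2}$ each) and also a prior for the parameters within each model $H_k$: $p(\theta_k \mid H_k)$. The posterior model probabilities, $P_k = \Pr(H_k \mid x)$, are then proportional to the marginal likelihoods: $P_k \propto M_k = \int p(\theta_k \mid H_k) \prod_{i=1}^{n} f_k(x_i \mid \theta_k)\, d\theta_k$; that is, $P_1 = M_1/(M_1 + M_2)$. We are interested in the asymptotic behavior of $P_1$ in large datasets (as $n \to \infty$).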
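The dynamics depend on how well the models fit the data. Write $g(x)$ for the true density and $f_k(x \mid \theta_k)$ for the density under model $H_k$ ($k = 1, 2$). Let $\hat{\theta}_k$ be the maximum-likelihood estimate (MLE) of $\theta_k$ under model $H_k$ from dataset $x$. Let $\theta_k^*$ be the limiting value of $\hat{\theta}_k$ when the data size $n \to \infty$. In other words, $\theta_k^*$ minimizes the Kullback–Leibler (K-L) divergence from model $H_k$ to the true model,

$$D_k = \min_{\theta_k} \int g(x) \log \frac{g(x)}{f_k(x \mid \theta_k)}\, dx, \tag{1}$$

and is known as the best-fitting or pseudotrue parameter value under the model (18). $D_k$ (calculated at $\theta_k^*$) measures the distance from $H_k$ to the true model, with $D_k \ge 0$. We say a model is “right” if it encompasses the true model, with $D_k = 0$, and “wrong” if $D_k > 0$. Model 1 is less wrong than model 2 if $D_1 < D_2$. Both models are “equally right” if $D_1 = D_2 = 0$ and “equally wrong” if $D_1 = D_2 > 0$.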
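The notions of "equally wrong" and "less wrong" can be made concrete with a small calculation. The sketch below evaluates the K-L divergence for coin-tossing (Bernoulli) models with no free parameters. The function name `kl_bernoulli` and the model values 0.4, 0.6, and 0.45 are illustrative assumptions, not taken from the text.

```python
import math

def kl_bernoulli(p_true: float, p_model: float) -> float:
    """K-L divergence from a Bernoulli(p_model) model to a Bernoulli(p_true) truth.

    This is the integral of g * log(g / f) for a two-outcome distribution.
    """
    return (p_true * math.log(p_true / p_model)
            + (1.0 - p_true) * math.log((1.0 - p_true) / (1.0 - p_model)))

# Hypothetical example: the truth is a fair coin.
d_tail = kl_bernoulli(0.5, 0.4)   # model claiming a tail bias
d_head = kl_bernoulli(0.5, 0.6)   # model claiming a head bias
d_mild = kl_bernoulli(0.5, 0.45)  # model with a milder tail bias

print(f"D(tail bias) = {d_tail:.5f}")
print(f"D(head bias) = {d_head:.5f}")
print(f"D(mild bias) = {d_mild:.5f}")
```

Under these assumed values the two symmetric models have the same divergence (equally wrong), the milder model has a smaller one (less wrong), and a model equal to the truth has divergence zero (right).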
Characterization of Bayesian Model Selection.
The asymptotic behavior of the posterior model probability $P_1$ when $n \to \infty$ is analyzed in SI Text and summarized in Fig. 2. We identify three types of asymptotic behaviors: type 1 (“balanced”), type 2 (“volatile”), and type 3 (“polarized”), as defined below. We also refer to three types of inference problems that give rise to those behaviors.
Fig. 2.
Classification of Bayesian model-selection problems involving two equally right or equally wrong models, each with $d$ free parameters. Solid circles represent the true model, while the lines represent the parameter space of the compared models, with the open circles showing the best-fitting parameter value $\theta^*$. The two models are equally right (with K-L divergence $D = 0$) if the solid and open circles coincide and equally wrong (with $D > 0$) if they are separate. The models are “indistinct” if the two open circles coincide (as in A and B) and are “distinct” if they are separate (as in C). The green, orange, and red boxes indicate the three different asymptotic behaviors of Bayesian model selection when the data size $n \to \infty$.
Type 1 (balanced) is for the posterior model probability $P_1$ to converge (as $n \to \infty$) to a single reasonable value that is different from 0 and 1, such as $\tfrac{1}{2}$. In other words, in essentially every large dataset, $P_1 \approx \tfrac{1}{2}$. This behavior occurs when the two models are essentially identical. Examples include comparison of two identical models with no parameters in a coin-tossing experiment, irrespective of the true probability of heads, and overlapping models where the best-fitting parameter values lie in the region of overlap (Fig. 2). Whether the two models are both right or both wrong does not affect the dynamics. The case of overlapping models is interesting. If one of the overlapping models is more informative than the other, and if we assign a uniform prior on the parameter in each model, then as $n \to \infty$, $P_1$ converges to a value that appears more reasonable than $\tfrac{1}{2}$, as it favors the more-informative model. At any rate, the comparison of identical or overlapping models is unusual for testing scientific hypotheses. This type of problem is not considered further.
Type 2 (volatile) is for $P_1$ to converge to a nondegenerate statistical distribution, such as $U(0, 1)$. In other words, if we analyze different large datasets, all generated from the same true model, to compare two equally right or equally wrong models, $P_1$ varies among datasets according to a nondegenerate distribution. This behavior occurs when the two compared models become unidentifiable as the data size $n \to \infty$. There are two scenarios. In the first one, both models are right, with K-L divergences $D_1 = D_2 = 0$ (Fig. 2). In the second one, both models are equally wrong (with $D_1 = D_2 > 0$) but indistinct (Fig. 2). We say that two models are indistinct if and only if they, each at the best-fitting parameter values, are unidentifiable, with $f_1(x \mid \theta_1^*) = f_2(x \mid \theta_2^*)$ for essentially all $x$. In other words, in infinite data, the two models make essentially the same predictions about the data and are unidentifiable.
Type 3 (polarized) is for $P_1$ to have a degenerate two-point distribution, at values 0 and 1. If we analyze large datasets to compare two models, we favor model 1 with total confidence in some datasets and model 2 with total confidence in others. This behavior is observed when the two models are equally wrong and also distinct.
It is remarkable that the asymptotic behavior is determined by whether or not the compared models are distinct and not by whether they are both right or both wrong or by whether the compared models have unknown parameters. For example, in Fig. 2 the case of two right models and the case of two equally wrong but indistinct models show the same volatile behavior, while the case of distinct models with no free parameters and the case of distinct models with free parameters show the same polarized behavior.
Problem 1. Fair-Coin Paradox (Equally Wrong Models with No Free Parameter).
Consider a coin-tossing experiment in which the coin is fair with the probability of heads $\theta_0 = \tfrac{1}{2}$. We use the data of $x$ heads in $n$ tosses to compare two models: $H_1$: $\theta = 0.4$ (tail bias) and $H_2$: $\theta = 0.6$ (head bias). The two models are equally wrong. We assign a uniform prior for the two models ($\tfrac{1}{2}$ each) and calculate the posterior model probability $P_1 = \Pr(H_1 \mid x)$. This is a type-3 problem (Fig. 2).
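As the models involve no free parameters, the likelihood and marginal likelihood are the same, given by the binomial probability for data $x$. With $\theta_1 = 0.4$ and $\theta_2 = 1 - \theta_1 = 0.6$ denoting the probability of heads under the two models, the posterior odds are the likelihood ratio

$$\frac{P_1}{1 - P_1} = \frac{\theta_1^{x} (1 - \theta_1)^{n - x}}{\theta_2^{x} (1 - \theta_2)^{n - x}} = \left(\frac{1 - \theta_1}{\theta_1}\right)^{n - 2x}. \tag{2}$$

When $n$ is large, $P_1$ tends to be extreme (close to 0 or 1). Indeed, $a < P_1 < 1 - a$ if and only if $|n - 2x| < c$, where $c = \log\frac{1 - a}{a} \big/ \log\frac{1 - \theta_1}{\theta_1}$. If $n$ is large, $x$ is $\sim N(n/2, n/4)$, so that

$$\Pr\{a < P_1 < 1 - a\} \approx 2\Phi\!\left(\frac{c}{\sqrt{n}}\right) - 1, \tag{3}$$

where $\Phi$ is the cumulative distribution function (CDF) for $N(0, 1)$. If $a = 0.01$, we have $c$ = 11.33296, so that only 11 data outcomes will give $P_1$ in the range (0.01, 0.99), with $x$ being $n/2 - 5, \ldots, n/2 + 5$ (for even $n$). For $n = 10^3, 10^4, 10^5, 10^6$, we have $\Pr\{0.01 < P_1 < 0.99\}$ = 0.280, 0.090, 0.0286, and 0.0090 using the normal approximation of Eq. 3 or 0.272, 0.0876, 0.0277, and 0.0088 exactly by the binomial distribution. Thus, in large datasets, moderate posterior probabilities will be rare, and either $H_1$ or $H_2$ will be favored with posterior $> 0.99$. When $n \to \infty$, $P_1$ has a degenerate two-point distribution, taking the values 0 and 1, each half of the time. This is the type-3 polarized behavior. Note that there is no information either for or against either model in the data. Fig. 3 A, i shows the distribution of $P_1$.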
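The numbers quoted above can be checked with a few lines of code. The sketch below assumes the two models are $\theta = 0.4$ and $\theta = 0.6$, values chosen because they reproduce the threshold 11.33296 quoted in the text. The function names are illustrative.

```python
import math

def normal_cdf(z: float) -> float:
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def threshold(a: float = 0.01, theta1: float = 0.4) -> float:
    """Bound c such that a < P1 < 1 - a if and only if |n - 2x| < c."""
    return math.log((1.0 - a) / a) / math.log((1.0 - theta1) / theta1)

def prob_moderate_normal(n: int, a: float = 0.01, theta1: float = 0.4) -> float:
    """Normal approximation to Pr{a < P1 < 1 - a} for a fair coin."""
    return 2.0 * normal_cdf(threshold(a, theta1) / math.sqrt(n)) - 1.0

def moderate_outcomes(n: int, a: float = 0.01, theta1: float = 0.4) -> list:
    """All head counts x that give a nonextreme posterior probability."""
    c = threshold(a, theta1)
    return [x for x in range(n + 1) if abs(n - 2 * x) < c]

def prob_moderate_exact(n: int, a: float = 0.01, theta1: float = 0.4) -> float:
    """Exact Pr{a < P1 < 1 - a} under Binomial(n, 1/2)."""
    return sum(math.comb(n, x) for x in moderate_outcomes(n, a, theta1)) / 2 ** n

print(f"c = {threshold():.5f}")
for n in (10 ** 3, 10 ** 4):
    xs = moderate_outcomes(n)
    print(n, len(xs), (xs[0], xs[-1]),
          round(prob_moderate_normal(n), 4), round(prob_moderate_exact(n), 4))
```

The exact calculation is only run for the two smaller data sizes here, because the binomial coefficients become very large integers for $n = 10^6$.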
Fig. 3.
The distribution of posterior model probability in three inference problems. (A) Problem 1 (fair-coin paradox) is for a coin-tossing experiment, where the true model is (a fair coin), and the compared models are (A, i) and so that the two models are equally wrong and (A, ii) and so that is less wrong than . The data size (the number of coin tosses) is . (B) Problem 2 (fair-balance paradox) is for a normal-distribution example in which the true model is , and the two compared models are and , with variance given. The two models are equally right when and equally wrong but indistinct when or 9. The data size is . The plots for or are nearly the same. (C) Problem 3 (fair-balance paradox) is for a normal-distribution example in which the true model is , and the two compared models are and , with (C, i) and , so that the two models are equally wrong, and (C, ii) and , so that is less wrong than . The prior is under each model, with . The data size is . All densities are estimated by simulating samples for .
Fig. 3 A, ii shows the comparison of $H_1$: $\theta = 0.42$ against $H_2$: $\theta = 0.6$ when the truth is $\theta_0 = \tfrac{1}{2}$. Here $H_1$ is less wrong and will eventually dominate. However, in large but finite datasets, the more wrong model can often receive high support. For example, for $n = 10^3$, nonextreme posterior probabilities in the range (0.01, 0.99) occur for only 13 data outcomes, with $x$ being 504–516, and in about 15% of datasets, $x$ is greater than those values so that $P_1 < 0.01$. Indeed, over a wide range of data sizes, the more wrong model is strongly favored too often. The method becomes overconfident before it becomes reliable. It may be noted that such strong support for the more wrong model occurs only when the two models are opposing each other. It does not occur if both models are wrong in the same direction: In that case the less wrong model dominates in the posterior.
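How often the more wrong model is strongly favored can be computed directly from the binomial distribution. The sketch below assumes $H_1$: $\theta = 0.42$, $H_2$: $\theta = 0.6$, and $n = 1000$, values chosen because they reproduce the 13 outcomes 504–516 quoted in the text. The names are illustrative.

```python
import math

def log_posterior_odds(x: int, n: int, theta1: float, theta2: float) -> float:
    """log{P1 / (1 - P1)} for two point models under equal prior probabilities."""
    return (x * math.log(theta1 / theta2)
            + (n - x) * math.log((1.0 - theta1) / (1.0 - theta2)))

n, theta1, theta2, a = 1000, 0.42, 0.6, 0.01   # assumed values, see lead-in
bound = math.log((1.0 - a) / a)                 # |log odds| < bound <=> a < P1 < 1 - a

moderate = [x for x in range(n + 1)
            if abs(log_posterior_odds(x, n, theta1, theta2)) < bound]

# Outcomes for which the more wrong model H2 has posterior above 1 - a.
strong_for_h2 = [x for x in range(n + 1)
                 if log_posterior_odds(x, n, theta1, theta2) <= -bound]

# Probability of those outcomes when the coin is in fact fair.
prob_strong_for_h2 = sum(math.comb(n, x) for x in strong_for_h2) / 2 ** n

print("nonextreme outcomes:", moderate[0], "to", moderate[-1], f"({len(moderate)} values)")
print(f"Pr(P1 < {a}) = {prob_strong_for_h2:.4f}")
```

Changing `n` in this sketch shows how the fraction of datasets that strongly support the more wrong model shrinks only slowly as the data size grows.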
Problem 2. Fair-Balance Paradox (Equally Right Models or Equally Wrong and Indistinct Models).
The true model is $N(0, 1)$, and we compare two models, $H_1$: $N(\mu, 1/\tau)$ with $\mu < 0$ and $H_2$: $N(\mu, 1/\tau)$ with $\mu > 0$, with the precision $\tau$ given. The data may represent measurement errors observed on a fair balance while the models claim that the balance has an unknown negative or positive bias. The best-fitting parameter value (the MLE when the data size $n \to \infty$) is $\mu^* = 0$ in each model, when the two models become identical (indistinct). Thus, the two models are equally right if $\tau = 1$ and are equally wrong if $\tau \ne 1$, for example $\tau = 9$ (Fig. 2).
We assign a uniform prior on the two models ($\tfrac{1}{2}$ each), and a prior on $\mu$ with fixed hyperparameters, truncated to the appropriate range under each model. The data $x$, an i.i.d. sample from $N(0, 1)$, can be summarized as the sample mean $\bar{x}$. It can be shown that the posterior model probability $P_1$ varies among datasets according to the density
| [4] |
where $\Phi^{-1}$ is the inverse CDF for $N(0, 1)$ (Analysis of Problem 2 (Two Equally Right Models or Equally Wrong but Indistinct Models)).
Fig. 3B shows the density of $P_1$ for different values of precision $\tau$. If $\tau = 1$, the two models are equally right, and $P_1 \sim U(0, 1)$ when $n \to \infty$, so that $P_1$ behaves like a random number (11, 12). If $\tau < 1$, the assumed variance is larger than the true variance, so that the distribution has a mode at $\tfrac{1}{2}$. If $\tau > 1$, the assumed variance is too small, and $P_1$ has a U-shaped distribution. If one overstates the precision of the experiment, one tends to overinterpret the data and generate extreme posterior model probabilities. In all three cases, $P_1$ has a nondegenerate distribution.
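The three shapes described above can be reproduced with a short simulation. This is a minimal sketch under stated assumptions, not the calculation used for Fig. 3B: the truth is taken to be $N(0, 1)$, and the prior on the mean is treated as flat, so that for large $n$ the posterior of the mean is approximately $N(\bar{x}, 1/(n\tau))$ and the posterior probability of the negative-bias model is $\Phi(-\bar{x}\sqrt{n\tau})$. The precision values 1/9, 1, and 9 and all names are illustrative.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for its CDF

def simulate_p1(tau: float, n: int = 1000, replicates: int = 20000, seed: int = 1) -> list:
    """Simulate P1 = Pr(mean < 0 | data) across datasets drawn from N(0, 1).

    Assumes a flat prior on the mean, so P1 = Phi(-xbar * sqrt(n * tau)).
    """
    rng = random.Random(seed)
    values = []
    for _ in range(replicates):
        xbar = rng.gauss(0.0, 1.0 / math.sqrt(n))  # sample mean of n draws from N(0, 1)
        values.append(STD_NORMAL.cdf(-xbar * math.sqrt(n * tau)))
    return values

def fraction_extreme(p1_values: list, cutoff: float = 0.95) -> float:
    """Fraction of datasets in which one model has posterior above the cutoff."""
    hits = sum(1 for p in p1_values if p > cutoff or p < 1.0 - cutoff)
    return hits / len(p1_values)

for tau in (1.0 / 9.0, 1.0, 9.0):
    frac = fraction_extreme(simulate_p1(tau))
    print(f"tau = {tau:.3f}: fraction with posterior > 0.95 for either model = {frac:.3f}")
```

Under this approximation the result does not depend on $n$: with the correct precision about 10% of datasets give a posterior above 0.95 for one of the models, as expected for a uniform variable, while overstating the precision makes such extreme posteriors common.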
Problem 3. Fair-Balance Paradox (Equally Wrong and Distinct Models).
The true model is $N(0, 1)$, and the two compared models are $H_1$: $N(\mu, 1/\tau_1)$ and $H_2$: $N(\mu, 1/\tau_2)$, with the precisions $\tau_1 < 1 < \tau_2$ given, while $\mu$ is a free parameter in each model. The best-fitting parameter value is $\mu^* = 0$ in each model, irrespective of the value of $\tau$ assumed. Both models are wrong because of the misspecified variance: $H_1$ is overdispersed while $H_2$ is underdispersed. They are equally wrong, in the sense that $D_1 = D_2$ in Eq. 1, if

$$\tau_1 - \log \tau_1 = \tau_2 - \log \tau_2 \tag{5}$$

(Analysis of Problem 3 (Two Equally Wrong and Distinct Models, Gaussian with Incorrect Variances)). This is a type-3 problem (Fig. 2). We assign a uniform prior over the models ($\tfrac{1}{2}$ each), and a prior on $\mu$, with its hyperparameters given, within each model. The dataset, an i.i.d. sample of size $n$ from $N(0, 1)$, can be summarized as the sample mean $\bar{x}$ and sample variance $s^2$. The posterior odds are given in Eq. S15 in Analysis of Problem 3 (Two Equally Wrong and Distinct Models, Gaussian with Incorrect Variances).
We use $\tau_1 = 0.25$ and $\tau_2 \approx 2.5867$, so that Eq. 5 holds and the two models are equally wrong, to generate independent variables $\bar{x}$ and $s^2$ and to calculate $P_1$. Fig. 3 C, i shows the estimated density of $P_1$. When $n \to \infty$, $P_1$ degenerates into a two-point distribution at 0 and 1, each with probability $\tfrac{1}{2}$. These are the same dynamics as in problem 1 (Fig. 3 A, i), even though in problem 1 the models do not involve any unknown parameters while here they do.
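The polarized behavior in this problem can be explored numerically. The sketch below is an illustration under stated assumptions rather than the calculation of Eq. S15: the truth is taken to be $N(0, 1)$, the first precision is 0.25 (the value mentioned in the text), the second precision is found by matching the K-L divergences $\tfrac{1}{2}(\tau - 1 - \log \tau)$, and the prior on the mean is treated as flat, which gives log marginal likelihood $\tfrac{n-1}{2}\log\tau_k - \tfrac{\tau_k}{2}\sum_i (x_i - \bar{x})^2$ up to a constant shared by both models. All names are illustrative.

```python
import math
import random

def kl_normal(tau: float) -> float:
    """K-L divergence from the model N(0, 1/tau) to the truth N(0, 1)."""
    return 0.5 * (tau - 1.0 - math.log(tau))

def matching_precision(tau1: float, lo: float = 1.0, hi: float = 100.0) -> float:
    """Find tau2 > 1 that is exactly as wrong as tau1 < 1, by bisection."""
    target = kl_normal(tau1)
    while hi - lo > 1e-12:
        mid = 0.5 * (lo + hi)
        if kl_normal(mid) < target:   # kl_normal is increasing for tau > 1
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def posterior_p1(sum_sq: float, n: int, tau1: float, tau2: float) -> float:
    """P1 under equal model priors and a flat prior on the mean (an assumption)."""
    log_odds = 0.5 * (n - 1) * math.log(tau1 / tau2) - 0.5 * (tau1 - tau2) * sum_sq
    if log_odds >= 0.0:               # numerically stable logistic function
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)

def simulate_p1(n: int, tau1: float, tau2: float, replicates: int = 2000, seed: int = 1) -> list:
    """P1 across datasets; the sum of squares about the mean is chi-square(n - 1)."""
    rng = random.Random(seed)
    return [posterior_p1(rng.gammavariate(0.5 * (n - 1), 2.0), n, tau1, tau2)
            for _ in range(replicates)]

tau1 = 0.25
tau2 = matching_precision(tau1)
print(f"tau2 = {tau2:.4f}")
for n in (100, 10000):
    p1 = simulate_p1(n, tau1, tau2)
    extreme = sum(1 for p in p1 if p < 0.01 or p > 0.99) / len(p1)
    print(f"n = {n}: fraction of datasets with P1 outside (0.01, 0.99) = {extreme:.3f}")
```

With equal divergences the leading term of the log odds cancels, leaving a term that fluctuates on the order of $\sqrt{n}$, which is why the posterior is pushed toward 0 or 1 as $n$ grows.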
Fig. 3 C, ii shows the density of when (which is closer to the true than is 0.25), so that is less wrong than (with ). In this case when , . However, in large but finite datasets, for the more wrong model can be large in too many datasets: For example, with , : in of datasets, the more wrong model has posterior higher than .
Star-Tree Paradox and Bayesian Phylogenetics.
In Bayesian phylogenetics (1, 2), each model has two components: the phylogenetic tree describing the relationships among the species and the evolutionary model describing sequence evolution along the branches on the tree (19). Each tree has a set of time or branch-length parameters, which measure the amount of evolutionary change along the branches. The evolutionary model may also involve unknown parameters. The tree and the evolutionary model together specify the likelihood (20), with the branch lengths and the parameters of the evolutionary model being the unknown parameters. One of the trees is true, and all other trees are wrong, while the evolutionary model may be misspecified. The main objective is to infer the true tree. The data consist of an alignment of sequences from the modern species and have a multinomial distribution in which the categories correspond to the possible site patterns (configurations of nucleotides observed in the modern species) while the data size is the number of sites or alignment columns (21).
Here we consider three simple cases involving three or four species (Fig. 1). We use the general theory described above to predict the asymptotic behavior of posterior probabilities for trees and use computer simulation to verify the predictions.
Case A (Fig. 4 A and A´) involves equally right models. We use the rooted star tree for three species (Fig. 1A) to generate datasets to compare the three binary trees. The Jukes–Cantor (JC) substitution model (22), which assumes that the rate of change between any two nucleotides is the same, is used both to generate and to analyze the data. The molecular clock (rate constancy over time) is assumed as well, so that the parameters in each binary tree are the two ages of nodes, measured by the expected number of nucleotide changes per site.
Fig. 4.
The distribution of posterior probabilities for the three binary trees of Fig. 1, when datasets (sequence alignments of two different lengths) are simulated using the star tree and analyzed to compare the three binary trees. In A and A´, the true tree is the star tree for three species of Fig. 1A. Both the simulation and analysis models are JC, and the three binary trees are equally right models. In B and B´, the true tree is the star tree for three species of Fig. 1A. The simulation model is JC+Γ, and the analysis model is JC. The three binary trees represent equally wrong and indistinct models. In C and C´, the true tree is the star tree for four species of Fig. 1B. The simulation model is JC+Γ and the analysis model is JC. The three binary trees represent equally wrong and distinct models. The three corners in the plots correspond to points (1, 0, 0), (0, 1, 0), and (0, 0, 1), while the center is $(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3})$.
The best-fitting parameter values give an internal branch of zero length in each of the three binary trees, in which case each binary tree converges to the true star tree. We assign uniform prior probabilities for the binary trees ($\tfrac{1}{3}$ each) and an exponential prior on branch lengths on each tree. According to our characterization, this is a type-2 problem of comparing equally right models (Fig. 2), so the posterior probabilities should have a nondegenerate distribution. This case was considered in previous studies (12, 14, 15), which generated numerically the limiting distribution of the posterior probabilities for the binary trees when $n \to \infty$ and pointed out that they do not converge to $(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3})$ (11–13).
Case B (Fig. 4 B and B´) involves equally wrong models that are indistinct. This is similar to case A except that the JC+Γ model (22, 23) is used to generate data, with different sites in the sequence evolving at variable rates according to the gamma distribution with shape parameter $\alpha$. The data are then analyzed using JC (equivalent to JC+Γ with $\alpha = \infty$). The best-fitting parameter values (i.e., the MLEs of branch lengths in infinite data) again give an internal branch of zero length under each of the three binary trees. The binary trees thus represent equally wrong models (with $D > 0$ in Eq. 1) that are indistinct. The posterior tree probabilities have a nondegenerate distribution. This is the type-2 volatile behavior for equally wrong and indistinct models (Fig. 2).
Case C (Fig. 4 C and C´) involves equally wrong and distinct models. Like case B, the simulation model is JC+Γ, and the analysis model is JC. However, we do not assume the molecular clock and consider unrooted trees for four species (Fig. 1B). The true tree is the unrooted star tree of Fig. 1B. The best-fitting parameter values (the MLEs of branch lengths in infinite data) include a positive internal branch length for each of the three binary trees (Fig. 1B). The three binary trees are therefore different from the star tree and represent equally wrong and distinct models (with $D > 0$ in Eq. 1). As this is a type-3 problem (Fig. 2), our theory predicts that as $n \to \infty$, the posterior probabilities for the three binary trees should degenerate into a three-point distribution, with probability $\tfrac{1}{3}$ each, for (1, 0, 0), (0, 1, 0), and (0, 0, 1). In other words, one of the binary trees will have posterior $\approx 1$ while the other two will have $\approx 0$. This is confirmed by simulation (Table S1).
We note that most phylogenetic analyses involve unrooted trees as the clock assumption is violated except for closely related species. Furthermore, because of the violation of the evolutionary model, all trees (or the joint tree-process models) represent wrong statistical models. Thus, among the three cases considered in Fig. 4, case C is the most relevant to analysis of real data, when Bayesian model selection exhibits type-3 polarized behavior. Previous analyses of the star-tree paradox (12, 14, 15) have deplored the volatile behavior of the Bayesian phylogenetic method, but those studies examined case A only, so the real situation is worse than previously realized.
A practically important scenario is where all binary trees are wrong because of violation of the evolutionary model but the true tree is less wrong than the other trees. We present such a case in Table S2, in which the data are simulated under JC+ (with ) using a binary tree with a short internal branch () and then analyzed under JC. When the amount of data approaches infinity, the true tree will eventually win, but there exists a twilight zone in which high posterior probabilities for wrong trees occur too frequently; according to Table S2, this zone is wider than . For example, at sequence length and at the 1% nominal level, the error rate of rejecting the true tree is 25.0% and the error rate of accepting a wrong tree is 16.6% (Table S2).
Discussion
High Posterior Probabilities for Phylogenetic Trees.
This work has been motivated by the phylogeny problem and in particular by the empirical observation of spuriously high posterior probabilities for phylogenetic trees (9–14). We note that certain biological processes such as deep coalescence (24, 25), gene duplication followed by gene loss (26), and horizontal gene transfer (24, 26) may cause different genes or genomic regions to have different histories. However, as discussed in the Introduction, posterior probabilities for many trees or clades observed in real data analyses are decidedly spurious even if the true tree is unknown.
One explanation for the spuriously high posterior probabilities for phylogenetic trees is the failure of current evolutionary models to accommodate interdependence among sites in the sequence, leading to an exaggeration of the amount of information in the data. Interacting sites may carry much less information than independent sites. This explanation predicts the problem to be more serious in coding genes than in noncoding regions of the genome as noncoding sites may be evolving largely independently due to lack of functional constraints. However, empirical evidence points to the opposite, with noncoding regions having higher substitution rates and higher information content (if they are not saturated with substitutions), generating more extreme posteriors for trees.
Our results suggest that the problem may lie deeper and may be a consequence of the polarized nature of Bayesian model selection when all models under comparison are misspecified. As the assumptions about the process of sequence evolution are unrealistic, the likelihood model is wrong whatever the tree, although the true tree may be expected to be less wrong than the other trees. As the different trees constitute opposing models that are nearly equally wrong, the inference problem is one of type 3 (Fig. 2). Bayesian tree estimation may then be expected to produce extreme posterior probabilities in large datasets.
Bayesian Selection of Opposing Misspecified Models.
We have provided a characterization of model selection problems according to the asymptotic behavior of the Bayesian method as the data size $n \to \infty$ [Fig. 2 and General Theory for Equally Wrong Models with No Free Parameters (d=0) and General Theory for Equally Right or Equally Wrong Models with Free Parameters (d>0)]. While all of the problems considered here involve comparison of two equally right or equally wrong models, three different asymptotic behaviors are identified, which we label as type 1, type 2, and type 3. The type-1 behavior is for the posterior model probability $P_1$ to converge to a sensible point value, such as $\tfrac{1}{2}$. We consider this to be a good balanced behavior, following phylogeneticists (10–12). The rationale is that one would like a sure answer given an infinite amount of data and the only reasonable sure answer should be $\tfrac{1}{2}$ for each model, since the data contain no information for or against either model. This behavior occurs only when the two models are identical or overlapping, a situation that does not appear relevant to scientific inference. With type-2 behavior, $P_1$ fluctuates among datasets (each of infinite size) like a random number, so that strong support may be attached to a particular model in some datasets. Biologists were surprised at this erratic behavior (10–12), which we label as volatile. This occurs when the models are equally right or equally wrong but indistinct. In theory, type-2 behavior may not pose a serious problem, because the parameter posteriors under the models, if examined carefully, should make it clear that the competing models essentially gave the same interpretation of the data and should lead to the same scientific conclusion. In data simulated in ref. 12 or in Fig. 4 A and A´, the estimates of the internal branch length should be very close to 0, and all binary trees are similar to the same star tree. Nevertheless this escaped our attention at the time.
With type-3 behavior, $P_1$ is $\approx 1$ in half of the datasets and $\approx 0$ in the other half. We describe this behavior as polarized. This occurs when the two models are equally wrong and distinct. Type-3 problems may be the most relevant to practical data analysis given that all models are simplified representations of reality and are thus wrong. A variation on type-3 problems is when one model is only slightly less wrong than another (Fig. 3 A, ii and C, ii and Table S2). While the less wrong model eventually wins in the limit of infinite data, Bayesian model selection is overconfident in large but finite datasets, supporting the more wrong model with high posterior too often.
Note that the question of how the posterior model probability should behave when large datasets are used to compare two equally wrong models is somewhat philosophical and may not have a simple answer. One position is to accept whatever behavior the Bayesian method exhibits. This may be legitimate given that Bayesian theory is the correct probability framework for summarizing evidence in the prior and likelihood. The polarized behavior in type-3 problems may then be seen as a consequence of “user error” (for not including the true model in the comparison), exacerbated by the large data size. In this regard we note that the posterior predictive distribution (27, 28) can be used to assess the general adequacy of any model or the compatibility between the prior and the likelihood, and indeed this has been widely used to assess the goodness of fit of models in phylogenetics (29, 30). Nevertheless, a number of sophisticated and parameter-rich models have been developed for Bayesian phylogenetic analysis, due to three decades of active research (31), and furthermore extreme sensitivity to the assumed model is not a desirable property of an inference method. Seven decades ago, Egon S. Pearson (ref. 32, p.142) wrote that “Hitherto the user has been accustomed to accept the function of probability theory laid down by the mathematicians; but it would be good if he could take a larger share in formulating himself what are the practical requirements that the theory should satisfy in application.” This stipulation may be relevant even today.
Two heuristic approaches have been suggested to remedy the high posterior model probabilities in the context of phylogenies. The first one is to assign nonzero probabilities to multifurcating trees (such as the star tree of Fig. 1) in the prior (11). This is equivalent to assigning some prior probability to the fair-coin model $\theta = \tfrac{1}{2}$ in the fair-coin example of problem 1. While this resolves the star-tree paradox, it suffers from the conceptual difficulty that the multifurcating trees may not be plausible biologically. The second approach is to let the internal branch lengths in the binary trees become increasingly smaller in the prior when the data size increases (12, 14). This is non-Bayesian in that the prior depends on the size of the data. With both approaches, the posterior is extremely sensitive to the prior (9).
Non-Bayesian Methods.
The phylogeny problem was described by Jerzy Neyman (ref. 33, p. 1) as “a source of novel statistical problems.” In the frequentist framework, the test of phylogeny, or test of nonnested models in general, offers challenging inference problems. Note that in many model selection problems, the model itself is not the focus of interest. For example, when an experiment is conducted to evaluate the effect of a new fertilizer, the sensitivity of the inference to the assumed normal distribution with homogeneous variance may be of concern, but the focus is not on the normal distribution itself. In phylogenetics, the phylogeny (which is a model) is of primary interest, far more important than the branch lengths (which are parameters in the model). The test of phylogeny is thus more akin to significance/hypothesis testing than to model selection. Model-selection criteria such as Akaike information criteria (34) or Bayesian information criteria (35) simply rank the trees by their likelihood (maximized over branch lengths) and will not be useful for attaching a measure of significance or confidence in the estimated tree. The phylogeny problem (or the problem of comparing nonnested models in general) falls outside the Fisher–Neyman–Pearson framework of hypothesis testing, which involves two nested models, one of which is true (36, 37).
In principle, Cox’s likelihood-ratio test (38), which conducts multiple tests with each model used in turn as the null, can be used to compare nonnested models. For type-3 problems (Fig. 2, –), this test should lead to rejection of all models. Cox’s test has not been widely used in phylogenetics, apparently because of the great many possible trees and the heavy computation needed to generate the null distribution by simulation.
The most commonly used method for attaching a measure of confidence to the maximum-likelihood (ML) tree is the bootstrap (39). It resamples sites (alignment columns) with replacement to generate bootstrap pseudodatasets and calculates the bootstrap support value for a clade (a node on the species tree) as the proportion of pseudodatasets in which that node is found in the inferred ML tree. This application of the bootstrap to model comparison appears to differ in important ways from the conventional bootstrap for calculating the standard errors and confidence intervals of a parameter estimate (40); a straightforward interpretation of the bootstrap support values for trees remains elusive (31, 41–43). At any rate, the asymptotic behavior of bootstrap support values under the different scenarios of Fig. 2 merits further research. For the fair-coin example of problem 1 (Fig. 2, ), the bootstrap support converges to , different from the posterior probability, although other cases are yet to be explored.
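The bootstrap procedure described above can be sketched for a coin-tossing analog. This is only an illustration under stated assumptions: the two competing models are assumed to be placed symmetrically around a heads probability of 1/2, so that the ML choice depends only on whether heads exceed half of the tosses. The function name, the tie-handling rule, and the simulated data are all hypothetical and are not taken from the analyses in this paper.

```python
import random

def bootstrap_support(tosses, n_boot=1000, seed=0):
    """Proportion of bootstrap pseudodatasets in which the ML choice is
    the heads-favoring model (assumed symmetric two-model setup)."""
    rng = random.Random(seed)
    n = len(tosses)
    wins = 0.0
    for _ in range(n_boot):
        # Resample the n tosses with replacement (one pseudodataset).
        heads = sum(rng.choices(tosses, k=n))
        if 2 * heads > n:
            wins += 1.0   # heads-favoring model has the higher likelihood
        elif 2 * heads == n:
            wins += 0.5   # tie: split the vote (an assumption)
    return wins / n_boot

# Simulated data from a fair coin (1 = heads, 0 = tails).
data_rng = random.Random(1)
tosses = [1 if data_rng.random() < 0.5 else 0 for _ in range(1000)]
print(bootstrap_support(tosses))
```

Each toss plays the role of a site, and each model plays the role of a tree. Repeating the simulation with different data seeds gives an empirical picture of how the support value varies across datasets of a given size.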
Materials and Methods
Star-Tree Simulations.
For Fig. 4 A, A´, B, and B´, the true tree is of Fig. 1A. The counts of five site patterns (, , , , and ) were simulated by multinomial sampling (21) and analyzed using a C program, which calculates the 2D integrals in the marginal likelihood by Gauss-Legendre quadrature with 128 points (14). For Fig. 4 C and C´, the true tree is of Fig. 1B. Sequence alignments were simulated using EVOLVER and analyzed using MrBayes (4).
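As an illustration of the numerical technique only, the following sketch evaluates a 2D integral with a tensor-product Gauss-Legendre rule using 128 points per dimension. It is not the authors' C program: the integrand here is a toy function with a known answer, standing in for the product of likelihood and prior over two branch-length parameters, and the function names and integration limits are hypothetical.

```python
import math

def gauss_legendre(n):
    """Nodes and weights of the n-point Gauss-Legendre rule on [-1, 1],
    found by Newton iteration on the Legendre polynomial P_n."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))  # initial guess
        for _ in range(100):
            p_prev, p = 1.0, x  # P_0(x), P_1(x)
            for k in range(2, n + 1):
                # Three-term recurrence for P_k(x).
                p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
            dp = n * (x * p - p_prev) / (x * x - 1.0)  # derivative P_n'(x)
            dx = p / dp
            x -= dx
            if abs(dx) < 1e-14:
                break
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights

def integrate_2d(f, ax, bx, ay, by, n=128):
    """Integrate f(x, y) over [ax, bx] x [ay, by] with an n x n rule."""
    nodes, weights = gauss_legendre(n)
    hx, cx = 0.5 * (bx - ax), 0.5 * (bx + ax)
    hy, cy = 0.5 * (by - ay), 0.5 * (by + ay)
    total = 0.0
    for xi, wi in zip(nodes, weights):
        for yj, wj in zip(nodes, weights):
            total += wi * wj * f(cx + hx * xi, cy + hy * yj)
    return hx * hy * total

# Toy integrand with exact integral (1 - exp(-1))**2 on the unit square.
approx = integrate_2d(lambda x, y: math.exp(-(x + y)), 0.0, 1.0, 0.0, 1.0)
print(approx, (1.0 - math.exp(-1.0)) ** 2)
```

For a real marginal likelihood the integrand can be sharply peaked in large datasets, so in practice the integration range and any rescaling of the integrand would need care; those details are not covered by this sketch.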
Acknowledgments
We thank Philip Dawid and Wally Gilks for stimulating discussions and Jeff Thorne and an anonymous reviewer for constructive comments. Z.Y. was supported by a Biotechnological and Biological Sciences Research Council grant (BB/P006493/1) and in part by the Radcliffe Institute for Advanced Study at Harvard University. T.Z. was supported by Natural Science Foundation of China grants (31671370, 31301093, 11201224, and 11301294) and a grant from the Youth Innovation Promotion Association of the Chinese Academy of Sciences (2015080).
Footnotes
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1712673115/-/DCSupplemental.
References
- 1. Rannala B, Yang Z. Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference. J Mol Evol. 1996;43:304–311. doi: 10.1007/BF02338839.
- 2. Mau B, Newton M. Phylogenetic inference for binary data on dendrograms using Markov chain Monte Carlo. J Comput Graph Stat. 1997;6:122–131.
- 3. Li S, Pearl D, Doss H. Phylogenetic tree reconstruction using Markov chain Monte Carlo. J Am Stat Assoc. 2000;95:493–508.
- 4. Ronquist F, et al. MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol. 2012;61:539–542. doi: 10.1093/sysbio/sys029.
- 5. Bouckaert R, et al. BEAST 2: A software platform for Bayesian evolutionary analysis. PLoS Comput Biol. 2014;10:e1003537. doi: 10.1371/journal.pcbi.1003537.
- 6. Lartillot N, Lepage T, Blanquart S. PhyloBayes 3: A Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics. 2009;25:2286–2288. doi: 10.1093/bioinformatics/btp368.
- 7. Chen MH, Kuo L, Lewis P. Bayesian Phylogenetics: Methods, Algorithms, and Applications. Chapman and Hall/CRC; London: 2014.
- 8. Bourlat SJ, et al. Deuterostome phylogeny reveals monophyletic chordates and the new phylum Xenoturbellida. Nature. 2006;444:85–88. doi: 10.1038/nature05241.
- 9. Yang Z. Empirical evaluation of a prior for Bayesian phylogenetic inference. Philos Trans R Soc Lond B. 2008;363:4031–4039. doi: 10.1098/rstb.2008.0164.
- 10. Suzuki Y, Glazko G, Nei M. Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics. Proc Natl Acad Sci USA. 2002;99:16138–16143. doi: 10.1073/pnas.212646199.
- 11. Lewis P, Holder M, Holsinger K. Polytomies and Bayesian phylogenetic inference. Syst Biol. 2005;54:241–253. doi: 10.1080/10635150590924208.
- 12. Yang Z, Rannala B. Branch-length prior influences Bayesian posterior probability of phylogeny. Syst Biol. 2005;54:455–470. doi: 10.1080/10635150590945313.
- 13. Steel MA, Matsen F. The Bayesian “star paradox” persists for long finite sequences. Mol Biol Evol. 2007;24:1075–1079. doi: 10.1093/molbev/msm028.
- 14. Yang Z. Fair-balance paradox, star-tree paradox and Bayesian phylogenetics. Mol Biol Evol. 2007;24:1639–1655. doi: 10.1093/molbev/msm081.
- 15. Susko E. On the distributions of bootstrap support and posterior distributions for a star tree. Syst Biol. 2008;57:602–612. doi: 10.1080/10635150802302468.
- 16. Dawid A. Posterior model probabilities. In: Bandyopadhyay PS, Forster M, editors. Philosophy of Statistics. Elsevier; New York: 2011. pp. 607–630.
- 17. Berk R. Limiting behavior of posterior distributions when the model is incorrect. Ann Math Stat. 1966;37:51–58.
- 18. White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:1–25.
- 19. Yang Z, Goldman N, Friday AE. Maximum likelihood trees from DNA sequences: A peculiar statistical estimation problem. Syst Biol. 1995;44:384–399.
- 20. Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J Mol Evol. 1981;17:368–376. doi: 10.1007/BF01734359.
- 21. Yang Z. Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods. Syst Biol. 1994;43:329–342.
- 22. Jukes T, Cantor C. Evolution of protein molecules. In: Munro H, editor. Mammalian Protein Metabolism. Academic; New York: 1969. pp. 21–123.
- 23. Yang Z. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol. 1993;10:1396–1401. doi: 10.1093/oxfordjournals.molbev.a040082.
- 24. Maddison W. Gene trees in species trees. Syst Biol. 1997;46:523–536.
- 25. Xu B, Yang Z. Challenges in species tree estimation under the multispecies coalescent model. Genetics. 2016;204:1353–1368. doi: 10.1534/genetics.116.190173.
- 26. Nichols R. Gene trees and species trees are not the same. Trends Ecol Evol. 2001;16:358–364. doi: 10.1016/s0169-5347(01)02203-0.
- 27. Roberts H. Probabilistic prediction. J Am Stat Assoc. 1965;60:50–62.
- 28. Box G. Sampling and Bayes’ inference in scientific modelling and robustness. J R Stat Soc A. 1980;143:383–430.
- 29. Sullivan J, Joyce P. Model selection in phylogenetics. Annu Rev Ecol Evol Syst. 2005;36:445–466.
- 30. Rodrigue N, Philippe H, Lartillot N. Assessing site-interdependent phylogenetic models of sequence evolution. Mol Biol Evol. 2006;23:1762–1775. doi: 10.1093/molbev/msl041.
- 31. Yang Z. Molecular Evolution: A Statistical Approach. Oxford Univ Press; Oxford: 2014.
- 32. Pearson E. The choice of statistical tests illustrated on the interpretation of data classed in the 2x2 table. Biometrika. 1947;34:139–167. doi: 10.1093/biomet/34.1-2.139.
- 33. Neyman J. Molecular studies of evolution: A source of novel statistical problems. In: Gupta SS, Yackel J, editors. Statistical Decision Theory and Related Topics. Academic; New York: 1971. pp. 1–27.
- 34. Akaike H. Information theory and an extension of the likelihood principle. In: Petrov BN, Csaki F, editors. Proceedings of the Second International Symposium of Information Theory. Akademiai Kiado; Budapest: 1973. pp. 267–281.
- 35. Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6:461–464.
- 36. Lehmann E. Testing Statistical Hypotheses. 2nd Ed. Springer; New York: 1997.
- 37. Goldman N, Anderson J, Rodrigo A. Likelihood-based tests of topologies in phylogenetics. Syst Biol. 2000;49:652–670. doi: 10.1080/106351500750049752.
- 38. Cox D. Tests of separate families of hypotheses. Proc 4th Berkeley Symp Math Stat Prob. 1961;1:105–123.
- 39. Felsenstein J. Confidence limits on phylogenies: An approach using the bootstrap. Evolution. 1985;39:783–791. doi: 10.1111/j.1558-5646.1985.tb00420.x.
- 40. Efron B, Tibshirani R. An Introduction to the Bootstrap. Chapman and Hall; London: 1993.
- 41. Felsenstein J, Kishino H. Is there something wrong with the bootstrap on phylogenies? A reply to Hillis and Bull. Syst Biol. 1993;42:193–200.
- 42. Efron B, Halloran E, Holmes S. Bootstrap confidence levels for phylogenetic trees. Proc Natl Acad Sci USA. 1996;93:7085–7090, and correction (1996) 93:13429–13434.
- 43. Susko E. Bootstrap support is not first-order correct. Syst Biol. 2009;58:211–223. doi: 10.1093/sysbio/syp016.