Assessing Combinability of Phylogenomic Data Using Bayes Factors

Suman Neupane; Karolina Fučíková; Louise A Lewis; Lynn Kuo; Ming-Hui Chen; Paul O Lewis

Syst Biol. 2019 Feb 6;68(5):744–754. doi: 10.1093/sysbio/syz007

Editor: Robert Thomson

PMCID: PMC7967903 PMID: 30726954

Abstract

With the rapid reduction in sequencing costs of high-throughput genomic data, it has become commonplace to use hundreds of genes to infer phylogeny of any study system. While sampling a large number of genes has given us a tremendous opportunity to uncover previously unknown relationships and improve phylogenetic resolution, it also presents us with new challenges when the phylogenetic signal is confused by differences in the evolutionary histories of sampled genes. Given the incorporation of accurate marginal likelihood estimation methods into popular Bayesian software programs, it is natural to consider using the Bayes Factor (BF) to compare different partition models in which genes within any given partition subset share both tree topology and edge lengths. We explore using marginal likelihood to assess data subset combinability when data subsets have varying levels of phylogenetic discordance due to deep coalescence events among genes (simulated within a species tree), and compare the results with our recently described phylogenetic informational dissonance index (D) estimated for each data set. BF effectively detects phylogenetic incongruence and provides a way to assess the statistical significance of D values. We use BFs to assess data combinability using an empirical data set comprising 56 plastid genes from the green algal order Volvocales. We also discuss the potential need for calibrating BFs and demonstrate that BFs used in this study are correctly calibrated.

Keywords: Bayes factor, Bayes factor calibration, concatenation, marginal likelihood, phylogenetic dissonance, phylogenetics, phylogenomics

Introduction

Until recently, common practice for inferring multigene phylogenies involved concatenation of all available genes with an assumption that the evolutionary histories of all sampled genes were identical. However, phylogenetic trees for different genes (gene trees) can differ from each other, from the tree inferred from the concatenated data, and from the true species tree, due to evolutionary events/processes such as incomplete lineage sorting (ILS), horizontal transfer, and hybridization (Maddison 1997; Degnan and Rosenberg 2009; Edwards 2009; Mallet et al. 2016). Further, even if the sampled genes share the same evolutionary history, estimated trees can differ because of: 1) insufficient phylogenetic information in the sampled genes (stochastic or sampling error) or 2) model misspecification (systematic error) leading to, for example, long edge attraction in some gene trees and not in others (Swofford et al. 1996; Philippe et al. 2005; Philippe et al. 2011).

With the recent surge of large-scale genomic DNA data from high-throughput sequencing methods, the issue of phylogenetic incongruence has become even more important in phylogeny reconstruction. Inferring species trees by addressing these challenges has become an area of active research in phylogenetics. Several species tree methods already available (reviewed in Liu et al. 2015) are effective in correcting incongruences due to deep coalescence (e.g., Song et al. 2012; Jarvis et al. 2014; Xi et al. 2014; Tang et al. 2015). These methods estimate a species tree either from multiple sequence alignments (e.g., *BEAST, Heled and Drummond 2010; BEST, Liu et al. 2008; SVDquartets in PAUP* v.4a137+, Swofford 2003; Chifman and Kubatko 2014; Chifman and Kubatko 2015) or summary statistics calculated from estimated gene trees (e.g., STEM, Kubatko et al. 2009; MP-EST, Liu et al. 2010; BUCKy, Ané et al. 2007; ASTRAL, Mirarab et al. 2014). Methods such as *BEAST and BEST simultaneously estimate gene trees and the species tree by using MCMC to integrate over trees and substitution model parameters; however, coestimation of species and gene trees under a multispecies coalescent model is computationally intensive and cannot be applied to large-scale genomic data. On the other hand, fast and efficient summary statistic methods (e.g., Mirarab et al. 2016) that completely rely on estimated gene trees (partial data) for the downstream species tree estimation may be prone to systematic bias as they do not incorporate uncertainty in the gene tree estimation process. SVDquartets stands out as a method that is computationally tractable, infers species trees from sequence data directly rather than depending on accurately estimated gene trees, and explicitly models ILS. However, even if methods exist that accurately estimate species trees in the face of deep coalescence, there is still a need for methods that identify incongruent data subsets resulting from processes that cause conflict among genes but which are not explicitly modeled.

Tests for Phylogenetic Congruence using Bayes Factors

The only direct tests of phylogenetic congruence proposed to date that work from multiple sequence alignments and use an explicit model-testing approach are likelihood ratio tests (LRTs). Huelsenbeck and Bull (1996) proposed a parametric bootstrapping approach in which the null hypothesis constrained all data subsets to have the same tree topology, while the alternative (unconstrained) hypothesis allowed each subset to have a potentially different tree topology. The distribution of the test statistic was generated by simulating data sets under the null hypothesis using maximum likelihood estimates of all model parameters and computing the test statistic under each simulated data set.

Nonparametric bootstrapping, in conjunction with LRTs, was used by Leigh et al. (2008) to test the same null hypothesis. Leigh et al. (2008) also proposed clustering of data subsets based on pairwise LRT results to generate compatible sets. Separate LRTs were also proposed by Leigh et al. (2008) to test for heterotachy: in this case, the null hypothesis constrains edge lengths to be proportionally identical across subsets, while the alternative hypothesis allows each subset to potentially have different edge lengths. The software CONCATERPILLAR (Leigh et al. 2008) may be used to carry out these nonparametric bootstrapping LRTs.

These LRTs are well justified and are the best available means to assess congruence when there are no priors involved in the tree estimation process. However, when the phylogeny estimation involves Bayesian methods, then evaluation of congruence should properly account for the effects of the assumed prior distributions. We propose a Bayesian approach to testing phylogenetic congruence (or, equivalently, dissonance) by comparing the marginal likelihoods of competing models. When only two models are compared, the ratio of marginal likelihoods is termed the Bayes factor (BF). Our approach is comparable to that of Leigh et al. (2008), but instead of comparing maximized log-likelihoods of competing models using LRTs, we use marginal likelihoods and their ratio (BF) for model comparison. Our approach is made possible by recent improvements in marginal likelihood estimation for phylogenetic model selection (stepping-stone, SS: Xie et al. 2010; Fan et al. 2011; path-sampling, PS: Lartillot and Philippe 2006; partition weighted kernel estimator: Wang et al. 2018). The SS and PS estimators substantially outperformed other approaches [e.g., harmonic mean estimator, and a posterior simulation-based analog of the Akaike information criterion through Markov chain Monte Carlo (MCMC)] for comparing models of demographic change and relaxed molecular clocks (Baele et al. 2012). Recently, Brown and Thomson (2016) also used BF to analyze sensitivity in clade resolution to the data types used to infer the topology.

Quantifying Topological Dissonance Through Entropies of Posterior Tree Distributions

Lewis et al. (2016) introduced Bayesian methods for measuring the phylogenetic information content of data and for measuring the degree of phylogenetic informational dissonance among data subsets. Phylogenetic dissonance is relevant to the problem of identifying congruent subsets of loci. When data are partitioned into subsets (corresponding to, for example, genes or codon positions), such tools yield insight into which data subsets have the greatest potential for producing well supported estimates of phylogeny. Conflict between different subsets with respect to tree topology can lead to paradoxical results with respect to both information content and estimated phylogeny. For example, a tree topology minimally supported by all data subsets may be given maximal support in a concatenated analysis if each subset is highly informative and effectively rules out the trees most supported by other subsets (Lewis et al. 2016). The information measure $D$ (phylogenetic dissonance) was introduced by Lewis et al. (2016) to specifically identify such anomalies. Phylogenetic dissonance is defined as

$$D = H_{\text{merged}} - \bar{H} \tag{1}$$

$$\bar{H} = \frac{1}{K} \sum_{k=1}^{K} H_k \tag{2}$$

where $H_k$ is the entropy of the marginal tree topology posterior distribution for data subset $k$ (of $K$ subsets), and $H_{\text{merged}}$ is the entropy of the distribution estimated from a merged tree sample. Posterior tree samples from separate analyses of each data subset are combined to form the merged tree sample. (Note that this merged tree sample is not the same as a tree sample obtained from a concatenated analysis.) If different data subsets strongly support mutually exclusive tree topologies, then the average entropy of marginal tree topology posterior distributions ($\bar{H}$) will be small while the merged entropy ($H_{\text{merged}}$) will be relatively large, because topology frequencies are more evenly distributed in the merged sample than in samples from individual subsets, which are each dominated by one tree topology. Lewis et al. (2016) defined and estimated phylogenetic dissonance using this entropy-based measure, but how to evaluate the statistical significance of a given level of phylogenetic dissonance remains an open question.
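As a rough illustration of the dissonance calculation, the sketch below estimates entropies directly from the frequencies of sampled topologies. This is a simplifying assumption made here for clarity: Galax estimates entropies from conditional clade probabilities instead. The function names and the toy samples (topologies reduced to the labels "A" and "B") are hypothetical.

```python
import math
from collections import Counter


def entropy(topologies):
    """Shannon entropy (in nats) of the empirical topology frequencies in one sample."""
    counts = Counter(topologies)
    n = len(topologies)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def dissonance(samples):
    """Entropy of the merged sample minus the average per-subset entropy."""
    merged = [t for sample in samples for t in sample]
    h_bar = sum(entropy(s) for s in samples) / len(samples)
    return entropy(merged) - h_bar


# Hypothetical posterior samples; each topology is represented by a label.
conflicting = [["A"] * 100, ["B"] * 100]  # two subsets, each certain of a different tree
concordant = [["A"] * 100, ["A"] * 100]   # two subsets, both certain of the same tree

print(dissonance(conflicting))  # equals log(2): maximal conflict between two subsets
print(dissonance(concordant))   # zero: no conflict
```

The toy samples have equal sizes; with unequal sample sizes, the merged sample would weight subsets unevenly.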

BF Calibration

It is standard practice to use the value $\mathrm{BF} = 1$ as the critical value determining whether the null model (e.g., CONCATENATED) or the alternative model (e.g., SEPARATE) wins. This makes sense when the prior predictive error probabilities of BF under both models are equal; however, in cases where models differ substantially in their effective dimensions, the distributions of the BF for the two models being compared may not be symmetrical. For example, it is possible that the probability that the BF favors the CONCATENATED model when the SEPARATE model is true may not equal the probability that BF favors the SEPARATE model when the CONCATENATED model is true. Under such circumstances, a different threshold value (other than 1) can be selected such that the probability of choosing the incorrect model under both hypotheses is equal. Subtle, nonintuitive behavior of BFs, especially with respect to models that differ with respect to the size of tree topology space (Bergsten et al. 2013), made us wary of using BF tests right out of the box, so to speak. García-Donato and Chen (2005) provided an example of a model comparison in which calibration was needed and suggested a method for calibrating the BF (adopted here) that makes the prior predictive error probabilities symmetrical.

Aims of this Study

The primary aim of our study is to evaluate the effectiveness of BFs in uncovering phylogenetic discordance and in assessing the significance of the phylogenetic dissonance measure $D$ (equation 1). We explore the behavior of BFs using simulations designed to create a spectrum of 10-gene data sets ranging from low to high information content and from complete topological concordance to extreme discordance (due to deep coalescence and subsequent ILS). We provide an empirical example involving concordance of plastid genes in the green algal order Volvocales which demonstrates that LRTs carried out using CONCATERPILLAR can differ from conclusions based on marginal likelihoods when analyses are performed in a Bayesian context.

We investigate whether BFs are correctly calibrated in our particular application: that is, does the standard BF = 1 cutoff tend to (a priori) favor the SEPARATE model over the CONCATENATED model, or vice versa?

Finally, we investigate the use of BFs for identifying heterotachy by fixing topology but allowing edge lengths to vary across gene subsets, and also under what circumstances the heterotachy model (hereafter HETERO model) detects true heterotachy versus other phenomena, such as topological discordance (Mendes and Hahn 2016) and heterobathy (different coalescent times rather than rate heterogeneity).

Materials and Methods

Parameters and Prior Distributions

Parameters of the models and the priors used in the GTR+G models employed in this article are:

(3)

(4)

(5)

(6)

(7)

(8)

(9)

(10)

$N_T$ equals the total number of distinct unrooted, binary, labeled tree topologies for $n$ taxa. There are $N_T = (2n-5)!!$ such topologies comprising tree topology space $\mathcal{T}$. The space of edge length proportions is $\mathcal{V}$, which is the $m$-simplex, where $m = 2n-3$ is the number of edge lengths in any tree topology in $\mathcal{T}$. The space of all other parameters $\boldsymbol{\theta}$ is $\Theta$, which in this article includes the 6-simplex for GTR exchangeabilities, the 4-simplex for nucleotide frequencies, and $\mathbb{R}^{+}$ for both tree length and the discrete gamma shape parameter quantifying the amount of among-site rate heterogeneity. Nucleotide frequencies were ordered A, C, G, T and exchangeabilities were ordered AC, AG, AT, CG, CT, GT. The Relative Rate Distribution is a transformed Dirichlet distribution and was described in Fan et al. (2011).
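To give a sense of the size of tree topology space, the sketch below counts unrooted, binary, labeled topologies using the standard double-factorial formula $(2n-5)!!$ and counts edges as $2n-3$. The function names are hypothetical and are not taken from the authors' scripts.

```python
def num_unrooted_topologies(n_taxa):
    """Number of unrooted, binary, labeled tree topologies: (2n-5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted binary tree")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


def num_edges(n_taxa):
    """Number of edges in an unrooted binary tree with n tips."""
    return 2 * n_taxa - 3


print(num_unrooted_topologies(4), num_edges(4))  # four taxa: 3 topologies, 5 edges
print(num_unrooted_topologies(6), num_edges(6))  # six taxa: 105 topologies, 9 edges
```

For the 6-taxon simulations described in this article, the SEPARATE model therefore sums over 105 topologies for every locus, while the CONCATENATED model sums over the same 105 topologies only once.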

Bayes Factors

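In Bayes’ Rule,

$$p(\tau, \mathbf{v}, \boldsymbol{\theta} \mid \mathbf{y}, M) = \frac{p(\mathbf{y} \mid \tau, \mathbf{v}, \boldsymbol{\theta}, M)\, p(\tau, \mathbf{v}, \boldsymbol{\theta} \mid M)}{p(\mathbf{y} \mid M)} \tag{11}$$

the denominator represents the marginal likelihood conditional on model $M$, $p(\mathbf{y} \mid M)$. The marginal likelihood is the total probability of data $\mathbf{y}$, averaged over tree topology $\tau$, a multivariate parameter vector $\mathbf{v}$ of edge length proportions, and a multivariate parameter vector $\boldsymbol{\theta}$ comprising all other model parameters (including tree length). Data $\mathbf{y}$ is a vector comprising observed patterns of states for all taxa for individual characters (sites in the case of sequence data). Considering two models, ($M_1$, $M_2$), and their marginal likelihoods, $p(\mathbf{y} \mid M_1)$ and $p(\mathbf{y} \mid M_2)$, respectively, the BF is the ratio $\mathrm{BF}_{12} = p(\mathbf{y} \mid M_1) / p(\mathbf{y} \mid M_2)$. The BF on the log-scale is calculated as:

$$\log \mathrm{BF}_{12} = \log p(\mathbf{y} \mid M_1) - \log p(\mathbf{y} \mid M_2) \tag{12}$$

where $\log \mathrm{BF}_{12} > 0$ signifies that model $M_1$ is preferred over $M_2$. By preferred, we mean that model $M_1$ fits the data better on average than model $M_2$ over the parameter- and tree-space defined by the prior. Applying this approach to the problem of phylogenetic congruence, consider data from a set of $K$ loci ($\mathbf{y}_1$, $\mathbf{y}_2$, ..., $\mathbf{y}_K$), and two models, CONCATENATED and SEPARATE. The CONCATENATED model represents the marginal likelihood of the concatenated set ($\mathbf{y}$) in which all loci are forced to have the same topology ($\tau$) and edge length proportions ($\mathbf{v}$), but allowed to differ in other model parameters ($\boldsymbol{\theta}_k$):

$$p(\mathbf{y} \mid \text{CONCATENATED}) = \sum_{\tau \in \mathcal{T}} \int_{\mathcal{V}} \int_{\Theta^{K}} \left[ \prod_{k=1}^{K} p(\mathbf{y}_k \mid \tau, \mathbf{v}, \boldsymbol{\theta}_k) \right] p(\tau)\, p(\mathbf{v})\, p(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K)\, d\boldsymbol{\theta}_1 \cdots d\boldsymbol{\theta}_K\, d\mathbf{v} \tag{13}$$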

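In contrast, the SEPARATE model represents the marginal likelihood for a model in which each individual locus $\mathbf{y}_k$ ($k = 1, \ldots, K$) is allowed to have its own topology ($\tau_k$), edge length proportions ($\mathbf{v}_k$), and model parameters ($\boldsymbol{\theta}_k$),

$$p(\mathbf{y} \mid \text{SEPARATE}) = \prod_{k=1}^{K} \left[ \sum_{\tau_k \in \mathcal{T}} \int_{\mathcal{V}} \int_{\Theta} p(\mathbf{y}_k \mid \tau_k, \mathbf{v}_k, \boldsymbol{\theta}_k)\, p(\tau_k)\, p(\mathbf{v}_k)\, p(\boldsymbol{\theta}_k)\, d\boldsymbol{\theta}_k\, d\mathbf{v}_k \right] \tag{14}$$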

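The BF for CONCATENATED against SEPARATE is defined

$$\mathrm{BF}_{CS} = \frac{p(\mathbf{y} \mid \text{CONCATENATED})}{p(\mathbf{y} \mid \text{SEPARATE})} \tag{15}$$

$\mathrm{BF}_{CS} > 1$ (or equivalently $\log \mathrm{BF}_{CS} > 0$) indicates that the CONCATENATED model (numerator) is preferred over the SEPARATE model (denominator), whereas $\mathrm{BF}_{CS} < 1$ ($\log \mathrm{BF}_{CS} < 0$) indicates the reverse (i.e., SEPARATE model is the preferred model).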

A third, intermediate model HETERO links topology across subsets but allows edge lengths to vary between single-gene data sets:

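$$p(\mathbf{y} \mid \text{HETERO}) = \sum_{\tau \in \mathcal{T}} \int_{\mathcal{V}^{K}} \int_{\Theta^{K}} \left[ \prod_{k=1}^{K} p(\mathbf{y}_k \mid \tau, \mathbf{v}_k, \boldsymbol{\theta}_k)\, p(\mathbf{v}_k) \right] p(\tau)\, p(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K)\, d\boldsymbol{\theta}_1 \cdots d\boldsymbol{\theta}_K\, d\mathbf{v}_1 \cdots d\mathbf{v}_K \tag{16}$$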

Data Simulation

Gene trees were simulated within species trees using parameter combinations chosen to yield differing levels of phylogenetic incongruence. Using a Python script (source code provided in Supplementary Materials available on Dryad at http://dx.doi.org/10.5061/dryad.gt81779), one thousand 6-taxon species trees were generated under a pure-birth (Yule) process in which the tree height $T$ (expected number of substitutions along a single path from root to tip) was drawn from a Lognormal(0.05, 0.22) distribution (mean 1.08, 95% of samples between 0.68 and 1.62). Ten gene trees were simulated within each species tree using coalescent parameter $\theta = 4N_e\mu$, where $N_e$ is the effective (diploid) population size and $\mu$ is the mutation rate per generation. For each species tree, the ratio $\theta/T$ was drawn from a Lognormal(0.60, 0.77) distribution (which has mean 2.45 with 95% of samples between 0.40 and 8.24) and $\theta$ was determined by multiplying this ratio by the value of $T$ used for a specific species tree. Increasing $\theta$ relative to $T$ results in a higher number of deep coalescences, causing increased discordance among the gene trees.
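The summary statistics quoted for the two lognormal distributions can be checked with a short calculation, assuming the two arguments are the mean and standard deviation on the log scale and that the 95% interval is the central one (log-scale mean plus or minus 1.96 standard deviations). The sketch below is not the authors' Dryad script; the function name, variable names, and random seed are hypothetical.

```python
import math
import random


def lognormal_summary(mu, sigma):
    """Mean and central 95% interval of a lognormal with log-scale mean mu and SD sigma."""
    mean = math.exp(mu + sigma ** 2 / 2)
    lower = math.exp(mu - 1.96 * sigma)
    upper = math.exp(mu + 1.96 * sigma)
    return mean, lower, upper


height_summary = lognormal_summary(0.05, 0.22)  # tree height distribution
ratio_summary = lognormal_summary(0.60, 0.77)   # distribution of the theta / height ratio
print([round(x, 2) for x in height_summary])
print([round(x, 2) for x in ratio_summary])

# One illustrative draw of the simulation parameters for a single species tree.
rng = random.Random(1)  # arbitrary seed, chosen only for reproducibility
tree_height = rng.lognormvariate(0.05, 0.22)
ratio = rng.lognormvariate(0.60, 0.77)
theta = ratio * tree_height  # coalescent parameter for this species tree
```

Under these assumptions the computed values agree with the figures given in the text (1.08, 0.68 to 1.62, and 2.45, 0.40 to 8.24).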

The gene trees thus generated were subsequently used to simulate DNA sequence alignments of length 2000 sites using SEQ-GEN (Rambaut and Grassly 1997) under the GTR+G model (equal nucleotide relative frequencies, exchangeabilities = Dirichlet(1,2,1,1,2,1), discrete Gamma shape parameter ). Individual single-gene data sets and the concatenated data set were used to compute marginal likelihoods using the stepping-stone method (Xie et al. 2010) implemented in MrBayes (Ronquist et al. 2012). For the concatenated data set, two marginal likelihoods were estimated by enforcing: 1) the same topology and edge lengths for all sites (CONCATENATED model) and 2) the same topology but allowing edge lengths to vary among single-gene data subsets to account for nontopological gene tree variation (HETERO model). Because the marginal likelihood for the SEPARATE model (14) is the product of $K$ independent terms (each corresponding to a different locus and each integrating out all unobserved quantities, including the tree topology), each term can be estimated using an analysis of one locus in isolation. The total log marginal likelihood for the SEPARATE model can then be obtained as the sum of log marginal likelihoods from each locus analyzed separately.
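The bookkeeping described above can be sketched in a few lines: sum the per-locus log marginal likelihoods to obtain the SEPARATE value, then subtract it from the CONCATENATED value to obtain the log BF. The function names are hypothetical, and the numbers are invented for illustration only; they are not results from this study.

```python
def log_marginal_separate(per_locus_log_ml):
    """Log marginal likelihood of the SEPARATE model: sum over independently analyzed loci."""
    return sum(per_locus_log_ml)


def log_bayes_factor(log_ml_numerator, log_ml_denominator):
    """Log BF of the numerator model against the denominator model."""
    return log_ml_numerator - log_ml_denominator


# Invented log marginal likelihood estimates (e.g., from stepping-stone runs).
per_locus = [-5100.0, -5120.0, -4980.0]  # one value per single-gene analysis
log_ml_concatenated = -15230.0           # from the concatenated analysis

log_ml_separate = log_marginal_separate(per_locus)
log_bf_cs = log_bayes_factor(log_ml_concatenated, log_ml_separate)

# Uncalibrated cutoff: log BF > 0 favors the numerator (CONCATENATED).
preferred = "CONCATENATED" if log_bf_cs > 0 else "SEPARATE"
print(log_bf_cs, preferred)
```

With these invented values the log BF is negative, so the SEPARATE model would be preferred under the uncalibrated cutoff of log BF = 0. The same subtraction applies to comparisons involving the HETERO model.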

The BF results were evaluated with respect to the phylogenetic information content ($I$) and phylogenetic dissonance ($D$) values computed using Galax v1.0.0 (Lewis et al. 2016). Estimation of $I$ and $D$ uses conditional clade probabilities (Larget 2013) to estimate Shannon entropy (Shannon 1948), from which $I$ is calculated simply as a difference between the entropies of the marginal prior and marginal posterior distributions of tree topology (Lindley 1956). The phylogenetic dissonance is defined as in equation (1), and thus $D$ is computed as the entropy of the merged tree sample minus the average entropy of tree samples from individual genes. Phylogenetic dissonance is expected to be zero for comparisons of independent MCMC samples from the same posterior distribution, and thus provides a sensitive measure of MCMC convergence with respect to tree topology (Lewis et al. 2016). We replicated each single-gene and concatenated MCMC analysis and computed $D$ for these paired samples as a way of ensuring that post burn-in MCMC sample size was sufficient for convergence.

Example from the Green Algal Order Volvocales

We tested phylogenetic congruence among 56 protein-coding plastid genes used in Fučíková et al. (2016), focusing on one of the most topologically consistent parts of the tree, the green algal order Volvocales. The Volvocales data set consisted of a subset of the Sphaeropleales, Volvocales, and OCC (Oedogoniales, Chaetopeltidales, and Chaetophorales) clades studied in Fučíková et al. (2016). We included four of the five Volvocales members from the study: Chlamydomonas reinhardtii, Gonium pectorale, Pleodorina starrii, and Volvox carteri. These four genomes were originally published by Smith et al. (2013). The length of the plastid genes ranged from 93 sites (psbT) to 2259 sites (psaA) after trimming portions with a high proportion of missing data at the ends. We conducted BF tests for all possible pairs of the 56 genes used in the study (by estimating marginal likelihoods under the CONCATENATED, HETERO, and SEPARATE models), with the aim of detecting possible outlier genes among those sampled for the concatenated phylogeny. To compare our results with the likelihood-based approach, we also tested congruence among these 56 genes using CONCATERPILLAR (Leigh et al. 2008) with its topological congruence test (-t) option.
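The scale of the pairwise design can be seen by enumerating the gene pairs. The sketch below uses placeholder gene labels rather than the actual gene names. It also assumes, as one plausible reading of the design, that single-gene marginal likelihoods are estimated once and reused across pairs for the SEPARATE model.

```python
from itertools import combinations

# Placeholder labels standing in for the 56 plastid genes.
genes = [f"gene_{i}" for i in range(1, 57)]

# Every unordered pair of distinct genes.
pairs = list(combinations(genes, 2))
print(len(pairs))  # 56 * 55 / 2 = 1540 pairwise comparisons

# Assumed workload: each pair needs a CONCATENATED and a HETERO analysis,
# while SEPARATE reuses one single-gene analysis per gene.
n_pair_analyses = 2 * len(pairs)
n_single_gene_analyses = len(genes)
print(n_pair_analyses, n_single_gene_analyses)
```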