Estimation of Phylogeny Using a General Markov Model

Vivek Jayaswal; Lars S Jermiin; John Robinson

2007 Feb 25;1:62–80.

PMCID: PMC2658871 PMID: 19325854

Abstract

The non-homogeneous model of nucleotide substitution proposed by Barry and Hartigan (Stat Sci, 2: 191–210) is the most general model of DNA evolution assuming an independent and identical process at each site. We present a computational solution for this model, and use it to analyse two data sets, each violating one or more of the assumptions of stationarity, homogeneity, and reversibility. The log likelihood values returned by programs based on the F84 model (J Mol Evol, 29: 170–179), the general time reversible model (J Mol Evol, 20: 86–93), and Barry and Hartigan’s model are compared to determine the validity of the assumptions made by the first two models. In addition, we present a method for assessing whether sequences have evolved under reversible conditions and discover that this is not so for the two data sets. Finally, we determine the most likely tree under the three models of DNA evolution and compare these with the one favoured by the tests for symmetry.

Keywords: Phylogenetics, Maximum Likelihood, Reversibility, Tests for Symmetry, Nucleotide Sequence Evolution

1. Introduction

The evolutionary relationship between a set of k homologous sequences of N nucleotides can be represented by a k-leaved bifurcating tree where each leaf node represents a known sequence and each internal node represents an ancestral sequence (which is almost always unknown). The phylogeny of the k sequences can be inferred by using maximum-likelihood methods, which rely on models of nucleotide substitution to infer the most likely tree. Popular phylogenetic methods, like those implemented in PHYLIP (Felsenstein 2004a), PAUP* (Swofford 2002), and Tree-Puzzle (Schmidt et al 2002), use models of nucleotide substitution that assume the evolutionary process is stationary, homogeneous, and reversible. Although a detailed mathematical description of stationarity, homogeneity and reversibility can be found in Ababneh et al (2006a), we will give a brief description of these terms in the context of molecular phylogenetics. Stationarity implies that the marginal probabilities of the four nucleotides remain constant over all the nodes of a given tree. Homogeneity implies that the instantaneous rate matrix (described in, eg, Lanave et al 1984; Kishino and Hasegawa 1989) is constant over an edge (local homogeneity) or constant over the entire tree (global homogeneity). Reversibility implies that the probability of sampling nucleotide i from the stationary distribution and going to nucleotide j is the same as the probability of sampling nucleotide j from the stationary distribution and going to nucleotide i, where i, j ∈ {A,C,G,T} (Bryant et al 2004). Reversibility, therefore, implies that the process is stationary and permits us to ignore the direction of evolution. The assumptions of stationarity, homogeneity and reversibility are often violated by the data, resulting in an elevated probability of incorrect phylogenetic results (for examples of the complexity of the problem, see Ho and Jermiin 2004; Jermiin et al 2004).

In a landmark article, Barry and Hartigan (1987a) considered a general Markov model for unrooted trees, the assumptions being that the process relating each pair of nodes in the tree is Markovian and the sites are independent and identically distributed. Their model does not make the assumption of stationarity, homogeneity (local or global) or reversibility, so it is more general than the non-stationary but locally homogeneous models considered by Yang (1994) and Yang and Roberts (1995). Barry and Hartigan (1987a) considered a k-taxa tree with 2k−3 Q-matrices, one for each edge of the tree. Each Q-matrix represents the joint probability distribution of nucleotides at the two ends of the associated edge and is a 4 × 4 matrix. Since the sequences at internal nodes are not known, we can only observe the 4^k different combinations of nucleotides at the leaf nodes. These combinations together represent the joint probability distribution of the nucleotides at leaf nodes and can be written as a function of the Q-matrices. Thus, the likelihood of the observed sequences is a function of the set of Q-matrices and can be maximised by determining the maximum-likelihood estimates of the Q-matrices. The algorithm for obtaining the maximum-likelihood estimates was suggested by Barry and Hartigan (1987a) but has not received the attention that other aspects of their paper have, especially calculation of LogDet distance (Lockhart et al 1994; Steel 1994), probably because the large number of parameters was assumed to make the interpretation difficult.

We revisit Barry and Hartigan’s (1987a) model, and describe it and the estimation algorithm in new notation. We also present a program written in Java™ to implement it. We examine the information that can be obtained from the estimates by applying their algorithm, henceforth referred to as the BH algorithm, to two sets of data, one comprising mitochondrial DNA from seven hominoids, where there is apparent stationarity and homogeneity, and another comprising 16S ribosomal RNA genes from five bacterial genomes, where problems due to lack of stationarity and homogeneity have been noted previously by Galtier and Gouy (1995) as well as Foster (2004). Further, we compare the results obtained from our program with those obtained using simpler models, ie, the F84 model (Kishino and Hasegawa 1989), implemented in DNAML from the PHYLIP program package (Felsenstein 2004a), and the general time reversible (GTR) model (Lanave et al 1984), implemented in PAUP* (Swofford 2002). A likelihood ratio test (Huelsenbeck and Crandall 1997), based on the log likelihood values obtained using the phylogenetic programs, is used to determine whether one or more of the assumptions of stationarity, homogeneity, and reversibility are violated.

The joint probability distribution values for each edge of the tree can be used to determine (a) marginal probabilities at the nodes (internal nodes as well as leaf nodes), and (b) the joint probability distribution of a pair of leaf nodes. The assumption of stationarity can be examined by comparing the marginal probabilities at different leaf nodes (Ababneh et al 2006b). Since Barry and Hartigan’s (1987a) method gives estimates of the joint distribution of the two end points of each edge, we can evaluate the hypothesis of reversibility by examining the joint distribution — it should be symmetric if the process is reversible. We do so for the two sets of data mentioned earlier, obtaining the surprising result that the stationary and homogeneous model for the hominoid data appears to be not reversible along some of the edges. Such comparisons seem to be possible for only a part of the tree for the bacterial data since this data set is not stationary.

2. A General Markov Model on Trees

The general Markov model for phylogenetic trees proposed by Barry and Hartigan (1987a) will be given using a notation that permits a more compact description. Consider an unrooted binary tree, T, (for definitions, see Chapter 1 of Semple and Steel (2003)) with l leaves, l − 2 internal nodes (or vertices), and 2l − 3 edges, for l ≥ 2. For convenience, we include l = 1 with 0 internal nodes and 0 edges. Denote leaves by L = {−1, …, −l} and internal nodes by I = {1, …, l − 2} (the notation of positives and negatives derives from the merge matrix given by the hierarchical clustering algorithm, hclust, in the S-PLUS or R packages). The set of all vertices is V = L ∪ I. Denote edges by E = {(i, j): i, j ∈ V and adjacent}. By inserting a node numbered 0, called a root node, on any edge, and thus increasing the number of nodes and the number of edges by one, the unrooted tree can be converted into a rooted binary tree. If an edge (i, j) of the unrooted tree is deleted then two rooted sub-trees T_(i,j) and T_(j,i) are formed with roots at i and j, respectively.

The tree will be used to describe a model for evolutionary relationships at a site in the DNA, as in Barry and Hartigan (1987a), by considering the joint distribution of the four bases B = {A,C,G,T} at the leaves. First consider the joint distribution at ends of any edge. Let X_i and X_j be the values taken by bases at nodes i and j of the edge (i, j). Write

Q_{(i, j)} (x, y) = P (X_{i} = x, X_{j} = y)

(1)

for x, y ∈ B, as the joint probability. Note that, since consistency of marginal distributions at internal nodes is required, for i ∈ I,

P (X_{i} = x) = Q_{i} (x) = \sum_{y \in B} Q_{(i, j)} (x, y)

for any j such that (i, j) is an edge.

More generally, let X = (X_L, X_I) denote the vector of random variables with X_L = (X_−1, …, X_−l) and X_I = (X_1, …, X_l−2), and with each X_i taking values in B. Let Q_T(x) = P(X = x) be the joint distribution of the bases at the nodes of T. The joint distribution of the bases at the leaves is then

Q_{L} (x_{L}) = \sum_{x_{i} : i \in I} Q_{T} (x) .

(2)

Further, if L_(i,j) = L ∩ T_(i,j) and I_(i,j) = I ∩ T_(i,j) denote the sets of leaf nodes and internal nodes, respectively, in T_(i,j), then X_L(i,j) and X_T(i,j) denote the vectors of the values of bases in L_(i,j) and T_(i,j), respectively, and

Q_{L_{(i, j)}} (x_{L_{(i, j)}}) = \sum_{x_{k} : k \in I_{(i, j)}} Q_{T_{(i, j)}} (x_{T_{(i, j)}})

(3)

Take the model to be Markovian, so that, given (X_i, X_j) = (x_i, x_j), the conditional distributions of the bases on the leaves of the rooted sub-trees T_(i,j) and T_(j,i), obtained by deleting edge (i, j), are independent. Under this Markovian model the joint distribution Q_L(x_L) can be written as a product of terms involving only Q_(i,j)(x_i, x_j) and Q_i(x_i) for all edges (i, j) and all nodes i. At each site α = 1, …, N the value of a base at the i-th leaf, x_iα, is known, but at internal nodes the base can take any value in B. Let B_iα = {x_iα} if i ∈ L, and B_iα = B if i ∈ I. Then the joint probability distribution of leaf nodes at site α can be expressed as

Q_{L, α} (x_{L}) = \sum_{x_{i} \in B} \sum_{x_{j} \in B} I (x_{i} \in B_{i α}, x_{j} \in B_{j α}) Q_{(i, j)} (x_{i}, x_{j}) P (L_{(i, j)} | x_{i}) P (L_{(j, i)} | x_{j}) .

(4)

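where I(x_i ∈ B_iα, x_j ∈ B_jα) is an indicator function that takes the value 1 if x_i ∈ B_iα and x_j ∈ B_jα, and 0 otherwise; for a leaf node this restricts the sum to the base observed at site α, whereas for an internal node every base in B is allowed. Also,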

P (L_{(i, j)} | x_{i}) = \frac{Q_{L_{(i, j)} \cup \{x_{i}\}} (x_{L_{(i, j)}}, x_{i})}{Q_{i} (x_{i})}

This formula can be applied recursively to the joint distribution on a smaller tree,

Q_{L_{(i, j)}} (x_{L_{(i, j)}})

until trees with only one edge are reached.

Notice that it is not necessary in this general case to put any restrictions on the model producing the joint distributions on each edge, other than consistency at internal nodes noted earlier.
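The recursion described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' Java program: the function names, the dictionary-based tree, the four-leaf example tree and the coding of bases as A,C,G,T = 0..3 are all assumptions made here. Leaves carry negative labels and internal nodes positive labels, as in the text, and every edge is given the same symmetric Q-matrix so that the marginal distributions at internal nodes are consistent.

```python
from itertools import product

def joint(Q, i, j, x, y):
    """Q_(i,j)(x, y); each edge is stored once, so transpose when reversed."""
    return Q[(i, j)][x][y] if (i, j) in Q else Q[(j, i)][y][x]

def allowed(site, i):
    """B_i_alpha: the observed base at a leaf (i < 0), every base otherwise."""
    return [site[i]] if i < 0 else range(4)

def cond_leaves(Q, adj, site, i, j, x):
    """P(L_(i,j) | x): probability of the observed leaves in the sub-tree
    rooted at i (edge (i, j) deleted), given base x at node i."""
    p = 1.0
    for k in adj[i]:
        if k == j:
            continue
        q_i = sum(joint(Q, i, k, x, y) for y in range(4))   # marginal Q_i(x)
        p *= sum(joint(Q, i, k, x, y) / q_i * cond_leaves(Q, adj, site, k, i, y)
                 for y in allowed(site, k))
    return p

def site_likelihood(Q, adj, site, i, j):
    """Right-hand side of (4), evaluated on edge (i, j)."""
    return sum(joint(Q, i, j, x, y)
               * cond_leaves(Q, adj, site, i, j, x)
               * cond_leaves(Q, adj, site, j, i, y)
               for x in allowed(site, i) for y in allowed(site, j))

# Hypothetical four-leaf tree: leaves -1, -2 hang off node 1; -3, -4 off node 2.
adj = {1: [-1, -2, 2], 2: [-3, -4, 1], -1: [1], -2: [1], -3: [2], -4: [2]}
q = [[1 / 8 if x == y else 1 / 24 for y in range(4)] for x in range(4)]
Q = {(1, -1): q, (1, -2): q, (1, 2): q, (2, -3): q, (2, -4): q}
site = {-1: 0, -2: 0, -3: 1, -4: 1}     # pattern AACC
on_internal_edge = site_likelihood(Q, adj, site, 1, 2)
on_leaf_edge = site_likelihood(Q, adj, site, -3, 2)
total = sum(site_likelihood(Q, adj, dict(zip((-1, -2, -3, -4), pattern)), 1, 2)
            for pattern in product(range(4), repeat=4))
```

When the marginals are consistent, the value of (4) should not depend on which edge is used to evaluate it, and the probabilities of all 4^4 leaf patterns should sum to one; the last three lines compute the quantities needed to check both properties.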

3. Estimation

For an unrooted binary tree, T, based on k homologous sequences, each having N sites, Barry and Hartigan (1987a) gave a method of estimating the set of Q_(i,j)(x, y) for x, y ∈ B, (i, j) ∈ E by maximizing the log likelihood of the bases at the leaves. Using (4), the log likelihood for an unrooted tree is

\begin{array}{l} L = \sum_{α = 1}^{N} \log Q_{L, α} (X_{L}) \\ = \sum_{α = 1}^{N} \log \sum_{x \in B} \sum_{y \in B} I (x \in B_{i α}, y \in B_{j α}) Q_{(i, j)} (x, y) P (L_{(i, j)} | x) P (L_{(j, i)} | y) \end{array}

(5)

Now, maximizing L with respect to Q_(i,j)(x, y) subject to

\sum_{x, y \in B} Q_{(i, j)} (x, y) = 1

requires equating the derivatives of L + λ (∑_{x, y ∈ B} Q_(i,j)(x, y) − 1) with respect to Q_(i,j)(x, y) and λ to zero, which leads to the updating equation

Q_{(i, j)} (x, y) = \frac{1}{N} \sum_{α = 1}^{N} \frac{I (x \in B_{i α}, y \in B_{j α}) Q_{(i, j)} (x, y) P (L_{(i, j)} | x) P (L_{(j, i)} | y)}{Q_{L} (x_{L})} .

(6)

In order to minimize computational time, suitable initial values are chosen for all Q_(i,j)(x, y). Then (6) is used repeatedly on all edges to update the left hand side using current values for the right hand side until the process converges. Call the values obtained

{\hat{Q}}_{(i, j)} (x, y) .
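To make the updating step concrete, the following Python sketch applies (6) to the smallest non-trivial case, a star tree with one internal node and three leaves. It is a simplified illustration under stated assumptions rather than the general algorithm: the function names, the ten-site data set and the base coding A,C,G,T = 0..3 are hypothetical, all three edges are updated simultaneously from the current values, and the starting matrices are assumed to share the same marginal distribution at the internal node.

```python
import math

def row_conditionals(q):
    """Turn joint rows Q(x, .) into conditional rows P(. | x)."""
    out = []
    for row in q:
        s = sum(row)
        out.append([v / s if s > 0 else 0.0 for v in row])
    return out

def update_star(Qs, sites):
    """One pass of update (6) on a three-leaf star tree.

    Qs[k][x][y] is the joint probability of base x at the internal node and
    base y at leaf k; sites is a list of (a0, a1, a2) base indices.
    Returns the updated matrices and the log likelihood of the old ones.
    """
    n = len(sites)
    cond = [row_conditionals(q) for q in Qs]
    # Marginal at the internal node (assumed to be the same on every edge).
    pi = [sum(Qs[0][x]) for x in range(4)]
    new = [[[0.0] * 4 for _ in range(4)] for _ in Qs]
    loglik = 0.0
    for site in sites:
        # Joint probability of the site pattern and base x at the internal node.
        joint = [pi[x] * math.prod(cond[k][x][site[k]] for k in range(3))
                 for x in range(4)]
        lik = sum(joint)                      # Q_L for this site, as in (4)
        loglik += math.log(lik)
        for k in range(3):
            for x in range(4):
                new[k][x][site[k]] += joint[x] / (n * lik)
    return new, loglik

# Hypothetical data: ten sites for three sequences.
sites = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (2, 2, 2), (3, 3, 3),
         (0, 1, 0), (2, 2, 3), (3, 3, 3), (1, 1, 0), (2, 0, 2)]
start = [[1 / 8 if x == y else 1 / 24 for y in range(4)] for x in range(4)]
Qs, history = [start, start, start], []
for _ in range(25):
    Qs, loglik = update_star(Qs, sites)
    history.append(loglik)
```

In this simultaneous form the update coincides with an expectation-maximization step for a latent-class model, so the values stored in `history` should never decrease from one pass to the next.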

3.1. Precise Fit At Leaf Nodes

If i ∈ L, then summing in (6) over y ∈ B gives

{\hat{Q}}_{i} (x) = \frac{N_{i} (x)}{N}

where N_i(x) denotes the number of sites at leaf i that have base x. Thus the maximization leads to a precise fit at the leaves.
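The intermediate step can be spelled out as follows; this is a short sketch in the notation of (4) and (6). For a leaf i, B_iα = {x_iα}, so the sum over x in (4) reduces to the single term x = x_iα. Summing the numerator of (6) over y therefore returns the site likelihood when x = x_iα and zero otherwise, which gives

\sum_{y \in B} {\hat{Q}}_{(i, j)} (x, y) = \frac{1}{N} \sum_{α = 1}^{N} I (x = x_{i α}) \frac{{\hat{Q}}_{L} (x_{L α})}{{\hat{Q}}_{L} (x_{L α})} = \frac{N_{i} (x)}{N}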

3.2. Internal Consistency

If i ∈ I and edges (i, j) and (i, k) are in E, then the estimates of the marginal probabilities at internal node i are consistent if the sum over y of Q̂_(i,j)(x, y) equals the sum over z of Q̂_(i,k)(x, z). Now from (6) we get

\begin{array}{l} \sum_{y \in B} {\hat{Q}}_{(i, j)} (x, y) = \frac{1}{N} \sum_{α = 1}^{N} \frac{\sum_{y \in B} I (x \in B_{i α}, y \in B_{j α}) {\hat{Q}}_{(i, j)} (x, y) P (L_{(i, j)} | x) P (L_{(j, i)} | y)}{{\hat{Q}}_{L} (x_{L α})} \\ = \frac{1}{N} \sum_{α = 1}^{N} \frac{{\hat{Q}}_{L \cup \{i\}} (x_{L α}, x)}{{\hat{Q}}_{L} (x_{L α})} \end{array}

where x_Lα is the vector of values of bases of leaves at the αth site. The same formula is obtained by summing over z in

{\hat{Q}}_{(i, k)} (x, z)

showing that the estimates are internally consistent.

4. Algorithm Implementation

The BH algorithm was implemented in Java (Java™ 2 Platform Standard Edition, Version 1.4.2_03) using an object-oriented approach. The main classes in the program are NewickTreeTraversal, BranchDetails and MaximumAverageLikelihood. The class NewickTreeTraversal reads the unrooted tree in Newick format (Felsenstein 2004b) and constructs a binary tree. Each node is linked to a maximum of three nodes, ie, one parent node and two descendant nodes. The class BranchDetails stores the joint probability distribution values along each edge of the binary tree. The class MaximumAverageLikelihood makes use of the above-mentioned classes to compute the log likelihood values and update joint probability distribution values (using formulae described in Section 3) for a user-specified tree. It also generates an output file containing the final joint probability distribution values along each edge and the log likelihood value for the entire tree. The joint probability distribution values can be used to compute divergence matrices; a helper program has been written in Java™ for this purpose.

We make use of recursion to compute the joint conditional probability distribution of all the leaf nodes connected to the sub-tree rooted at node i, ie, P(L_(i,j)|x). The method starts by calculating the joint probability of node i and its immediate descendant nodes. If node i is an internal node, x ∈ B and the joint probability distribution is the sum of joint probability values obtained for different nucleotide values at node i. If a descendant node is an internal node, we consider the sub-tree rooted at the descendant node and compute P(L_(i,j)|x) for that sub-tree. This process is repeated until the leaf nodes are reached.

The initial joint probability distribution values (ie, the Q-values) along the edges of the binary tree are provided by the user. At the end of each iteration, we compare the Q-values before and after updating. If the sum of squared differences for an edge is greater than a user-specified threshold, the Q-values are updated and the next iteration begins. If none of the edges needs to be updated, convergence has been achieved, and the program terminates.
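The stopping rule can be illustrated with a small Python sketch. The names `squared_change`, `iterate` and `halfway`, the edge label and the default threshold are hypothetical, and the stand-in update simply moves each Q-value halfway towards a fixed target so that the loop can be run without a full likelihood computation.

```python
def squared_change(old, new):
    """Sum of squared differences between two 4 x 4 matrices."""
    return sum((o - n) ** 2 for r_old, r_new in zip(old, new)
               for o, n in zip(r_old, r_new))

def iterate(q_by_edge, update, threshold=1e-12, max_iter=10_000):
    """Apply `update` until no edge changes by more than `threshold`."""
    for iteration in range(1, max_iter + 1):
        proposed = update(q_by_edge)
        moved = [e for e in q_by_edge
                 if squared_change(q_by_edge[e], proposed[e]) > threshold]
        if not moved:
            return q_by_edge, iteration
        # Only edges whose change exceeds the threshold take the new values.
        q_by_edge = {e: proposed[e] if e in moved else q_by_edge[e]
                     for e in q_by_edge}
    raise RuntimeError("no convergence within max_iter iterations")

# Stand-in update for illustration only: move halfway towards a fixed target.
target = [[1 / 8 if x == y else 1 / 24 for y in range(4)] for x in range(4)]

def halfway(q_by_edge):
    return {e: [[(v + t) / 2 for v, t in zip(row, t_row)]
                for row, t_row in zip(q, target)]
            for e, q in q_by_edge.items()}

start = {("Node-2", "Node-3"): [[1 / 16] * 4 for _ in range(4)]}
final, n_iter = iterate(start, halfway)
```

In a real run the `update` argument would be one pass of (6) over all edges; the threshold trades accuracy against the number of iterations.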

To improve the performance of the program, henceforth referred to as the BH program, all unique site patterns in a given data set of matched nucleotide sequences are identified at the beginning of the program. The log likelihood value is computed only once for each unique pattern and the result is multiplied by the number of times that pattern occurs. This is a commonly used procedure to reduce the time needed to estimate the likelihood of a tree.
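A minimal Python sketch of this pattern-compression step is given below. The alignment and the per-site likelihood function are hypothetical placeholders; only the counting and weighting logic is the point of the example.

```python
import math
from collections import Counter

def site_patterns(alignment):
    """Count each distinct alignment column (site pattern)."""
    return Counter(zip(*alignment))

def log_likelihood(alignment, site_likelihood):
    """Sum log-likelihoods pattern by pattern, weighting by multiplicity."""
    return sum(count * math.log(site_likelihood(pattern))
               for pattern, count in site_patterns(alignment).items())

# Hypothetical three-sequence alignment and a stand-in per-site likelihood.
alignment = ["AACGA", "AACTA", "AAGGA"]

def toy_likelihood(pattern):
    return 0.1 if len(set(pattern)) == 1 else 0.01

patterns = site_patterns(alignment)
compressed = log_likelihood(alignment, toy_likelihood)
naive = sum(math.log(toy_likelihood(column)) for column in zip(*alignment))
```

Here the five columns collapse to three patterns, so the likelihood function is called three times instead of five while `compressed` and `naive` agree.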

The software will be available for download from http://www.usyd.edu.au/SUBIT/.

4.1. Computation of Edge Length

If the nucleotide sites are independent and identically distributed and the underlying model of evolution is stationary, homogeneous and reversible, we can compute edge lengths (Lanave et al 1984; Tavaré 1986; Rodríguez et al 1990) using the formula

δ_{i j} = - t \sum_{h = 1}^{4} π_{h} r_{h h}

where δ_ij denotes the distance between sequences at nodes i and j in terms of expected number of substitutions per site, t denotes time, π_h denotes the h-th element of the diagonal matrix of stationary probabilities, and r_hh denotes the h-th diagonal element of the rate matrix. A method for determining asynchronous distances was proposed by Barry and Hartigan (1987b). Although their method can be applied to the general model, if the marginal probabilities at the two ends of an edge are different, the distances are asymmetric (ie, for an edge (i, j), the distance from i to j and from j to i are different). In our paper, we have averaged the distances over the two possible directions of traversal for the purpose of edge length comparison with DNAML. Since the BH algorithm is based on joint probability distributions along the edges and does not require branch length optimization, the averaging of branch lengths does not affect the maximum-likelihood computation.
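As a worked example of the edge-length formula, the Python sketch below evaluates it for a hypothetical rate matrix in which every off-diagonal rate equals a, so that each diagonal element is −3a and the stationary distribution is uniform. The function name and the numerical values are assumptions chosen for illustration.

```python
def edge_length(pi, rate_matrix, t):
    """Expected substitutions per site: -t * sum_h pi[h] * r[h][h]."""
    return -t * sum(p * rate_matrix[h][h] for h, p in enumerate(pi))

# Hypothetical rate matrix with equal off-diagonal rates a.
a = 0.01
R = [[-3 * a if i == j else a for j in range(4)] for i in range(4)]
pi = [0.25] * 4
delta = edge_length(pi, R, t=1.0)   # 4 * 0.25 * 3a * t = 3at
```

With a = 0.01 and t = 1 the formula reduces to 3at, ie, 0.03 expected substitutions per site.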

4.2. Variation in log likelihood values

For a given data set of homologous nucleotide sequences, the log likelihood value at convergence depends on the initial set of Q-values. This was observed in both five-taxa and seven-taxa trees irrespective of the tree selected. This indicates the presence of multiple local maxima on the likelihood surface even for the most likely tree. This is an important result because former studies of the problem of multiple maxima on the log likelihood surface have assumed stationary, homogeneous, and reversible models of evolution. Chor et al (2000) showed that even for simple models of evolution, multiple maxima are possible while Rogers and Swofford (1999) used simulation to show that the best tree is unlikely to have multiple maxima.

For the two data sets analysed below, convergence to a local maximum, different from the global maximum, was observed only if the Q-values chosen were extreme; for example, a Q-matrix with all the joint probabilities being equal or a Q-matrix with diagonal elements much smaller than off-diagonal elements. From the Q-matrices that converged to the global maximum, we randomly selected one with a value of 1/8 along the main diagonal and 1/24 elsewhere for the computation of log likelihood values mentioned in Section 5.
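The starting matrix quoted above is easy to construct and check. The following Python sketch uses exact fractions; the helper name `initial_q` is hypothetical.

```python
from fractions import Fraction

def initial_q(diagonal=Fraction(1, 8)):
    """4 x 4 joint matrix with a constant diagonal and the remaining
    probability spread evenly over the 12 off-diagonal cells."""
    off = (1 - 4 * diagonal) / 12
    return [[diagonal if x == y else off for y in range(4)] for x in range(4)]

Q0 = initial_q()
total = sum(sum(row) for row in Q0)       # 4 * 1/8 + 12 * 1/24
row_marginals = [sum(row) for row in Q0]  # marginal probability of each base
```

The 16 entries sum to one, and because the matrix is symmetric both ends of the edge start with a uniform marginal distribution of 1/4 per base.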

5. Application to two sets of homologous sequences

Under the Markovian model of DNA evolution, the process of evolution may or may not be stationary and homogeneous. We consider both cases and argue that the general model of DNA evolution proposed by Barry and Hartigan (1987a) is useful in both cases. For each data set, we (i) used three matched-pairs tests of homogeneity (Bowker 1948; Stuart 1955; Ababneh et al 2006b) to determine whether the sequences could be assumed to have evolved under stationary and homogeneous conditions (a prerequisite for using most phylogenetics methods); (ii) determined the degrees of freedom needed in order to compare phylogenetic results using likelihood-ratio tests; (iii) estimated and compared the trees; and (iv) conducted a comparison of edge lengths, divergence matrices and substitutional biases. We show that Barry and Hartigan’s (1987a) method provides a useful reference point for choosing appropriate models of substitution, and the means for assessing whether the evolutionary process is reversible; such a method appears to be unavailable in the current literature.

5.1 Hominoid Data

We considered an alignment of 1809 nucleotides from the mitochondrially-encoded NADH dehydrogenase subunit 5 genes of (with abbreviated name and GenBank Accession numbers given in parentheses): Human (Hsap, NC_001807), Chimpanzee (Ptro, NC_001643), Bonobo (Ppan, NC_001644), Gorilla (Ggor, NC_001645), Orangutan (Ppyg, NC_001646), Gibbon (Hlar, NC_002082), and Macaque (Msyl, NC_002764). The three codon sites were separated into different alignments using a program called CODONSPLIT (by IB Jakobsen) before being analysed.

5.1.1. Assessment of phylogenetic assumptions

The alignments of first, second, and third codon sites were examined independently using the matched-pairs tests of symmetry (Bowker 1948), marginal symmetry (Stuart 1955), and internal symmetry (Ababneh et al 2006b). Given that each of these tests involves multiple comparisons of related sequences, it was necessary to interpret the p-values with caution. The matched-pairs tests of homogeneity produced p-values in the range of 0.024 to 1.000 for the first and second codon sites (Tables 1 and 2), and in the range of 0.006 to 0.996 for the third codon sites (Table 3). For the 21 pairwise comparisons, only 1 p-value was observed to be lower than 0.05 for the first and second codon sites whereas approximately one-fourth of the p-values for the third codon site were found to be lower than 0.05. These results are consistent with evolution under stationary and homogeneous conditions for first and second codon sites but not for third codon sites. Interestingly, all the low p-values observed for third codon sites involved comparisons with Orangutan, indicating real differences.
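For readers unfamiliar with the matched-pairs test of symmetry, the statistic of Bowker (1948) can be computed from a 4 × 4 divergence matrix of counts as sketched below in Python. The counts are hypothetical and the function name is an assumption; the p-value, obtained by referring the statistic to a chi-square distribution with the returned degrees of freedom, is not computed here.

```python
def bowker_statistic(counts):
    """Return (statistic, degrees of freedom) for Bowker's test of symmetry."""
    stat, df = 0.0, 0
    for i in range(len(counts)):
        for j in range(i + 1, len(counts)):
            total = counts[i][j] + counts[j][i]
            if total > 0:                      # empty pairs carry no information
                stat += (counts[i][j] - counts[j][i]) ** 2 / total
                df += 1
    return stat, df

# Hypothetical divergence matrix: rows are bases in one sequence,
# columns are bases in the other (order A, C, G, T).
counts = [[50, 4, 6, 2],
          [2, 40, 1, 9],
          [6, 3, 30, 1],
          [2, 5, 1, 45]]
statistic, dof = bowker_statistic(counts)
```

A perfectly symmetric matrix gives a statistic of zero; large values indicate that substitutions in one direction outnumber those in the other.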

Table 1.

Probabilities obtained from matched-pairs tests of symmetry, marginal symmetry and internal symmetry using 1st codon sites from the hominoid data

		Ppan	Ptro	Hsap	Ggor	Ppyg	Hlar
Ptro	Bowker	0.206
	Stuart	0.620
	Ababneh	0.425

Hsap	Bowker	0.217	0.709
	Stuart	0.312	0.867
	Ababneh	0.532	0.883

Ggor	Bowker	0.032	0.219	0.302
	Stuart	0.024	0.227	0.243
	Ababneh	0.769	0.994	0.387

Ppyg	Bowker	0.440	0.579	0.614	0.139
	Stuart	0.092	0.095	0.239	0.078
	Ababneh	1.000	1.000	1.000	0.680

Hlar	Bowker	0.400	0.331	0.262	0.180	0.703
	Stuart	0.517	0.419	0.576	0.106	0.696
	Ababneh	0.268	0.404	0.127	0.688	0.499

Msyl	Bowker	0.592	0.584	0.303	0.233	0.635	0.735
	Stuart	0.327	0.304	0.303	0.056	0.242	0.522
	Ababneh	0.759	0.786	0.313	0.914	0.989	0.913

Table 2.

Probabilities obtained from matched-pairs tests of symmetry, marginal symmetry and internal symmetry using 2nd codon sites from the hominoid data

		Ppan	Ptro	Hsap	Ggor	Ppyg	Hlar
Ptro	Bowker	0.102
	Stuart	0.206
	Ababneh	1.000

Hsap	Bowker	0.197	0.352
	Stuart	0.348	0.826
	Ababneh	1.000	0.754

Ggor	Bowker	0.264	0.323	0.361
	Stuart	0.437	0.706	0.334
	Ababneh	1.000	0.352	0.558

Ppyg	Bowker	0.359	0.446	0.728	0.297
	Stuart	0.154	0.243	0.401	0.088
	Ababneh	0.720	0.653	0.879	0.867

Hlar	Bowker	0.157	0.444	0.126	0.331	0.165
	Stuart	0.297	0.721	0.638	0.513	0.177
	Ababneh	0.231	0.329	0.075	0.327	0.239

Msyl	Bowker	0.710	0.957	0.890	0.605	0.46	0.801
	Stuart	0.881	0.996	0.940	0.940	0.494	0.948
	Ababneh	0.378	0.690	0.592	0.248	0.351	0.440

Table 3.

Probabilities obtained from matched-pairs tests of symmetry, marginal symmetry and internal symmetry using 3rd codon sites from the hominoid data

		Ppan	Ptro	Hsap	Ggor	Ppyg	Hlar
Ptro	Bowker	0.670
	Stuart	0.357
	Ababneh	0.846

Hsap	Bowker	0.517	0.504
	Stuart	0.511	0.452
	Ababneh	0.589	0.443

Ggor	Bowker	0.257	0.767	0.171
	Stuart	0.568	0.947	0.459
	Ababneh	0.349	0.398	0.092

Ppyg	Bowker	0.019	0.028	0.016	0.046
	Stuart	0.016	0.029	0.242	0.011
	Ababneh	0.180	0.160	0.010	0.662

Hlar	Bowker	0.236	0.277	0.743	0.244	0.756
	Stuart	0.083	0.135	0.623	0.093	0.535
	Ababneh	0.715	0.584	0.627	0.678	0.748

Msyl	Bowker	0.372	0.528	0.383	0.158	0.035	0.445
	Stuart	0.151	0.261	0.354	0.386	0.006	0.142
	Ababneh	0.996	0.986	0.567	0.100	0.811	0.948

Table 4.

Tree	Log Likelihood
((((((Ptro,Ppan),Ggor),Hsap),Ppyg),Hlar),Msyl)	−3540.684
((((((Ptro,Ppan),Hsap),Ggor),Ppyg),Hlar),Msyl)	−3545.508
(((((Ptro,Ppan),(Hsap,Ggor)),Ppyg),Hlar),Msyl)	−3554.946

Table 5.

Tree	SH Test	AU Test
((((((Ptro,Ppan),Ggor),Hsap),Ppyg),Hlar),Msyl)	0.811	0.716
((((((Ptro,Ppan),Hsap),Ggor),Ppyg),Hlar),Msyl)	0.428	0.334
(((((Ptro,Ppan),(Hsap,Ggor)),Ppyg),Hlar),Msyl)	0.075	0.026

Table 6.

Edge	Distance using BH	Distance using DNAML	Confidence Interval (DNAML)
Ppyg, Node-2	0.058	0.061	0.046–0.077
Node-2, Node-4	0.028	0.024	0.014–0.035
Node-2, Node-3	0.018	0.020	0.011–0.030
Node-4, Hlar	0.037	0.039	0.027–0.053
Node-4, Msyl	0.108	0.109	0.088–0.129
Node-3, Hsap	0.032	0.029	0.019–0.040
Node-3, Node-5	0.009	0.009	0.003–0.016
Node-5, Ggor	0.043	0.042	0.029–0.055
Node-5, Node-6	0.010	0.009	0.003–0.015
Node-6, Ptro	0.017	0.017	0.009–0.025
Node-6, Ppan	0.016	0.015	0.007–0.022

Table 7.

(a)		A	C	G	T

	A	306	11	18	15
	C	10	279	2	47
	G	20	4	142	2
	T	6	40	2	302

(b)		A	C	G	T

	A	303.7	12.8	21.6	11.9
	C	10.2	270.8	2.1	54.8
	G	21.3	7.5	138.1	1.1
	T	6.8	42.8	2.2	298.2

Table 8.

Edge	Bowker’s Test	Stuart’s Test
Ppyg, Node-2	0.113	0.035
Node-2, Node-4	0.435	0.697
Node-2, Node-3	0.241	0.282
Node-3, Hsap	0.000	0.000
Node-3, Node-5	0.145	0.023
Node-5, Ggor	0.001	0.000
Node-5, Node-6	0.088	0.012
Node-6, Ptro	0.085	0.013
Node-6, Ppan	0.097	0.013
Node-4, Hlar	0.454	0.140
Node-4, Msyl	0.135	0.080

Table 9.

	A	C	G	T
A	325.0	2.0	17.7	3.0
C	2.0	332.4	0.0	18.5
G	2.0	0.0	156.3	0.0
T	0.0	4.6	0.0	342.5

Table 10.

		Apyr	Bsub	Drad	Tthe
Bsub	Bowker	0.000
	Stuart	0.000
	Ababneh	0.295

Drad	Bowker	0.000	0.995
	Stuart	0.000	0.946
	Ababneh	0.754	0.958

Tthe	Bowker	0.509	0.000	0.000
	Stuart	0.731	0.000	0.000
	Ababneh	0.263	0.544	0.863

Tmar	Bowker	0.132	0.000	0.000	0.415
	Stuart	0.325	0.000	0.000	0.267
	Ababneh	0.095	0.417	0.297	0.546

Table 11 (a).

Edge	Distance using BH	Distance using DNAML	Confidence Interval (DNAML)
Bsub, Node-2	0.122	0.127	0.104–0.150
Node-2, Node-3	0.040	0.039	0.024–0.053
Node-3, Tthe	0.060	0.069	0.051–0.087
Node-3, Drad	0.131	0.120	0.098–0.143
Node-2, Node-4	0.036	0.043	0.027–0.058
Node-4, Tmar	0.058	0.061	0.044–0.078
Node-4, Apyr	0.124	0.127	0.104–0.150

Table 12.

Sequence Pair	Tree #1	Tree #2
Bsub-Tmar	3.06	17.94
Bsub-Apyr	7.01	25.37
Bsub-Tthe	1.26	0.91
Bsub-Drad	34.92	3.41
Tmar-Apyr	0.52	0.43
Tthe-Drad	2.10	13.65
Tmar-Drad	3.14	4.59
Apyr-Drad	5.99	6.85
Tmar-Tthe	7.90	1.42
Apyr-Tthe	9.06	1.77

Table 13.

(a) Marginal probabilities at leaf nodes for bacterial data set.
Leaf Node	A	C	G	T

Tthe	0.219	0.278	0.354	0.149
Tmar	0.207	0.279	0.359	0.155
Apyr	0.214	0.287	0.354	0.145
Drad	0.250	0.233	0.321	0.195
Bsub	0.251	0.238	0.319	0.191
(b) Marginal probabilities at internal nodes for tree #1.
Internal Node	A	C	G	T

Node-2	0.216	0.272	0.358	0.154
Node-3	0.218	0.269	0.357	0.156
Node-4	0.210	0.282	0.36	0.148
(c) Marginal probabilities at internal nodes for tree #2.

Internal Node	A	C	G	T

Node-2	0.214	0.275	0.360	0.151
Node-3	0.227	0.257	0.342	0.174
Node-4	0.212	0.281	0.361	0.146

Table 14.

a	A	C	G	T
A	260.2	12.8	37.9	0.0
C	2.1	280.7	6.2	6.0
G	4.7	9.7	378.0	2.6
T	0.5	32.9	21.2	182.3

Table 15.

(a) Observed and estimated divergence matrix values for Bacillus-Aquifex pair
(i)		A	C	G	T	(ii)	A	C	G	T

	A	0.195	0.019	0.034	0.004		0.191	0.018	0.039	0.004
	C	0.005	0.201	0.02	0.012		0.006	0.194	0.024	0.014
	G	0.012	0.030	0.273	0.004		0.014	0.038	0.262	0.005
	T	0.002	0.037	0.027	0.125		0.003	0.037	0.03	0.121
(b) Observed and estimated divergence matrix values for Bacillus-Deinococcus pair
(i)		A	C	G	T	(ii)	A	C	G	T

	A	0.209	0.007	0.023	0.011		0.199	0.012	0.032	0.008
	C	0.006	0.192	0.017	0.023		0.012	0.176	0.019	0.031
	G	0.023	0.015	0.271	0.011		0.032	0.018	0.252	0.017
	T	0.012	0.019	0.011	0.149		0.008	0.027	0.017	0.139

Table 16.

(a) Tree #1. Refer to Figure 2 for an explanation of node numbers
Edge	Bowker’s Test	Stuart’s Test

Bsub, Node-2	0.000	0.000
Node-2, Node-3	0.427	0.835
Node-2, Node-4	0.002	0.005
Node-3, Tthe	0.567	0.390
Node-3, Drad	0.000	0.000
Node-4, Tmar	0.646	0.359
Node-4, Apyr	0.135	0.742
(b) Tree #2. Refer to Figure 3 for an explanation of node numbers
Edge	Bowker’s Test	Stuart’s Test

Tthe, Node-2	0.568	0.607
Node-2, Node-3	0.000	0.000
Node-2, Node-4	0.167	0.264
Node-3, Drad	0.000	0.000
Node-3, Bsub	0.000	0.000
Node-4, Tmar	0.315	0.092
Node-4, Apyr	0.130	0.843

Table 11 (b).

Edge	Distance using BH	Distance using DNAML	Confidence Interval (DNAML)
Tthe, Node-2	0.064	0.068	0.050–0.086
Node-2, Node-3	0.050	0.050	0.033–0.066
Node-3, Bsub	0.106	0.105	0.083–0.126
Node-3, Drad	0.110	0.108	0.087–0.130
Node-2, Node-4	0.046	0.047	0.031–0.063
Node-4, Tmar	0.059	0.063	0.046–0.079
Node-4, Apyr	0.122	0.122	0.099–0.145
