Statistical tests to identify appropriate types of nucleotide sequence recoding in molecular phylogenetics

Victor A Vera-Ruiz; Kwok W Lau; John Robinson; Lars S Jermiin

doi:10.1186/1471-2105-15-S2-S8

. 2014 Jan 31;15(Suppl 2):S8. doi: 10.1186/1471-2105-15-S2-S8

Statistical tests to identify appropriate types of nucleotide sequence recoding in molecular phylogenetics

Victor A Vera-Ruiz ¹, Kwok W Lau ^2,³, John Robinson ¹, Lars S Jermiin ^2,^4,^✉

PMCID: PMC4015178 PMID: 24564837

Abstract

Background

Under a Markov model of evolution, recoding, or lumping, of the four nucleotides into fewer groups may permit analysis under simpler conditions but may unfortunately yield misleading results unless the evolutionary process of the recoded groups remains Markovian. If a Markov process is lumpable, then the evolutionary process of the recoded groups is Markovian.

Results

We consider stationary, reversible, and homogeneous Markov processes on two taxa and compare three tests for lumpability: one using an ad hoc test statistic, which is based on an index that is evaluated using a bootstrap approximation of its distribution; one that is based on a test proposed specifically for Markov chains; and one using a likelihood-ratio test. We show that the likelihood-ratio test is more powerful than the index test, which is more powerful than that based on the Markov chain test statistic. We also show that for stationary processes on binary trees with more than two taxa, the tests can be applied to all pairs. Finally, we show that if the process is lumpable, then estimates obtained under the recoded model agree with estimates obtained under the original model, whereas, if the process is not lumpable, then these estimates can differ substantially. We apply the new likelihood-ratio test for lumpability to two primate data sets, one with a mitochondrial origin and one with a nuclear origin.

Conclusions

Recoding may result in biased phylogenetic estimates because the original evolutionary process is not lumpable. Accordingly, testing for lumpability should be done prior to phylogenetic analysis of recoded data.

Keywords: Phylogeny; Markov model; stationarity; homogeneity,reversibility; recoding; lumping; nucleotides; primates

Introduction

When nucleotides intentionally are recoded to a 3- or 2-state alphabet in order to focus on a subset of the possible types of substitutions (e.g., transversions [1-3]) or reduce compositional heterogeneity [4], it is no longer appropriate to use model-based phylogenetic methods that rely solely on time-reversible, 4-state Markov models. Instead, one needs to use a 3- or 2-state Markov model to approximate the evolutionary processes for the recoded sequence data. This requirement was first realised by Phillips and Penny [5], who used a time-reversible 2-state Markov model [6] to analyse RY-recoded nucleotide sequences, and Gibson et al. [7], who developed a time-reversible 3-state Markov model to analyse Y-recoded nucleotide sequences. Before these studies, other investigators had used RY-recoded nucleotide sequences to infer the evolutionary relationships among mammals [1-3] and among bacteria [4].

Recoding of nucleotides and/or amino acids has been used repeatedly in recent phylogenetic studies [8-31]. However, the mathematical principles underpinning the recoding of nucleotides or amino acids have not yet been adequately examined. For example, it is not yet known whether the Markovian property is maintained after recoding and how this should be tested [32]. Without this knowledge, we may run the risk of using a promising procedure in a manner that turns out to be inappropriate for the data.

In this paper, we take a first step by considering tests for lumpability in a Markov model of evolution for pairs of homologous nucleotide sequences (we are aware of only one paper in the phylogenetic literature where the term lumpability is used [33], but there it was used in a different context). We only consider nucleotides but believe our tests could be generalized to encompass amino acids as well. We then illustrate the performance of our tests for lumpability using simulated and real data, and show that recoding of nucleotides should be used with caution when analysing DNA phylogenetically.

Methods

The theoretical basis for recoding nucleotides

Let $S = \{A, C, G, T\}$ be the set of nucleotides, and let $S^{'}$ be a partition of $S$ such that $S^{'} = {S_{1}, . . ., S_{q}}$ , where q <4. Then $S$ is reduced by grouping, or lumping, some of the original states (i.e., A, C, G, and T) into one or two new states (i.e., R, Y, S, W, M, K, B, D, H, and V )--in molecular phylogenetics, this procedure has been called recoding [34]. Table 1 presents the 13 possible recoding schemes and partitions of $S^{'}$ using notation established by the NC-IUB [35]. The 13 recoding schemes fall into three major grouping categories, as shown in Table 1.

Table 1.

The 13 ways of reducing a 4-letter state space ( $S$ ) to a 3- or 2-letter state space $S^{'}$ .

Nucleotide-grouping Subsets	Recoding notation	Resulting $S^{'}$	Major grouping category
{{A, G}, C, T}	R	{R, C, T}	{2 : 1 : 1}
{A, G, {C, T}}	Y	{A, G, Y}
{A, {C, G}, T}	S	{A, S, T}
{C, G, {A, T}}	W	{C, G, W}
{A, C, {G, T}}	M	{M, G, T}
{C, {A, G, T}}	K	{A, C, K}

{A, {C, G, T}}	B	{A, B}	{3 : 1}
{C, {A, G, T}}	D	{C, D}
{G, {A, C, T}}	H	{G, H}
{T, {A, C, G}}	V	{T, V}

{{A, G}, {C, T}}	RY	{R, Y}	{2 : 2}
{{A, T}, {C, G}}	SW	{S, W}
{{A, C}, {G, T}}	KM	{K, M}

Open in a new tab

The evolutionary process for two homologous nucleotide sequences

Consider two nucleotide sequences, A and B, each with n independently evolving sites, which have diverged under Markovian conditions from their common ancestor on a rooted, 2-tipped tree. Let π₀denote the initial probability vector of the nucleotide frequencies, such that $π_{0}^{T} = (π_{0_{1}}, π_{0_{2}}, π_{0_{3}}, π_{0_{4}})$ , where, for convenience of notation, we will use the subscripts 1, 2, 3, 4 to denote A, G, C, T . Over each edge of this tree, there is a substitution process, X(t) and Y(t), respectively, described by the transition probabilities

P_{i j}^{X} (t) = P [X (t) = j | X (0) = i]

and

P_{i j}^{Y} (t) = P [Y (t) = j | Y (0) = i] .

Let f_ij(t) denote the theoretical joint probability of a site being in state i in A and state j in B at time t:

f_{i j} (t) = P [X (t) = i, Y (t) = j | X (0) = Y (0)] .

(1)

Now, let F(t) = {f_ij(t)} denote the joint probability matrix and let P^X(t) and P^Y(t) denote the transition probability matrices of X(t) and Y(t). In practice, the two matrices cannot be identified from F(t) without some assumptions about the evolutionary processes of the sequences A and B. We assume that the processes are globally stationary, reversible, and homogeneous (SRH) (for definitions, see [36-39], and, in more detail, [40]). Given these assumptions, take P^X(t) = P^Y(t) = P(t) and, if π_Xand π_Ydenote the equilibrium probability distributions of the processes X(t) and Y(t), take π_X= π_Y= π₀= π and write Π = diag(π). Then, from (1), we get F(t) = P(t)^TΠP(t). The transition probability matrix can be expressed by an instantaneous rate matrix R, such that P(t) = e^Rt, where R_ij≥ 0 for i ≠ j, $R_{i i} = - \sum_{i \neq j}^{4} R_{i j},$ and π^TR = 0^T, where π^Tis the equilibrium distribution of R [40]. Furthermore, the instantaneous rate matrix can take the form R = SΠ, where S is a symmetric matrix with s_ij≥ 0 for i ≠ j, and $s_{i i} = - \sum_{i \neq j}^{4} s_{i j} π_{j} / π_{i}$ [40]. The matrices R and P(t) can be written in terms of the eigenvector decomposition of Π^1/2SΠ^1/2 = LΛL^T. In other words

R = Π^{- 1 / 2} L Λ L^{T} Π^{1 / 2}

(2)

and

P (t) = Π^{- 1 / 2} L e^{Λ t} L^{T} Π^{1 / 2},

(3)

where Λ is a diagonal matrix with columns containing the eigenvalues of Π^1/2SΠ^1/2 and L is a matrix with columns containing its right eigenvectors. The joint probability matrix is then symmetric and

F (t) = Π^{1 / 2} L e^{2 Λ t} L^{T} Π^{1 / 2} .

(4)

Note that under these assumptions, there are only nine free parameters to be estimated: six free parameters for the off-diagonal elements of S (define s^T= (s₁₂, s₁₃, s₁₄, s₂₃, s₂₄, s₃₄)) and three free parameters for π (because $π_{4} = 1 - \sum_{i = 1}^{3} π_{i}$ ). The time t can be fixed at 1 since modifying it is equivalent to modifying the s-parameters.

Let N denote the 4 × 4 divergence matrix for A and B, such that N = {n_ij}, where n_ijrepresents the number of homologous sites that are in state i in A and state j in B. Under the model, the vector of elements of N has a multinomial distribution with parameters n and F(t); its expected value is thus E(N) = nF(t). Because the parameters s and π are in a one-to-one relation with the elements of F, the maximum-likelihood estimates of s and π can be obtained from the eigenvector decomposition of $\hat{F} (t) = \frac{1}{2 n} (N + N^{T})$ , then

\hat{Π} = d i a g (\hat{F} (1) 1)

(5)

and

\hat{S} = {\hat{Π}}^{- 1 / 2} \hat{L} \hat{Λ} {\hat{L}}^{T} {\hat{Π}}^{- 1 / 2},

(6)

where $\hat{L}$ and $\hat{Λ}$ are obtained from ${\hat{Π}}^{- 1 /) 2} \hat{F} (1) {\hat{Π}}^{- 1 /) 2} = \hat{L} e^{2 \hat{Λ}} {\hat{L}}^{T}$ .

Lumpable Markov chains

The following probabilities can be defined for any given $S^{'} = {S_{1}, \cdot \cdot \cdot, S_{q}}$ , where q <4, such that a lumped process, $X^{'} (t)$ , with a smaller number of states, is generated with transition probabilities

\begin{gathered} {P^{'}}_{k l} (t) = P [X^{'} (t) = l | X^{'} (0) = k] \\ = P [X (t) \in S_{l} | X (0) \in S_{k}], \end{gathered}

(7)

and initial probabilities $π^{'} = P [X^{'} (0) = k] = P [X (0) \in S_{k}]$ .

By definition [41,42], a Markov process is lumpable if, for every starting vector π, the lumped process, defined in (7), is a Markov chain whose transition probabilities do not depend on the choice of π. A necessary and sufficient condition for X'(t) to be lumpable with respect to a partition $S^{'}$ is that for every pair of subsets, $S_{k}$ and $S_{l}$ , $\sum_{j \in S_{l}} P_{i j} (t)$ , has the same value for every state i in $S_{k}$ [41,42]. Accordingly, if X'(t) is lumpable, then the transition probabilities for X'(t) for any given pair of subsets in $S^{'}$ are

P_{k l}^{'} (t) = \sum_{j \in S_{l}} P_{i j} (t), for any i \in S_{k} .

If the Markov chain is lumpable, the lumped transition matrix P'(t) can be expressed as a matrix function of P(t) as follows:

P^{'} (t) = U P (t) V,

where V is a 4 × q matrix, where q is the number of states in the lumped process, such that the l-th column of V is a vector with 1's in the components corresponding to states in S_land 0's otherwise, and

U = {(V^{T} Π V)}^{- 1} V^{T} Π,

is a q × 4 matrix whose k-th row is a probability vector with non-zero elements corresponding to the states in $S_{k}$ . A useful necessary and sufficient condition for lumpability [41,42] is

V U P (t) V = P (t) V .

(8)

In the case of nucleotides, the second column of Table 2 gives the conditions required for lumpability of four nucleotides.

Table 2.

Conditions required for a 4-state Markovian process to be lumpable (in terms of s and π), and transformations to obtain $\tilde{π}$ and $\tilde{s}$ such that the lumpability holds.

$S^{'}$	Lumpability conditions	( $\tilde{π}$ , $\tilde{s}$ )
{{A, G}, C, T}	s₁₂= s₂₃ s₁₄= s₃₄	${\tilde{s}}_{13} = {\tilde{s}}_{23} = (ŝ_{13} + ŝ_{23}) / 2$ ${\tilde{s}}_{14} = {\tilde{s}}_{34} = (ŝ_{14} + ŝ_{34}) / 2$

{A, G, {C, T}}	s₁₂= s₁₄ s₂₃= s₃₄	${\tilde{s}}_{12} = {\tilde{s}}_{14} = (ŝ_{12} + ŝ_{14}) / 2$ ${\tilde{s}}_{23} = {\tilde{s}}_{34} = (ŝ_{23} + ŝ_{34}) / 2$

{A, {C, G}, T}	s₁₂= s₁₃ s₂₄= s₃₄	${\tilde{s}}_{12} = {\tilde{s}}_{13} = (ŝ_{12} + ŝ_{13}) / 2$ ${\tilde{s}}_{24} = {\tilde{s}}_{34} = (ŝ_{24} + ŝ_{34}) / 2$

{C, G, {A, T}}	s₁₂= s₂₄ s₁₃= s₃₄	${\tilde{s}}_{12} = {\tilde{s}}_{24} = (ŝ_{12} + ŝ_{24}) / 2$ ${\tilde{s}}_{13} = {\tilde{s}}_{34} = (ŝ_{13} + ŝ_{34}) / 2$

{{A, C}, G, T}	s₁₃= s₂₃ s₁₄= s₂₄	${\tilde{s}}_{13} = {\tilde{s}}_{23} = (ŝ_{13} + ŝ_{23}) / 2$ ${\tilde{s}}_{14} = {\tilde{s}}_{24} = (ŝ_{14} + ŝ_{24}) / 2$

{A, C, {G, T}}	s₁₃= s₁₄ s₁₂= s₁₃	${\tilde{s}}_{13} = {\tilde{s}}_{14} = (ŝ_{13} + ŝ_{14}) / 2$ ${\tilde{s}}_{23} = {\tilde{s}}_{24} = (ŝ_{23} + ŝ_{24}) / 2$

{A, {C, G, T}}	s₁₂= s₁₃ s₁₃= s₁₄	${\tilde{s}}_{12} = {\tilde{s}}_{13} = {\tilde{s}}_{14} = (ŝ_{12} + ŝ_{13} + ŝ_{14}) / 3$

{C, {A, G, T}}	s₁₂= s₂₃ s₂₃= s₂₄	${\tilde{s}}_{12} = {\tilde{s}}_{23} = {\tilde{s}}_{24} = (ŝ_{12} + ŝ_{23} + ŝ_{24}) / 3$

{G, {A, C, T}}	s₁₃= s₂₃ s₂₃= s₃₄	${\tilde{s}}_{13} = {\tilde{s}}_{23} = {\tilde{s}}_{34} = (ŝ_{13} + ŝ_{23} + ŝ_{34}) / 3$

{T, {A, C, G}}	s₁₄= s₂₄ s₂₄= s₃₄	${\tilde{s}}_{14} = {\tilde{s}}_{24} = {\tilde{s}}_{34} = (ŝ_{14} + ŝ_{24} + ŝ_{34}) / 3$

{{A, G}, {C, T}}	s₁₂π₂+ s₁₄π₄= s₂₃π₂+ s₃₄π₄ s₁₂π₁+ s₂₃π₃= s₁₄π₁+ s₃₄π₃	${\tilde{s}}_{23} = \frac{{\tilde{s}}_{12} ({\hat{π}}_{2} {\hat{π}}_{3} - {\hat{π}}_{1} {\hat{π}}_{4}) + {\tilde{s}}_{14} {\hat{π}}_{4} ({\hat{π}}_{1} + {\hat{π}}_{3})}{{\hat{π}}_{3} ({\hat{π}}_{2} + {\hat{π}}_{4})}$ ${\tilde{s}}_{34} = \frac{{\tilde{s}}_{12} {\hat{π}}_{1} + {\tilde{s}}_{23} {\hat{π}}_{3} - {\tilde{s}}_{14} {\hat{π}}_{1}}{{\hat{π}}_{3}}$

{{A, T}, {C, G}}	s₁₂π₂+ s₁₃π₃= s₂₄π₂+ s₃₄π₃ s₁₂π₁+ s₂₄π₄= s₁₃π₁+ s₃₄π₄	${\tilde{s}}_{13} = \frac{{\tilde{s}}_{12} ({\hat{π}}_{1} {\hat{π}}_{3} - {\hat{π}}_{2} {\hat{π}}_{4}) + {\tilde{s}}_{24} {\hat{π}}_{4} ({\hat{π}}_{2} + {\hat{π}}_{3})}{{\hat{π}}_{3} ({\hat{π}}_{1} + {\hat{π}}_{4})}$ ${\tilde{s}}_{34} = \frac{{\tilde{s}}_{12} {\hat{π}}_{2} + {\tilde{s}}_{13} {\hat{π}}_{3} - {\tilde{s}}_{24} {\hat{π}}_{2}}{{\hat{π}}_{3}}$

{{A, C}, {G, T}}	s₁₃π₃+ s₁₄π₄= s₂₃π₃+ s₂₄π₄ s₁₃π₁+ s₂₃π₂= s₁₄π₁+ s₂₄π₄	${\tilde{s}}_{23} = \frac{{\tilde{s}}_{13} ({\hat{π}}_{2} {\hat{π}}_{3} - {\hat{π}}_{1} {\hat{π}}_{4}) + {\tilde{s}}_{14} {\hat{π}}_{4} ({\hat{π}}_{1} + {\hat{π}}_{2})}{{\hat{π}}_{2} ({\hat{π}}_{3} + {\hat{π}}_{4})}$ ${\tilde{s}}_{24} = \frac{{\tilde{s}}_{13} {\hat{π}}_{1} + {\tilde{s}}_{23} {\hat{π}}_{2} - {\tilde{s}}_{14} {\hat{π}}_{1}}{{\hat{π}}_{2}}$

Open in a new tab

We note that under certain conditions, such as those considered by the JC [43] and F81 [44] models, all recodings are lumpable. Conditions under which recoding of nucleotides are possible for the K2P [45] and HKY [46] models are given in Table 3.2 of [32].

Tests for lumpability

We consider three possible tests: An ad hoc test based on a parametric bootstrap for an index of departure from the lumpability condition [32]; a test based on a test for lumpability in Markov chains [47]; and a likelihood-ratio test.

Index test

From (8), if a Markov process is lumpable, then

M = V U P (t) V - P (t) V

should have all elements zero. Consider the index proposed in [32]:

η = {(\sum_{i, j} m_{i j}^{2})}^{1 / 2}, where m_{i j} = M .

(9)

It is clear that η ≥ 0, with η being 0 only under lumpable Markovian processes. Then, the hypothesis that the Markov process is lumpable is equivalent to the hypothesis H₀: η = 0. From the observed divergence matrix, N of two homologous sequences, assuming a SRH Markovian model of evolution, an estimate $\hat{η}$ can be used as a test statistic for H₀, where

\hat{η} = {(\sum_{i, j} {\hat{m}}_{i j}^{2})}^{1 / 2},

and $\hat{M} = V U \hat{P} (1) V - \hat{P} (1) V$ for $\hat{P} (1) = exp (\hat{S} \hat{Π})$ .

The distribution of $\hat{η}$ is unknown, so we propose an approximation to it that is based on the parametric bootstrap. The estimated vectors $\hat{π}$ and $ŝ$ do not necessarily satisfy the conditions for lumpability, so we obtain $\tilde{π}$ and $\tilde{s}$ using the relevant equations from the third column of Table 2 as estimates that do satisfy the lumpability condition. Once the $\tilde{π}$ and $\tilde{s}$ vectors are calculated, a procedure similar to that shown in (2), (3) and (4) is carried out such that the matrices $\tilde{R}$ , $\tilde{P}$ (1), and $\tilde{F} (1)$ are generated under the lumpability conditions. Now B matrices can be generated by simulation under conditions of lumpability, where we take $N_{b}^{*}$ , with b ∈ {1, ⋯, B}, to be independent and multinomial with parameters n and $\tilde{F} (1)$ . From each of these simulated samples, we calculate $F_{b}^{*} (1) = (N_{b}^{*} + N_{b}^{* T}) /2$ , $Π_{b}^{*}$ and $S_{b}^{*}$ from $F_{b}^{*}$ , as in (5) and (6), and then $P_{b}^{*} (1) = exp (S_{b}^{*} Π_{b}^{*}), M_{b}^{*} = {V U P}_{b}^{*} (1) V - P_{b}^{*} (1) V$ , and

η_{b}^{*} = {(\sum_{i, j} {\hat{m}}_{b i j}^{{_{*}}^{_{2}}})}^{1 / 2} .

The true P-value is then the probability that we obtain a value as large as or larger than the observed $\hat{η}$ , so a bootstrap approximation to this P-value is the proportion of $η_{i}^{*}, . . ., η_{B}^{*}$ exceeding $\hat{η}$ .

Markov chain test

A χ² test to determine whether a Markov chain is lumpable with respect to a partition $S^{'}$ is available [47]. The test is based on the comparison of observed transition frequencies to their respective theoretical counterparts under the null hypothesis that the chain is lumpable. The approach does not make any assumption about reversibility or stationarity of the process. The authors used a matrix of transition counts, {n_ij}, to estimate the transition probabilities p_ij, where n_ijrepresents the number of transitions into state j from state i in one step, so the number of steps in the Markov chain is n_••, where the subscript • indicates summation. Now, if we start from our divergence matrix N, where n_ijrepresents the number of sites that are in state i for sequence A and state j in sequence B, and the SRH assumptions are kept, either A or B can be assumed to be the original sequence at time 0, whereas the other one can be assumed to be the observed sequence at time 2 (since we took the edge lengths to be 1). Take A as the ancestral sequence, then the divergence matrix N has the same properties as a transition count matrix, and we can proceed as described in [47]. A transition probability from i to $S_{l}$ is

g_{i l} = \sum_{j \in S_{l}} P_{i j} (2),

where l = 1, ..., q. From the definition of a lumpable process (7), if the Markovian process is lumpable with respect to $S^{'}$ , then

g_{i l} = \sum_{i \in S_{k}} \sum_{j \in S_{l}} P_{i j} (2) / γ_{k},

where γ_kis the number of states that are part of the subset $S_{k}$ , and k = 1, ..., q. Therefore, $g_{i l} = P_{k l}^{'} (2)$ if the process is lumpable and the null hypothesis of lumpability can be expressed as H₀: $g_{i l} = P_{k l}^{'} (2)$ for all i ∈ $S_{k}$ . Given the divergence matrix, N, estimates g_iland $P_{k l}^{'} (2)$ are

ĝ_{i l} = \sum_{j \in S_{l}} \frac{n_{i j}}{n_{i ∙}}

and

{\hat{P}}_{k l}^{'} (2) = \frac{n_{k l}^{'}}{n_{k ∙}^{'}} = \frac{\sum_{i \in S_{k}} \sum_{j \in S_{l}} n_{i j}}{\sum_{i \in S_{k}} n_{i ∙}},

where $N^{'} = {n_{k l}^{'}}$ is the divergence matrix of the recoded nucleotide sequences. Jernigan and Baran [47] obtained the test statistic

T = \sum_{i = 1}^{4} \sum_{l = 1}^{q} \frac{{(o_{i l} - e_{i l})}^{2}}{e_{i l}},

where

o_{i l} = \sum_{j \in S_{l}}^{} n_{i j}

and

e_{i l} = \frac{n_{i ∙} n_{k l}^{'}}{n_{k ∙}^{'}},

and showed (by pointing out that o_il− e_ilare a stack of q tables of size 4 × γ_lof mean-corrected multinomials with row and column sums equal to zero) that the test statistic is distributed under H₀as a χ² variable with $(q - 1) \sum_{k = 1}^{q} (γ_{k} - 1)$ degrees of freedom, if all cells are non-zero. In the case considered here, the degrees of freedom for any of the recoding schemes is 2.

Likelihood-ratio test

Consider estimates $(\hat{π}, ŝ)$ whose values maximize a log-likelihood function

L (π, s) = \sum_{i, j} n_{i j} ln f_{i j} (π, s),

where {n_ij} = N, the observed divergence matrix, F_(π,s)(1) = exp(SΠ) and {f_ij(π, s)} = F_(π,s)(1). These matrices are obtained as shown in (5) and (6). We also want to estimate (π, s) under the constraints imposed by the null hypothesis of lumpablity, H₀. The constraints are given in the second column of Table 2. Then we can define the constrained estimates ( $\tilde{π}$ , $\tilde{s}$ ) to satisfy

L (\tilde{π}, \tilde{s}) = max_{π, s \in H_{0}} L (π, s) .

This maximization needs a new approach. We construct an orthogonal matrix, A, such that

A s = y,

where y is the response constraint vector, defined such that two values of y are zero corresponding to the two constraints. The matrix A will, in the case of partitions into two groups of two, contain π, so to emphasise this possible dependence, write y = g(s|π). Also write s = A⁻¹y = g⁻¹(y|π). Then

L (\tilde{π}, \tilde{s}) = max_{π, y} L (π, g^{- 1} (y | π)) .

The optimization process is done in two steps: the values of s, if dependent of π, are optimized given the original π set, then the π vector is optimized given the optimized values of s. This process is repeated until convergence is achieved.

From these two log-likelihood values, a log likelihood-ratio, LR, can be calculated with

L R = L (\hat{π}, ŝ) - L (\tilde{π}, \tilde{s})

Under the null hypothesis of lumpability, 2 × LR is distributed as a χ² variable with 2 degrees of freedom.

Results

Assessment of accuracy

In order to check the accuracy of the tests under the null hypothesis, Monte Carlo simulations were done from a set of parameters that meets the assumption of lumpability. The parameter vectors in this case were

π^{T} = (0.1, 0.2, 0.3, 0.4)

and

s^{T} = (0.2, 0.25, 0.2, 0.2, 0.15, 0.2) .

The joint probability distribution was calculated by the steps given in (2), (3), and (4); then, assuming a nucleotide sequence of length n = 1500, 5000 divergence matrices were calculated by Monte Carlo simulations assuming that N_iis multinomial with parameters (n, F(1)) for i = 1, . . . , 5000.

The accuracy of each test for lumpability was verified using a PP plot displaying the distribution of observed P-values, obtained from each test, plotted against the expected P-values, obtained from the uniform distribution. The linear relationship between these two sets of P-values (Figure 1) confirms the accuracy of the tests.

**PP-plot for the three tests with respect to $S^{'} = \{R, Y\}$** . The observed P-values were calculated from 5000 Monte Carlo experiments for each test (i.e., the Index test (a), the Markov chain test (b), and the likelihood-ratio test (c)) and charted against the expected P-values.

Comparisons of power

The power of each test was compared for each recoding scheme under non-lumpable conditions. To do this, we used π^T= (0.1, 0.2, 0.3, 0.4) and values of s that yield increasing values of η, as in (9), generated 3000 divergence matrices using Monte Carlo simulation, and then calculated the three test statistics and their corresponding P-values, using the procedures explained above, for each value of η. The power at the 5% level, is then equal to the proportion of observed P-values less than 0.05.

Figure 2 shows the power curves for RY recoding--similar power curves were obtained for the other 12 recoding schemes (results not shown). All of these results indicate that the likelihood-ratio test is the most powerful of the tests considered, followed by the Index test and, finally, by the Markov Chain test.

**Power curve for partition $S^{'} = \{R, Y\}$** . A total of 3000 divergence matrices were generated by Monte Carlo simulation for each triplet of points, corresponding to the three test statistics, and 500 parametric bootstraps were used during the calculation of P-values for the η index. The same π-vector (i.e., *π^T*= (0.1, 0.2, 0.3, 0.4)) was used for every triplet of points whereas the values in the s-vector were allowed to vary slightly for each point.

Cases with more than 2 homologous sequences

For general cases involving more than 2 homologous sequences, we can test for lumpability in all pairs of sequences by using the methods described above under the assumption that the evolutionary process is SRH over the whole tree (i.e., the process is globally SRH). For example, in the case of an alignment with seven sequences, there will be 21 P-values. A PP-plot with these P-values should yield a straight line when the data are lumpable, and deviations from this expectation when the processes are not lumpable. However, the observed P-values are not independent, so we need to show that this condition is not cause for concern for the dots in a PP-plot to be on a straight line. We give a simplified argument taken from [48]. Consider a set of observed P-values P₁, . . ., P_n, which, if all the null hypotheses are true, are identically distributed as uniform random variables on (0, 1). Let p be any value between 0 and 1, let I(P_j< p) take the value 1 if P_j< p and 0 otherwise, and let $N_{p} = \sum_{j = 1}^{n} I (P_{j} < p)$ be the number of observed P-values less than p. Then E(I(P_j< p)) = P (P_j< p) = p and so the expected number of P-values less than p is E(N_p) = np. This implies that the plot of the observed P-values will lie approximately on a straight line. The dependence will cause some clustering of the observed P-values but the PP-plot will remain useful in indicating whether there is evidence against some of the hypotheses.

The PP-plots shown in Figure 3 were obtained from alignments of nucleotides generated under lumpable or non-lumpable conditions, with respect to RY recoding, on the tree shown in Figure 4 before being analysed using the likelihood-ratio test. From these two plots, it is clear that the test is able to identify cases where sequences have evolved under non-lumpable conditions.

**PP-plots for lumpability tests for simulated data**. PP-plots for the likelihood-ratio test, with respect to RY recoding, for the randomly-generated lumpable 7-taxon data (a) and the randomly-generated non-lumpable 7-taxon data (b). Sequences comprising 2000 sites were generated on the tree shown in Figure 4 under time-reversible conditions with sites evolving under independent and identical conditions ((a): *π^T*= (0.1, 0.2, 0.3, 0.4), and *s^T*= (0.20, 0.25, 0.20, 0.20, 0.15, 0.20); (b): *π^T*= (0.1, 0.2, 0.3, 0.4), and *s^T*= (0.50, 0.25, 0.20, 0.20, 0.15, 0.20)).

**Tree used to generate simulated data**. Tree used to generate the simulated data, which were then analysed to obtain the results shown in Figure 3. The scale bar corresponds to 1 time unit.

The effect of non-lumpability on phylogenetic estimates

If a process is lumpable with respect to a given recoding scheme (e.g., RY), then we can obtain F_q(t) = V^TP(t)V and from this, using (2, 3) and (4), we can further obtain the transition matrix of the process X'(t),

P_{q} (t) = Π_{q}^{- 1 / 2} L_{q} {e^{Λ_{q}}}^{t} L_{q}^{T} Π_{q}^{1 / 2},

where Π_q= V^TΠV . If the process is not lumpable with respect to that recoding scheme, then X'(t) is not a Markov process and, although we can calculate P_q(t), it is not the transition matrix of the process X'(t).

In either case, the matrix P'(t) = UP(t)V can be defined and, if the process X(t) is lumpable, then P'(t) = P_q(t). On the other hand, if X(t) is not lumpable, the elements of P'(t) are still given as $P_{k l}^{'} (t) = P (X^{'} (t) = l | X^{'} (0) = k)$ , but P'(t) is no longer a transition matrix of X'(t). Conveniently, we can compare P'(1), the true conditional probability matrix at t = 1, with P_q(1), the false transition matrix at t = 1, thus allowing us to examine the effect of non-lumpability.

Figure 5 illustrates the effect on phylogenetic estimates. Figure 5a shows the tree used to simulate alignments of 3000 nucleotides under a time-reversible 4-state Markov model with π^T= (0.2, 0.3, 0.2, 0.3) and s^T= (0.2, 0.1, 0.3, 0.3, 1.0, 0.2). As can be seen from the π- and s-vectors, the lumpable condition is met for RY-recoding but not for KM-recoding. Figures 5b and 5c display the corresponding tree with the edge lengths adjusted according to, respectively, the RY- and KM-recoding schemes.

**Effect of recoding on phylogenetic estimates**. Panel (a) displays the tree that was used to generate alignments of 3000 nucleotides under a time-reversible 4-state Markov model. Panel (b) shows the corresponding tree with edge lengths adjusted according to the RY-recoding scheme. Panel (c) shows the corresponding tree with edge lengths adjusted according to the KM-recoding scheme. Panels (d), (e), and (f) present the corresponding results obtained by analysis of the data generated on the tree in panel (a). The scale bar corresponds to the expected number of substitutions $(i . e ., - \sum_{i = 1}^{4} π_{i} r_{i i})$ .

Every edge in the tree obtained from the RY-recoded data is shorter than the corresponding edge in the tree obtained from the original data. However, because the original process was lumpable with respect to RY-recoding, the relative length of each edge in the two trees is the same, the difference being equal to a scale factor

ρ = \frac{\sum_{i = 1}^{4} π_{i} r_{i i}}{\sum_{j = 1}^{2} π_{2 j} r_{2 j j}} .

Every edge in the tree obtained from the KM-recoded data is also shorter than the corresponding edge in the tree obtained from the original data, but the relative length of each edge in the two trees differ, the reason being that the process generating the original data was not lumpable with respect to KM-recoding.

Figures 5d, 5e, and 5f show the corresponding results for the data generated by Monte Carlo simulation. The three trees display the same characteristics as those shown in Figures 5a, 5b, and 5c, while also showing some variation in the edge lengths that is due to the random nature of the data and the finite sample size. Hence, although recoding of nucleotides might be useful for a variety of reasons, using recoded data, without having tested for lumpability first, might lead to biased phylogenetic estimates.

Example 1 -- Primate mitochondrial DNA

In a previous study [37], a set of mitochondrial nucleotide sequences of hominoid origin were found to to fit the GTR model [49], implying that the data are consistent with evolution under globally SRH conditions. We applied the likelihood-ratio test to these data. Figure 6 shows the PP-plots from tests for lumpability for RY recoding, indicating non-lumpability, and AGY recoding, indicating lumpability.

**PP-plots for lumpability tests for hominoid mitochondrial data**. PP-plots for the likelihood-ratio tests for lumpability for hominoid data for RY-recoding (a) and for *AGY*-recoding (b), respectively.

Example 2 -- Primate nuclear DNA

In a previous study [50], a ~9.3kb fragment of X chromosomal DNA was obtained from 26 species of primates and analysed phylogenetically using the HKY model. In so doing, the authors implicitly assumed evolution under globally SRH conditions. We wanted to apply the likelihood-ratio test to these data, so we obtained the same 26 sequences from GenBank [51], aligned them using MAFFT [52] (with the linsi option invoked), and, using SeaView [53], removed all columns with gaps and/or ambiguous characters. The resulting alignment contained 6913 sites from the 26 species. We then applied the matched-pairs test of symmetry [38] to the data to determine whether the sequences were consistent with evolution under globally SRH conditions. The PP-plot in Figure 7 clearly shows that the data are consistent with evolution under these conditions. Hence, it is appropriate to use our likelihood-ratio test to determine whether any of the recoding schemes would retain the Markovian properties of the original data. Figure 8 presents the PP-plots from the likelihood-ratio test for lumpability for RY-recoding, showing strong evidence against lumpability, and for the SW-recoding, which provided the least evidence against lumpability. It is evident that no recoding should be applied to these data.

**PP-plot for matched-pairs test of symmetry for primate nuclear data**. PP-plot demonstrating consistency with evolution under globally SRH conditions for the 4-state process.

**PP-plots for tests of lumpability for primate nuclear data**. PP-plots for lumpability for RY-recoding (a) and SW-recoding (b), respectively.

Conclusions

Bias in estimates of phylogenetic parameters can occur when recoding of nucleotides or amino acids is used to transform data associated with models of evolution, which are not lumpable with respect to the recoding scheme used. A test proposed in this paper, which is based on a likelihood-ratio test, can yield an indication of whether the same results for estimable parameters can be expected from fitting a given model of evolution and its recoded version to the data.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

LSJ and JR conceived the project. KWL carried out the pilot study, showing that the Index test was feasible. VAVR and JR developed the tests and wrote the manuscript with input from LSJ and KWL. VAVR wrote the codes and generated the numerical results. All authors have read and approved the manuscript.

Contributor Information

Victor A Vera-Ruiz, Email: V.VeraRuiz@maths.usyd.edu.au.

Kwok W Lau, Email: Rex.Lau@csiro.au.

John Robinson, Email: John.Robinson@sydney.edu.au.

Lars S Jermiin, Email: Lars.Jermiin@csiro.au.

Acknowledgements

VAVR was supported by the University of Sydney World Scholars scheme. LSJ, KWL, and JR were supported by a Discovery Grant (DP0453173) from the Australian Research Council. We thank DL Lovell and TKF Wong for their constructive suggestions to this manuscript.

Declarations

The publication costs for this article were funded by resources made available to LSJ by CSIRO and JR by the School of Mathematics and Statistics, University of Sydney.

This article has been published as part of BMC Bioinformatics Volume 15 Supplement 2, 2014: Selected articles from the Twelfth Asia Pacific Bioinformatics Conference (APBC 2014): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S2.

References

Irwin DM, Kocher TD, Wilson AC. Evolution of the cytochrome b gene in mammals. Journal of Molecular Evolution. 1991;32:128–144. doi: 10.1007/BF02515385. [DOI] [PubMed] [Google Scholar]
Adkins RM, Honeycutt RL. Molecular phylogeny of the superorder Arconta. Proceedings of the National Academy of Science of the United States of America. 1991;88:10317–10321. doi: 10.1073/pnas.88.22.10317. [DOI] [PMC free article] [PubMed] [Google Scholar]
Adkins RM, Honeycutt RL. Evolution of the primate cytochrome c oxidase subunit II gene. Journal of Molecular Evolution. 1994;38:215–231. doi: 10.1007/BF00176084. [DOI] [PubMed] [Google Scholar]
Woese CR, Achenbach L, Rouviere P, Mandelco L. Archaeal phylogeny: reexamination of the phylogenetic position of Archaeoglobus fulgidus in light of certain composition-induced artifacts. Systematic and Applied Microbiology. 1991;14:364–371. doi: 10.1016/S0723-2020(11)80311-5. [DOI] [PubMed] [Google Scholar]
Phillips MJ, Penny D. The root of the mammalian tree inferred from whole mithocondrial genomes. Molecular Phylogenetics and Evolution. 2003;28:171–185. doi: 10.1016/S1055-7903(03)00057-5. [DOI] [PubMed] [Google Scholar]
Cavender JA, Felsenstein J. Invariants of phylogenies in a simple case with discrete states. Journal of Classification. 1987;4:57–71. doi: 10.1007/BF01890075. [DOI] [Google Scholar]
Gibson A, Gowri-Shankar V, Higgs PG, Rattray M. A comprehensive analysis of mammalian mithochondrial genome base composition and improved phylogenetic methods. Molecular Biology and Evolution. 2005;22:251–264. doi: 10.1093/molbev/msi012. [DOI] [PubMed] [Google Scholar]
Millen RS, Olmstead RG, Adams KL, Palmer JD, Lao NT, Heggie L, Kavanagh TA, Hibberd JM, Gray JC, Morden CW, Calie PJ, Jermiin LS, Wolfe KH. Many parallel losses of infA from chloroplast DNA during angiosperm evolution with multiple independent transfers to the nucleus. The Plant Cell. 2001;13:645–658. doi: 10.1105/tpc.13.3.645. [DOI] [PMC free article] [PubMed] [Google Scholar]
Phillips MJ, Lin YH, Harrison GL, Penny D. Mitochondrial genomes of a bandicoot and a brushtail possum confirm the monophyly of australidelphian marsupials. Proceedings of the Royal Society London Series B. 2001;268:1533–1538. doi: 10.1098/rspb.2001.1677. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kosiol C, Goldman N, Buttimore NH. A new criterion and method for amino acid classification. Journal of Theoretical Biology. 2004;228:97–106. doi: 10.1016/j.jtbi.2003.12.010. [DOI] [PubMed] [Google Scholar]
Kosiol C. PhD thesis. University of Cambridge; 2006. Markov models for protein sequence evolution. [Google Scholar]
Phillips MJ, Delsuc F, Penny D. Genome-scale phylogeny and the detection of systematic biases. Molecular Biology and Evolution. 2004;21:1455–1458. doi: 10.1093/molbev/msh137. [DOI] [PubMed] [Google Scholar]
Ho JWK, Adams CE, Lew JB, Matthews TJ, Ng CC, Shahabi-Sirjani A, Tan LH, Zhao Y, Easteal S, Wilson SR, Jermiin LS. SeqVis: Visualization of compositional heterogeneity in large alignments of nucleotides. Bioinformatics. 2006;221:2162–2163. doi: 10.1093/bioinformatics/btl283. [DOI] [PubMed] [Google Scholar]
Susko E, Roger AJ. On reduced amino acid alphabets for phylogenetic inference. Molecular Biology and Evolution. 2007;24:2139–2150. doi: 10.1093/molbev/msm144. [DOI] [PubMed] [Google Scholar]
Anisimova M, Kosiol C. Investigating protein-coding sequence evolution with probabilistic codon substitution models. Molecular Biology and Evolution. 2004;26:255–271. doi: 10.1093/molbev/msn232. [DOI] [PubMed] [Google Scholar]
Masta SE, Longhorn SJ, Boore JL. Arachnid relationships based on mitochondrial genomes: asymmetric nucleotide and amino acid bias affects phylogenetic analyses. Molecular Phylogenetics and Evolution. 2009;50:117–128. doi: 10.1016/j.ympev.2008.10.010. [DOI] [PubMed] [Google Scholar]
Phillips MJ, Gibb GC, Crimp EA, Penny D. Tinamous and moa flock together: mitochondrial genome sequence analysis reveals independent losses of flight among ratites. Systematic Biology. 2010;59:90–107. doi: 10.1093/sysbio/syp079. [DOI] [PubMed] [Google Scholar]
Regier JC, Shultz JW, Zwick A, Hussey A, Ball B, Wetzer R, Martin JW, Cunningham CW. Arthropod relationships revealed by phylogenomic analysis of nuclear protein-coding sequences. Nature. 2010;463:1079–1083. doi: 10.1038/nature08742. [DOI] [PubMed] [Google Scholar]
Criscuolo A, Gribaldo S. Large-scale phylogenomic analyses indicate a deep origin of primary plastids within Cyanobacteria. Molecular Biology and Evolution. 2011;28:3019–3032. doi: 10.1093/molbev/msr108. [DOI] [PubMed] [Google Scholar]
Regier JC, Zwick A. Sources of signal in 62 protein-coding nuclear genes for higher-level phylogenetics of arthropods. PLoS ONE. 2011;6:e23408. doi: 10.1371/journal.pone.0023408. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cho S, Zwick A, Regier JC, Mitter C, Cummings MP, Yao J, Du Z, Zhao H, Kawahara AY, Weller S, Davis DR, Baixeras J, Brown JW, Parr C. Can deliberately incomplete gene sample augmentation improve a phylogeny estimate for the advanced moths and butterflies (Hexapoda: Lepidoptera)? Systematic Biology. 2011;60:782–796. doi: 10.1093/sysbio/syr079. [DOI] [PMC free article] [PubMed] [Google Scholar]
White NE, Phillips MJ, Gilbert MTP, Alfaro-Nunez A, Willerslev E, Mawson PR, Spencer PBS, Bunce M. The evolutionary history of cockatoos (Aves: Psittaciformes: Cacatuidae) Molecular Phylogenetics and Evolution. 2011;59:615–622. doi: 10.1016/j.ympev.2011.03.011. [DOI] [PubMed] [Google Scholar]
Zwick A, Regier JC, Cummings MP, Mitter C. Increased gene sampling yields robust support for higher-level clades within Bombycoidea (Lepidoptera) Systematic Entomology. 2011;36:31–43. doi: 10.1111/j.1365-3113.2010.00543.x. [DOI] [Google Scholar]
Niehuis O, Hartig G, Grath S, Pohl H, Lehmann J, Tafer H, Donath A, Krauss V, Eisenhardt C, Hertel J, Petersen M, Mayer C, Meusemann K, Peters RS, Stadler PF, Beutel RG, Bornberg-Bauer E, McKenna DD, Misof B. Genomic and morphological evidence converge to resolve the enigma of Strepsiptera. Current Biology. 2012;22:1309–1313. doi: 10.1016/j.cub.2012.05.018. [DOI] [PubMed] [Google Scholar]
Regier JC, Brown JW, Mitter C, Baixeras J, Cho S, Cummings MP, Zwick A. A molecular phylogeny for the leaf-roller moths (Lepidoptera: Tortricidae) and its implications for classification and life history evolution. PLoS ONE. 2012;7:e35574. doi: 10.1371/journal.pone.0035574. [DOI] [PMC free article] [PubMed] [Google Scholar]
Regier JC, Mitter C, Solis MA, Hayden JE, Landry B, Nuss M, Simonsen TJ, Yen S-H, Zwick A, Cummings MP. A molecular phylogeny for the pyraloid moths (Lepidoptera: Pyraloidea) and its implications for higher-level classification. Systematic Entomology. 2012;37:635–656. doi: 10.1111/j.1365-3113.2012.00641.x. [DOI] [Google Scholar]
Zwick A, Regier JC, Zwickl DJ. Resolving discrepancy between nucleotides and amino acids in deeplevel arthropod phylogenomics: differentiating serine codons in 21-amino-acid models. PLoS ONE. 2012;textbf7:e47450. doi: 10.1371/journal.pone.0047450. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gibb GC, Kennedy M, Penny D. Beyond phylogenetics and evolution: pelecaniform and Ciconiiform birds, and long-term niche stability. Molecular Phylogenetics and Evolution. 2013;68:229–238. doi: 10.1016/j.ympev.2013.03.021. [DOI] [PubMed] [Google Scholar]
Regier JC, Mitter C, Zwick A, Bazinet AL, Cummings MP, Kawahara AY, Sohn J-C, Zwickl DJ, Cho S, Davis DR, Baixeras J, Brown J, Parr C, Weller S, Lees DC, Mitter KT. A large-scale, higher-level, molecular phylogenetic study of the insect order Lepidoptera (Moths and Butterflies) PLoS ONE. 2013;8:e58568. doi: 10.1371/journal.pone.0058568. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rota-Stabelli O, Lartillot N, Philippe H, Pisani D. Serine codon-usage bias in deep phylogenomics: pancrustacean relationships as a case study. Systematic Biology. 2013;62:121–133. doi: 10.1093/sysbio/sys077. [DOI] [PubMed] [Google Scholar]
Sohn J-C, Regier JC, Mitter C, Davis D, Landry J-F, Zwick A, Cummings MP. A molecular phylogeny for Yponomeutoidea (Insecta, Lepidoptera, Ditrysia) and its implications for classification, biogeography and the evolution of host plant use. PLoS ONE. 2013;textbf8:e55066. doi: 10.1371/journal.pone.0055066. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lau KW. PhD thesis. University of Sydney, School of Biological Sciences;; 2009. Studies of methods used to infer molecular phylogeny: Dealing with the effect of compositional heterogeneity. [Google Scholar]
Guédon Y, d'Aubenton-Carafa Y, Thermes C. Analysing grouping of nucleotides in DNA sequences using lumped processes constructed from Markov chains. Journal of Mathematical Biology. 2006;52:343–372. doi: 10.1007/s00285-005-0358-y. [DOI] [PubMed] [Google Scholar]
Swofford DL, Olsen GJ, Waddell PJ, Hillis DM. In: Molecular Systematics. Hillis DM, Moritz C, Mable BK, editor. Sunderland: Sinauer Associates; 1996. Phylogenetic inference; pp. 407–514. [Google Scholar]
Nomenclature Committee of the International Union of Biochemistry, (NC-IUB) Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences: Recommendations 1984. Proceedings of the National Academy of Sciences of the United States of America. 1986;83:4–8. doi: 10.1073/pnas.83.1.4. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bryant D, Galtier N, Poursat MA. In: Mathematics evolution and phylogeny. Gascuel O, editor. New York: Oxford University Press; 2005. Likelihood calculation in molecular phylogenetics; pp. 33–92. [Google Scholar]
Jayaswal V, Jermiin LS, Robinson J. Estimation of phylogeny using a general Markov model. Evolutionary Bioinformatics. 2005;1:62–80. [PMC free article] [PubMed] [Google Scholar]
Ababneh F, Jermiin LS, Ma C, Robinson J. Matched-pairs tests of homogeneity with applications to homologous nucleotide sequences. Bioinformatics. 2006;22:1225–1231. doi: 10.1093/bioinformatics/btl064. [DOI] [PubMed] [Google Scholar]
Jermiin LS, Jayaswal V, Ababneh F, Robinson J. In: Bioinformatics: Data, sequence analysis, and evolution − Volume 1. Keith J, editor. Humana Press. Totawa; 2008. Phylogenetic model evaluation; pp. 331–363. [Google Scholar]
Ababneh F, Jermiin LS, Robinson J. Generation of the exact distribution and simulation of matched nucleotide sequences on a phylogenetic tree. Journal of Mathematical Modelling and Algorithms. 2006;5:291–303. doi: 10.1007/s10852-005-9017-y. [DOI] [Google Scholar]
Iosifescu M. Finite Markov processes and their applications. Chichester: John Wiley and Sons, Ltd; 1980. [Google Scholar]
Kemeny JG, Snell JL. Finite Markov chains. New York: Springer-Verlag; 1983. [Google Scholar]
Jukes TH, Cantor CR. In: Mammalian Protein Metabolism. Munro HN, editor. Academic Press. New York; 1969. Evolution of protein molecules; pp. 21–132. [Google Scholar]
Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution. 1981;17:368–376. doi: 10.1007/BF01734359. [DOI] [PubMed] [Google Scholar]
Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution. 1980;16:111–120. doi: 10.1007/BF01731581. [DOI] [PubMed] [Google Scholar]
Kishino H, Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution. 1985;22:160–174. doi: 10.1007/BF02101694. [DOI] [PubMed] [Google Scholar]
Jernigan RW, Baran RH. Testing lumpability in Markov chains. Statistics and Probability Letters. 2003;64:17–23. doi: 10.1016/S0167-7152(03)00126-3. [DOI] [Google Scholar]
Schweder T, Spjotvoll E. Plots of P-values to evaluate many tests simultaneously. Biometrika. 1982;69:493–502. [Google Scholar]
Lanave C, Preparata G, Saccone C, Serio G. A new method for calculating evolutionary substitution rates. Journal of Molecular Evolution. 1984;20:86–93. doi: 10.1007/BF02101990. [DOI] [PubMed] [Google Scholar]
Tosi AJ, Detwiler KM, Disotell TR. X-chromosomal window into the evolutionary history of the guenons (Primates: Cercopithecini) Molecular Phylogenetics and Evolution. 2005;36:58–66. doi: 10.1016/j.ympev.2005.01.009. [DOI] [PubMed] [Google Scholar]
Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. Genbank. Nucleic Acids Research. 2013;41:D36–D42. doi: 10.1093/nar/gks1195. [DOI] [PMC free article] [PubMed] [Google Scholar]
Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Molecular Biology and Evolution. 2013;30:772–780. doi: 10.1093/molbev/mst010. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gouy M, Guindon S, Gascuel O. SeaView version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Molecular Biology and Evolution. 2010;27:221–224. doi: 10.1093/molbev/msp259. [DOI] [PubMed] [Google Scholar]

[B1] Irwin DM, Kocher TD, Wilson AC. Evolution of the cytochrome b gene in mammals. Journal of Molecular Evolution. 1991;32:128–144. doi: 10.1007/BF02515385. [DOI] [PubMed] [Google Scholar]

[B2] Adkins RM, Honeycutt RL. Molecular phylogeny of the superorder Arconta. Proceedings of the National Academy of Science of the United States of America. 1991;88:10317–10321. doi: 10.1073/pnas.88.22.10317. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B3] Adkins RM, Honeycutt RL. Evolution of the primate cytochrome c oxidase subunit II gene. Journal of Molecular Evolution. 1994;38:215–231. doi: 10.1007/BF00176084. [DOI] [PubMed] [Google Scholar]

[B4] Woese CR, Achenbach L, Rouviere P, Mandelco L. Archaeal phylogeny: reexamination of the phylogenetic position of Archaeoglobus fulgidus in light of certain composition-induced artifacts. Systematic and Applied Microbiology. 1991;14:364–371. doi: 10.1016/S0723-2020(11)80311-5. [DOI] [PubMed] [Google Scholar]

[B5] Phillips MJ, Penny D. The root of the mammalian tree inferred from whole mithocondrial genomes. Molecular Phylogenetics and Evolution. 2003;28:171–185. doi: 10.1016/S1055-7903(03)00057-5. [DOI] [PubMed] [Google Scholar]

[B6] Cavender JA, Felsenstein J. Invariants of phylogenies in a simple case with discrete states. Journal of Classification. 1987;4:57–71. doi: 10.1007/BF01890075. [DOI] [Google Scholar]

[B7] Gibson A, Gowri-Shankar V, Higgs PG, Rattray M. A comprehensive analysis of mammalian mithochondrial genome base composition and improved phylogenetic methods. Molecular Biology and Evolution. 2005;22:251–264. doi: 10.1093/molbev/msi012. [DOI] [PubMed] [Google Scholar]

[B8] Millen RS, Olmstead RG, Adams KL, Palmer JD, Lao NT, Heggie L, Kavanagh TA, Hibberd JM, Gray JC, Morden CW, Calie PJ, Jermiin LS, Wolfe KH. Many parallel losses of infA from chloroplast DNA during angiosperm evolution with multiple independent transfers to the nucleus. The Plant Cell. 2001;13:645–658. doi: 10.1105/tpc.13.3.645. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B9] Phillips MJ, Lin YH, Harrison GL, Penny D. Mitochondrial genomes of a bandicoot and a brushtail possum confirm the monophyly of australidelphian marsupials. Proceedings of the Royal Society London Series B. 2001;268:1533–1538. doi: 10.1098/rspb.2001.1677. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B10] Kosiol C, Goldman N, Buttimore NH. A new criterion and method for amino acid classification. Journal of Theoretical Biology. 2004;228:97–106. doi: 10.1016/j.jtbi.2003.12.010. [DOI] [PubMed] [Google Scholar]

[B11] Kosiol C. PhD thesis. University of Cambridge; 2006. Markov models for protein sequence evolution. [Google Scholar]

[B12] Phillips MJ, Delsuc F, Penny D. Genome-scale phylogeny and the detection of systematic biases. Molecular Biology and Evolution. 2004;21:1455–1458. doi: 10.1093/molbev/msh137. [DOI] [PubMed] [Google Scholar]

[B13] Ho JWK, Adams CE, Lew JB, Matthews TJ, Ng CC, Shahabi-Sirjani A, Tan LH, Zhao Y, Easteal S, Wilson SR, Jermiin LS. SeqVis: Visualization of compositional heterogeneity in large alignments of nucleotides. Bioinformatics. 2006;221:2162–2163. doi: 10.1093/bioinformatics/btl283. [DOI] [PubMed] [Google Scholar]

[B14] Susko E, Roger AJ. On reduced amino acid alphabets for phylogenetic inference. Molecular Biology and Evolution. 2007;24:2139–2150. doi: 10.1093/molbev/msm144. [DOI] [PubMed] [Google Scholar]

[B15] Anisimova M, Kosiol C. Investigating protein-coding sequence evolution with probabilistic codon substitution models. Molecular Biology and Evolution. 2004;26:255–271. doi: 10.1093/molbev/msn232. [DOI] [PubMed] [Google Scholar]

[B16] Masta SE, Longhorn SJ, Boore JL. Arachnid relationships based on mitochondrial genomes: asymmetric nucleotide and amino acid bias affects phylogenetic analyses. Molecular Phylogenetics and Evolution. 2009;50:117–128. doi: 10.1016/j.ympev.2008.10.010. [DOI] [PubMed] [Google Scholar]

[B17] Phillips MJ, Gibb GC, Crimp EA, Penny D. Tinamous and moa flock together: mitochondrial genome sequence analysis reveals independent losses of flight among ratites. Systematic Biology. 2010;59:90–107. doi: 10.1093/sysbio/syp079. [DOI] [PubMed] [Google Scholar]

[B18] Regier JC, Shultz JW, Zwick A, Hussey A, Ball B, Wetzer R, Martin JW, Cunningham CW. Arthropod relationships revealed by phylogenomic analysis of nuclear protein-coding sequences. Nature. 2010;463:1079–1083. doi: 10.1038/nature08742. [DOI] [PubMed] [Google Scholar]

[B19] Criscuolo A, Gribaldo S. Large-scale phylogenomic analyses indicate a deep origin of primary plastids within Cyanobacteria. Molecular Biology and Evolution. 2011;28:3019–3032. doi: 10.1093/molbev/msr108. [DOI] [PubMed] [Google Scholar]

[B20] Regier JC, Zwick A. Sources of signal in 62 protein-coding nuclear genes for higher-level phylogenetics of arthropods. PLoS ONE. 2011;6:e23408. doi: 10.1371/journal.pone.0023408. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B21] Cho S, Zwick A, Regier JC, Mitter C, Cummings MP, Yao J, Du Z, Zhao H, Kawahara AY, Weller S, Davis DR, Baixeras J, Brown JW, Parr C. Can deliberately incomplete gene sample augmentation improve a phylogeny estimate for the advanced moths and butterflies (Hexapoda: Lepidoptera)? Systematic Biology. 2011;60:782–796. doi: 10.1093/sysbio/syr079. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B22] White NE, Phillips MJ, Gilbert MTP, Alfaro-Nunez A, Willerslev E, Mawson PR, Spencer PBS, Bunce M. The evolutionary history of cockatoos (Aves: Psittaciformes: Cacatuidae) Molecular Phylogenetics and Evolution. 2011;59:615–622. doi: 10.1016/j.ympev.2011.03.011. [DOI] [PubMed] [Google Scholar]

[B23] Zwick A, Regier JC, Cummings MP, Mitter C. Increased gene sampling yields robust support for higher-level clades within Bombycoidea (Lepidoptera) Systematic Entomology. 2011;36:31–43. doi: 10.1111/j.1365-3113.2010.00543.x. [DOI] [Google Scholar]

[B24] Niehuis O, Hartig G, Grath S, Pohl H, Lehmann J, Tafer H, Donath A, Krauss V, Eisenhardt C, Hertel J, Petersen M, Mayer C, Meusemann K, Peters RS, Stadler PF, Beutel RG, Bornberg-Bauer E, McKenna DD, Misof B. Genomic and morphological evidence converge to resolve the enigma of Strepsiptera. Current Biology. 2012;22:1309–1313. doi: 10.1016/j.cub.2012.05.018. [DOI] [PubMed] [Google Scholar]

[B25] Regier JC, Brown JW, Mitter C, Baixeras J, Cho S, Cummings MP, Zwick A. A molecular phylogeny for the leaf-roller moths (Lepidoptera: Tortricidae) and its implications for classification and life history evolution. PLoS ONE. 2012;7:e35574. doi: 10.1371/journal.pone.0035574. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B26] Regier JC, Mitter C, Solis MA, Hayden JE, Landry B, Nuss M, Simonsen TJ, Yen S-H, Zwick A, Cummings MP. A molecular phylogeny for the pyraloid moths (Lepidoptera: Pyraloidea) and its implications for higher-level classification. Systematic Entomology. 2012;37:635–656. doi: 10.1111/j.1365-3113.2012.00641.x. [DOI] [Google Scholar]

[B27] Zwick A, Regier JC, Zwickl DJ. Resolving discrepancy between nucleotides and amino acids in deeplevel arthropod phylogenomics: differentiating serine codons in 21-amino-acid models. PLoS ONE. 2012;textbf7:e47450. doi: 10.1371/journal.pone.0047450. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B28] Gibb GC, Kennedy M, Penny D. Beyond phylogenetics and evolution: pelecaniform and Ciconiiform birds, and long-term niche stability. Molecular Phylogenetics and Evolution. 2013;68:229–238. doi: 10.1016/j.ympev.2013.03.021. [DOI] [PubMed] [Google Scholar]

[B29] Regier JC, Mitter C, Zwick A, Bazinet AL, Cummings MP, Kawahara AY, Sohn J-C, Zwickl DJ, Cho S, Davis DR, Baixeras J, Brown J, Parr C, Weller S, Lees DC, Mitter KT. A large-scale, higher-level, molecular phylogenetic study of the insect order Lepidoptera (Moths and Butterflies) PLoS ONE. 2013;8:e58568. doi: 10.1371/journal.pone.0058568. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B30] Rota-Stabelli O, Lartillot N, Philippe H, Pisani D. Serine codon-usage bias in deep phylogenomics: pancrustacean relationships as a case study. Systematic Biology. 2013;62:121–133. doi: 10.1093/sysbio/sys077. [DOI] [PubMed] [Google Scholar]

[B31] Sohn J-C, Regier JC, Mitter C, Davis D, Landry J-F, Zwick A, Cummings MP. A molecular phylogeny for Yponomeutoidea (Insecta, Lepidoptera, Ditrysia) and its implications for classification, biogeography and the evolution of host plant use. PLoS ONE. 2013;textbf8:e55066. doi: 10.1371/journal.pone.0055066. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B32] Lau KW. PhD thesis. University of Sydney, School of Biological Sciences;; 2009. Studies of methods used to infer molecular phylogeny: Dealing with the effect of compositional heterogeneity. [Google Scholar]

[B33] Guédon Y, d'Aubenton-Carafa Y, Thermes C. Analysing grouping of nucleotides in DNA sequences using lumped processes constructed from Markov chains. Journal of Mathematical Biology. 2006;52:343–372. doi: 10.1007/s00285-005-0358-y. [DOI] [PubMed] [Google Scholar]

[B34] Swofford DL, Olsen GJ, Waddell PJ, Hillis DM. In: Molecular Systematics. Hillis DM, Moritz C, Mable BK, editor. Sunderland: Sinauer Associates; 1996. Phylogenetic inference; pp. 407–514. [Google Scholar]

[B35] Nomenclature Committee of the International Union of Biochemistry, (NC-IUB) Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences: Recommendations 1984. Proceedings of the National Academy of Sciences of the United States of America. 1986;83:4–8. doi: 10.1073/pnas.83.1.4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B36] Bryant D, Galtier N, Poursat MA. In: Mathematics evolution and phylogeny. Gascuel O, editor. New York: Oxford University Press; 2005. Likelihood calculation in molecular phylogenetics; pp. 33–92. [Google Scholar]

[B37] Jayaswal V, Jermiin LS, Robinson J. Estimation of phylogeny using a general Markov model. Evolutionary Bioinformatics. 2005;1:62–80. [PMC free article] [PubMed] [Google Scholar]

[B38] Ababneh F, Jermiin LS, Ma C, Robinson J. Matched-pairs tests of homogeneity with applications to homologous nucleotide sequences. Bioinformatics. 2006;22:1225–1231. doi: 10.1093/bioinformatics/btl064. [DOI] [PubMed] [Google Scholar]

[B39] Jermiin LS, Jayaswal V, Ababneh F, Robinson J. In: Bioinformatics: Data, sequence analysis, and evolution − Volume 1. Keith J, editor. Humana Press. Totawa; 2008. Phylogenetic model evaluation; pp. 331–363. [Google Scholar]

[B40] Ababneh F, Jermiin LS, Robinson J. Generation of the exact distribution and simulation of matched nucleotide sequences on a phylogenetic tree. Journal of Mathematical Modelling and Algorithms. 2006;5:291–303. doi: 10.1007/s10852-005-9017-y. [DOI] [Google Scholar]

[B41] Iosifescu M. Finite Markov processes and their applications. Chichester: John Wiley and Sons, Ltd; 1980. [Google Scholar]

[B42] Kemeny JG, Snell JL. Finite Markov chains. New York: Springer-Verlag; 1983. [Google Scholar]

[B43] Jukes TH, Cantor CR. In: Mammalian Protein Metabolism. Munro HN, editor. Academic Press. New York; 1969. Evolution of protein molecules; pp. 21–132. [Google Scholar]

[B44] Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution. 1981;17:368–376. doi: 10.1007/BF01734359. [DOI] [PubMed] [Google Scholar]

[B45] Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution. 1980;16:111–120. doi: 10.1007/BF01731581. [DOI] [PubMed] [Google Scholar]

[B46] Kishino H, Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution. 1985;22:160–174. doi: 10.1007/BF02101694. [DOI] [PubMed] [Google Scholar]

[B47] Jernigan RW, Baran RH. Testing lumpability in Markov chains. Statistics and Probability Letters. 2003;64:17–23. doi: 10.1016/S0167-7152(03)00126-3. [DOI] [Google Scholar]

[B48] Schweder T, Spjotvoll E. Plots of P-values to evaluate many tests simultaneously. Biometrika. 1982;69:493–502. [Google Scholar]

[B49] Lanave C, Preparata G, Saccone C, Serio G. A new method for calculating evolutionary substitution rates. Journal of Molecular Evolution. 1984;20:86–93. doi: 10.1007/BF02101990. [DOI] [PubMed] [Google Scholar]

[B50] Tosi AJ, Detwiler KM, Disotell TR. X-chromosomal window into the evolutionary history of the guenons (Primates: Cercopithecini) Molecular Phylogenetics and Evolution. 2005;36:58–66. doi: 10.1016/j.ympev.2005.01.009. [DOI] [PubMed] [Google Scholar]

[B51] Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. Genbank. Nucleic Acids Research. 2013;41:D36–D42. doi: 10.1093/nar/gks1195. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B52] Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Molecular Biology and Evolution. 2013;30:772–780. doi: 10.1093/molbev/mst010. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B53] Gouy M, Guindon S, Gascuel O. SeaView version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Molecular Biology and Evolution. 2010;27:221–224. doi: 10.1093/molbev/msp259. [DOI] [PubMed] [Google Scholar]

PERMALINK

Statistical tests to identify appropriate types of nucleotide sequence recoding in molecular phylogenetics

Victor A Vera-Ruiz

Kwok W Lau

John Robinson

Lars S Jermiin

Supplement

Conference

Abstract

Background

Results

Conclusions

Introduction

Methods

The theoretical basis for recoding nucleotides

Table 1.

The evolutionary process for two homologous nucleotide sequences

Lumpable Markov chains

Table 2.

Tests for lumpability

Index test

Markov chain test

Likelihood-ratio test

Results

Assessment of accuracy

Figure 1.

Comparisons of power

Figure 2.

Cases with more than 2 homologous sequences

Figure 3.

Figure 4.

The effect of non-lumpability on phylogenetic estimates

Figure 5.

Example 1 -- Primate mitochondrial DNA

Figure 6.

Example 2 -- Primate nuclear DNA

Figure 7.

Figure 8.

Conclusions

Competing interests

Authors' contributions

Contributor Information

Acknowledgements

Declarations

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases