Abstract
Continuous-time Markov processes are often used to model the complex natural phenomenon of sequence evolution. To make the modelling of sequence evolution tractable, simplifying assumptions are often made about the sequence properties and the underlying process. The validity of one such assumption, time-homogeneity, has never been explored. Violations of this assumption can be found by identifying non-embeddability. A process is non-embeddable if it cannot be embedded in a continuous time-homogeneous Markov process. In this study, non-embeddability was demonstrated to exist when modelling sequence evolution with Markov models. Evidence of non-embeddability was found primarily at the third codon position, possibly resulting from changes in mutation rate over time. Outgroup edges and those with a deeper time depth were found to have an increased probability of the underlying process being non-embeddable. Overall, low levels of non-embeddability were detected when examining individual edges of triads across a diverse set of alignments. Subsequent phylogenetic reconstruction analyses demonstrated that non-embeddability could affect the correct inference of phylogenies, but at extremely low levels. Despite the existence of non-embeddability, there is minimal evidence of violations of the local time-homogeneity assumption and consequently the impact is likely to be minor.
Introduction
DNA sequences are widely used to infer evolutionary relationships among species, genes, and genomes. When modelling sequence evolution, as with other complex natural phenomena, simplifying assumptions are made for efficient computation. For sequence evolution, maximum likelihood estimation under a probabilistic model is most common, because maximum likelihood estimation is statistically consistent (provided the underlying model is identifiable). All probabilistic models of sequence evolution generally adopt a set of simplifying assumptions relating to the sequence properties and the evolutionary process to make the models computationally tractable and statistically efficient. Markov models are commonly used and make the fundamental assumption that sites evolve independently according to a Markov process. The Markov chain is often assumed to be stationary, reversible, continuous and time-homogeneous. Stationarity assumes the process is in equilibrium, resulting in equivalent ancestral and stationary base frequencies. Reversibility presumes the process appears identical when moving forward or backward in time, resulting in symmetric joint frequencies of ancestral and descendant bases. Continuity assumes the time interval between successive substitutions can be any positive number. Time-homogeneity means substitution rates are fixed over time, described by an instantaneous rate matrix (Q). A globally homogeneous process assumes that all branches share the same rate matrix. To relax the assumption of global time-homogeneity, some approaches now allow separate substitution rate matrices for each branch of the tree (local time-homogeneity).
These computationally useful assumptions are in contrast to what is understood of the biological reality; for example, compositional changes in base frequencies are a feature of sequence evolution [1]. When assumptions are violated and the model cannot account for the confounding signals in the data, the inferred results have been demonstrated to be inconsistent and erroneous (e.g. [2]–[11]). Such studies revealed violations of the assumption(s) tested (i.e. model misspecification) and demonstrated that these violations increase error rates and can result in the inference of the wrong tree topology and evolutionary distances. In contrast to these other assumptions, the validity of the assumption of local time-homogeneity has yet to be explored, and so it is examined in this study.
Sequences may have evolved from a homogeneous or inhomogeneous, time-continuous or discrete process. However, because modelling a time-continuous inhomogeneous process is statistically infeasible, homogeneity is assumed when the widely implemented time-continuous models are used. The alternative can, to some extent, be captured by a discrete process. This alternative process could be time-continuous but inhomogeneous, or simply discrete. The most commonly implemented discrete model was proposed by [12] and is referred to in this study as the BH model. Their approach makes only the assumption of process-homogeneity, but does not assume continuity, time-homogeneity (local or global), reversibility or stationarity. The BH model formulation has no instantaneous rate matrices (Q), but uses an independent transition probability matrix (P) for each edge. If the process captured in the transition matrix (P) is a discrete manifestation of an underlying time-homogeneous and continuous Markov chain, then the relationship
P(t) = e^{Qt}     (1)
holds (where e^{Qt} is the matrix exponential of Qt). If the relationship holds then the process is said to be embeddable and can in fact be modelled as continuous and time-homogeneous. Conversely, if the assumption of homogeneity is violated, the relationship in (1) does not hold and the underlying process is said to be non-embeddable. It could be non-embeddable because the process is discrete, or because it is continuous but time-inhomogeneous, such that a P exists that describes the process but no single valid instantaneous rate matrix Q satisfies (1).
Q and P matrices satisfying (1) must have certain characteristics in order to be valid Markov matrices. The substitution rate matrix Q is normally constrained to satisfy three conditions: it must have non-negative off-diagonal entries (q_{ij} ≥ 0 for i ≠ j); its rows must sum to 0 (Σ_j q_{ij} = 0 for i = 1, …, n, where n is the dimension of Q, n = 4 for nucleotides); and it is calibrated using the base frequencies π_i so that the expected substitution rate is one. The transition probability matrix P is defined to have non-negative entries and rows that sum to 1 (Σ_j p_{ij} = 1 for i = 1, …, n). A validly defined Q matrix will produce a valid P [13]. However, the reverse is not true. Q = log(P)/t, the converse of (1), can result in a valid Q, an invalid Q (a Q with negative off-diagonals), or fail to produce a Q at all (the matrix logarithm of P cannot be calculated). In the cases where no valid Q can be produced, P is non-embeddable and cannot be embedded into a continuous and time-homogeneous chain.
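The relationship in (1) and its converse can be checked numerically. The sketch below (NumPy/SciPy, using a hypothetical rate matrix rather than one estimated from sequence data) exponentiates a valid Q to obtain P and then tests whether the matrix logarithm of P recovers a valid rate matrix; a failure of that test is the signature of non-embeddability used later in this study.

```python
# A minimal numeric sketch of relationship (1): P is embeddable when the
# matrix logarithm of P is a valid rate matrix Q (non-negative off-diagonals,
# zero row sums). The Q below is hypothetical, not estimated from the data.
import numpy as np
from scipy.linalg import expm, logm

def is_valid_rate_matrix(Q, tol=1e-8):
    """Non-negative off-diagonal rates and rows summing to zero."""
    off_diagonals_ok = np.all(Q - np.diag(np.diag(Q)) >= -tol)
    rows_sum_to_zero = np.allclose(Q.sum(axis=1), 0.0, atol=1e-6)
    return off_diagonals_ok and rows_sum_to_zero

Q_true = np.array([[-0.9,  0.3,  0.4,  0.2],
                   [ 0.2, -0.7,  0.3,  0.2],
                   [ 0.4,  0.3, -0.9,  0.2],
                   [ 0.2,  0.3,  0.4, -0.9]])
P = expm(Q_true)                      # an embeddable transition matrix
Q_back = np.real(logm(P))             # converse of (1): recover Q from P
print(is_valid_rate_matrix(Q_back))   # True; a failure here would flag
                                      # P as a candidate non-embeddable matrix
```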
The question of how to formally determine if a transition matrix (P) is embeddable is known as the embedding (or imbedding) problem and was first described in [14]. That study established sufficient conditions for embeddability of a 2×2 matrix. Further investigations have been carried out into the sufficient conditions for embeddability for the 3×3 case for both time-homogeneous and non-homogeneous processes [15]–[21]. The complicating issue is that for matrices of higher dimension there are no simple, general sufficient conditions for establishing whether P is embeddable. A set of simple steps for the general case was put forward by [22] to enable the identification of P matrices that are non-embeddable. The results of that study have been widely implemented in sociology for analyses with Markov processes and are adopted here.
Non-embeddability will occur where there is a need for different instantaneous rate matrices (Q) within a branch (see Figure 1), caused either by a discrete process or by a time-continuous but inhomogeneous process. Evident from Figure 1 is that a natural control exists when modelling sequence evolution with an unrooted tree: the outgroup edge of any unrooted tree combines two rate matrices, reflecting that this branch spans both forward and backward time. Consequently, it is expected that if non-embeddability exists then it is most likely to be found on the outgroup edge. In addition, it was suspected that non-embeddability was more likely to exist on edges with larger time depths.
The aim of this study was first to determine whether there was evidence of violations of the assumption of local time-continuous homogeneity, by establishing the existence of non-embeddability. Secondly, the study sought to determine the extent and effect of the occurrence of non-embeddability when modelling evolutionary processes with a time-homogeneous continuous Markovian model. Species triads from across the tree of life were analysed for evidence of non-embeddability. Due to unequal selection and mutagenesis pressures at the different codon positions, protein-coding alignments were divided into codon positions. At each codon position, all edges in each sequence triad were tested separately for evidence of non-embeddability by allowing each individual edge to have independent Q and P matrices. The evidence of non-embeddability was gathered by assessing the characteristics of the P (and Q) matrices. A parametric bootstrap approach comparing the log-likelihood ratio statistic (logLR) was then used to determine if there was a difference in model fit for those alignments where a non-embeddable P was identified. A significant difference in model fit confirmed the violation of the assumption of time-homogeneity and that the process was non-embeddable. Once the existence of non-embeddability had been demonstrated, its effect, and consequently the effect of violating the assumption of local time-homogeneity, on phylogenetic reconstruction was explored.
Materials and Methods
Data
Four diverse datasets were used to test for the existence of non-embeddability across the tree of life. The characteristics of the datasets are summarised in Table 1. All datasets contained orthologs for at least three taxa and had distinct outgroup(s). Species triads were employed because of the consistency property of maximum likelihood tree reconstruction: the joint distribution of three terminal nodes is enough to determine the full model [23]. For each data set, the sequences were aligned using the progressive aligner from PyCogent [24], with all ambiguous sites and gaps removed. Protein-coding sequences were separated into codon positions due to the different selection pressures at each position [25], [26]. The identification of non-embeddability at a particular codon position gives an indication of whether the violation of the continuous time-homogeneous assumption is caused by a change in mutation or selection rates. The datasets span both vertebrates and microbes in order to fully investigate the existence of non-embeddability.
Table 1. Summary of Datasets.
Data Set | Taxa a | Number of Alignments | Sequence Length (bp) | Total Tree Length b |
D1: Nuclear protein- coding genes | O,M,H | 8193 | >300 | 1.7081 |
O,M,R | 8014 | >300 | 1.5622 | |
H,M,R | 8394 | >300 | 0.6890 | |
D2: Mitochondrial protein-coding genes | M, H, O | 11 | 67–598 | 3.7267 |
D3: Primate introns | C, H, Ma | 62 | >50,000 | 0.0763 |
D4: Microbial protein- coding genes | bad, bas, bba, bbu, bpn, bvu, cjk, dps, eca, ent, kra, lla, lre, mgi, mle, mta, ppe, pth, sma, wsu d | 1 | 591–867 | 1.935 |
a – C: Chimpanzee, H: Human, M: Mouse, Ma: Macaque, O: Opossum, R: Rat, bad: Bifidobacterium adolescentis, bas: Buchnera aphidicola Sg, bba: Bdellovibrio bacteriovorus, bbu: Borrelia burgdorferi B31, bpn: Candidatus Blochmannia pennsylvanicus, bvu: Bacteroides vulgatus, cjk: Corynebacterium jeikeium, dps: Desulfotalea psychrophila, eca: Pectobacterium atrosepticum, ent: Enterobacter sp. 638, kra: Kineococcus radiotolerans, lla: Lactococcus lactis subsp. lactis IL1403, lre: Lactobacillus reuteri DSM 20016, mgi: Mycobacterium gilvum, mle: Mycobacterium leprae TN, mta: Moorella thermoacetica, ppe: Pediococcus pentosaceus, pth: Pelotomaculum thermopropionicum, sma: Streptomyces avermitilis, wsu: Wolinella succinogenes; b – average length from consensus tree; d – all possible triads (1140).
Vertebrates
The vertebrate alignments were obtained from Ensembl release 58, except for the intron dataset, which was obtained from Ensembl release 50. The sampling process for the intron dataset is described in detail in [27]. The first data set (D1) was used to investigate whether elapsed evolutionary time (time depth) influenced the existence of non-embeddability. This was investigated by using three triads with varied time depths between taxa. The triad of human, mouse and opossum had the longest time depth between all taxa, with opossum functioning as the outgroup. The triad of mouse, rat and opossum had a shorter time depth between the two ingroup taxa. The final triad, consisting only of Eutherian taxa (mouse, rat, human), had the shortest time depth between all taxa, with human functioning as the outgroup. Whether non-embeddability occurred in both the nuclear and mitochondrial genomes was explored using datasets D1 (nuclear) and D2 (mitochondrial). The intronic dataset (D3) was included to examine whether sequence function (coding/non-coding) affected the presence of non-embeddability. The dinucleotide model was used to analyse this dataset, as it has been demonstrated to give an improved model fit for these data [27].
Microbes
A single microbial protein-coding gene was selected to assess the extent of non-embeddability across a range of species. Twenty microbial species with differing estimated evolutionary divergences were randomly chosen from an aligned set of 197 microbial species for the gene encoding translation initiation factor IF-2, originally extracted from the KEGG database [28]. The 197 species were originally selected as they each had at least 500 orthologs from a set of 2226 orthologs that spanned at least 60 species. All 1140 possible triads of the 20 species for this gene were investigated for evidence of non-embeddability.
Substitution Models
For each triad, every edge was modelled separately assuming either a discrete or a continuous time-homogeneous process. The assumptions for each process can be found in Table 2. Two differing Markov substitution models were used to test each edge for non-embeddability. The first model (herein referred to as the mixed model) assumed a continuous and time-homogeneous process on the edge being tested for non-embeddability, while all other edges in this model and all edges in the second model (the discrete model) were modelled as discrete using the BH model (see Table 3). The models for the discrete and continuous processes applied to individual edges are described in the next section. All edges in both models were assumed to have a process that was independently and identically distributed across sites. If the process on an edge being tested was non-embeddable (i.e. generated by multiple Q; see Figure 1) then a discrete model not assuming time-homogeneity for that edge will produce a non-embeddable P and have a better model fit than the mixed model. Conversely, if a single Q accurately describes the underlying process on an edge then the discrete model will generate an embeddable P and will have the same model fit as the mixed model.
Table 2. Markov Process Assumptions for an Edge.
Assumption | Continuous | Discrete (BH) |
Time-Homogeneity | √ | X
Reversibility | X | X
Stationarity | X | X
Independent Sites | √ | √ |
Table 3. Summary of The Two Markov Models.
Edge | Testeda | Mixed Model | Discrete Model |
1 | Yes | Continuous b | Discrete |
2 | No | Discrete | Discrete |
3 | No | Discrete | Discrete |
a – tested for non-embeddability; b – assumption of local time-homogeneity.
Maximum likelihood was used to obtain the model fit and parameter estimates for both models. The likelihood function was optimised using two optimisation approaches available in the PyCogent toolkit: the Powell method [29] and simulated annealing (global optimisation) [30]. Initial parameter estimates for the mixed model were provided by a continuous, globally time-homogeneous model to help ensure successful optimisation. The parameter values for all edges from the mixed model were subsequently used as initial starting values for the discrete model. Providing initial parameter estimates to BH models is recommended, as the algorithm is known to converge to local maxima if the initial values used for the P matrices are not diagonally dominant or if the rate of convergence is too slow [31]–[33]. To check the stability of the original mixed model parameter estimates, parameter estimates from the discrete model were used as the starting values for a second optimisation of the mixed model. If a non-embeddable matrix was found using the discrete model, then there was no valid Q matrix to use as starting parameters for the mixed model. Consequently, the nearest valid P for this edge was found by minimising the Frobenius norm of the difference between the non-embeddable P and the estimated nearest embeddable P, i.e. a P that produces a valid Q (see Appendix A in Supporting Information S1). The logLR for the second optimisation of the mixed model was then compared with the original estimate to ensure stability and correct optimisation. The overall testing scheme is displayed in Figure 2.
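The nearest-valid-P search mentioned above can be sketched as a small constrained optimisation. The sketch assumes 4 × 4 nucleotide matrices and an L-BFGS-B optimiser; the authors' actual procedure is the one described in Appendix A of Supporting Information S1, so this is only an illustrative stand-in.

```python
# A hedged sketch of the nearest-embeddable-P step: minimise the Frobenius
# norm ||exp(Q) - P_ne||_F over rate matrices Q with non-negative off-diagonal
# entries and zero row sums. Not the study's implementation (see Appendix A).
import numpy as np
from scipy.linalg import expm
from scipy.optimize import minimize

def build_Q(off_diag_rates):
    """Assemble a 4x4 rate matrix from its 12 off-diagonal rates."""
    Q = np.zeros((4, 4))
    Q[~np.eye(4, dtype=bool)] = off_diag_rates
    np.fill_diagonal(Q, -Q.sum(axis=1))   # rows sum to zero
    return Q

def nearest_embeddable(P_ne):
    """Return (Q, P) where P = expm(Q) is closest to P_ne in Frobenius norm."""
    objective = lambda x: np.linalg.norm(expm(build_Q(x)) - P_ne, ord='fro')
    result = minimize(objective, x0=np.full(12, 0.1),
                      bounds=[(0.0, None)] * 12, method='L-BFGS-B')
    Q_near = build_Q(result.x)
    return Q_near, expm(Q_near)
```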
Continuous and Discrete Markov Processes
There were 39 unknown parameters in both the mixed and fully discrete nucleotide models (12 for each P and 3 for the base frequencies π_i, which sum to 1) and 735 unknown parameters for the dinucleotide case (240 for each P and 15 for the dinucleotide frequencies). Each P was produced assuming either a discrete (BH model) or a continuous process. Under the BH model, a P matrix is calculated based on the joint probability distribution of the nucleotides at each end of an edge. The likelihood was maximised using a system of iterative equations for the joint probability distributions along each edge. This approach for an unrooted triple of sequences from three species is well described in [33] and in more general terms in [32].
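The parameter counts follow directly from the free entries of each matrix: every row of a 4 × 4 stochastic P contributes three free probabilities (12 per edge) and every row of a 16 × 16 dinucleotide P contributes fifteen (240 per edge), plus the free equilibrium frequencies:

```latex
3 \times 12 + 3 = 39 \quad\text{(nucleotide)}, \qquad 3 \times 240 + 15 = 735 \quad\text{(dinucleotide)}.
```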
For a continuous process, the time-homogeneous transition probabilities, P(t), are governed by the forward Kolmogorov equation and the initial condition:
dP(t)/dt = P(t)Q,    P(0) = I     (2)
where P(t) and Q are n × n matrices, and Q has non-negative off-diagonal entries q_{ij} (i ≠ j) and diagonal entries q_{ii} = −Σ_{j≠i} q_{ij}, so that each row sums to zero.
The functions P(t), which are solutions of (2), comprise the transition matrices of a time-homogeneous continuous Markov chain. Solutions of (2) are given by:
P(t) = e^{Qt}     (3)
If the time factor is absorbed into Q (setting t = 1) then (3) becomes P = e^{Q}. Constrained optimisation is used to find a valid Q [34], which is then exponentiated to find an estimate of P.
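The solution (3) can be checked numerically against the forward Kolmogorov equation (2); the sketch below uses an arbitrary, hypothetical Q rather than one estimated from the alignments, and confirms that the resulting P(t) is stochastic.

```python
# Numerically integrate dP/dt = P(t) Q from P(0) = I and compare with expm(Qt).
import numpy as np
from scipy.linalg import expm
from scipy.integrate import solve_ivp

Q = np.array([[-1.0,  0.4,  0.3,  0.3],
              [ 0.4, -1.0,  0.3,  0.3],
              [ 0.3,  0.3, -1.0,  0.4],
              [ 0.3,  0.3,  0.4, -1.0]])
t = 0.5

def forward_kolmogorov(_, p_flat):
    P = p_flat.reshape(4, 4)
    return (P @ Q).ravel()

ode = solve_ivp(forward_kolmogorov, (0.0, t), np.eye(4).ravel(),
                rtol=1e-10, atol=1e-12)
P_ode = ode.y[:, -1].reshape(4, 4)
P_exp = expm(Q * t)                              # the closed form in (3)

print(np.allclose(P_ode, P_exp, atol=1e-6))      # True: same solution
print(np.allclose(P_exp.sum(axis=1), 1.0))       # True: rows of P(t) sum to 1
```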
Embeddability
Let P be a time-homogeneous transition matrix for a discrete Markov chain with a finite number of states. If P is a discrete manifestation of a continuous and time-homogeneous Markov chain, then P is said to be embeddable and, consequently, Q is said to generate P such that P(t) = e^{Qt}, as in (1). However, this only holds if P can be embedded in a continuous Markov process. Whether P is embeddable can be determined by mathematically assessing the characteristics of the P and Q matrices.
The steps for determining whether a transition matrix P of dimension greater than three is embeddable, as adopted in this study, are as follows:
The negative eigenvalues of P must have even algebraic multiplicity [14], [16], [22].
Any complex eigenvalues of P must occur in conjugate pairs [16], [22].
Examine Q = log(P) for negative off-diagonals [37].
These steps are necessary but not sufficient, and formed the first stage of identifying non-embeddability in this study. This stage produced a reduced set of alignments for which an edge had a non-embeddable P identified. This set of alignments was then examined using a parametric bootstrap.
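An illustrative checker for these necessary conditions is sketched below (NumPy/SciPy). The −0.1 off-diagonal threshold mirrors the tolerance adopted later in this study; passing all checks does not establish embeddability, it only fails to rule it out.

```python
import numpy as np
from scipy.linalg import logm

def flags_non_embeddable(P, off_diag_threshold=-0.1, tol=1e-8):
    """Return True if P fails a necessary condition for embeddability."""
    eigvals = np.linalg.eigvals(P)

    # Real negative eigenvalues must occur with even algebraic multiplicity.
    n_negative_real = sum(1 for ev in eigvals
                          if abs(ev.imag) < tol and ev.real < -tol)
    if n_negative_real % 2 == 1:
        return True

    # Complex eigenvalues must occur in conjugate pairs.
    for ev in eigvals:
        if abs(ev.imag) >= tol and not np.any(np.abs(eigvals - np.conj(ev)) < 1e-6):
            return True

    # The matrix logarithm should give non-negative off-diagonal rates;
    # substantial negative entries flag a candidate non-embeddable matrix.
    try:
        Q = np.real(logm(P))
    except Exception:                 # the logarithm could not be computed
        return True
    off_diagonals = Q[~np.eye(Q.shape[0], dtype=bool)]
    return bool(np.any(off_diagonals < off_diag_threshold))
```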
Parametric Bootstrap
A parametric bootstrap scheme was implemented for two reasons. The first was that the distribution of the logLR is unknown, because the two models (mixed and discrete) have identical numbers of parameters. Therefore, to determine whether there is a significant difference in model fit, a parametric bootstrap is required to establish the null distribution for the logLR. Secondly, the parametric bootstrap enables the determination of whether a non-embeddable matrix found when examining the characteristics of P and Q is caused by a truly non-embeddable process. Maximum likelihood estimates of P may indicate non-embeddability simply because substitutions are stochastic, and computational precision issues could produce a non-embeddable Q despite the underlying process being time-homogeneous.
The parametric bootstrap was used to test the null hypothesis, H0: the process can be embedded in a continuous chain and there is no violation of the time-homogeneity assumption (i.e. the mixed model produces the same model fit), versus the alternative hypothesis, H1: the process is not embeddable in a continuous chain and there is a violation of the time-homogeneity assumption (i.e. the discrete model has a better model fit than the mixed model). 1000 parametric bootstrap samples were simulated under the null hypothesis (H0) to establish the distribution of the test statistic for each alignment. This was carried out only for edges of alignments where a non-embeddable matrix was identified. The bootstrap testing scheme is outlined as follows (a code sketch follows the list):
Determine the logLR for the observed alignment as the difference between the log likelihood produced by the discrete model and the log likelihood produced by the mixed model (with a continuous process fitted for the edge that produced a non-embeddable matrix).
Generate 1000 bootstrapped datasets of the alignment under the null hypothesis (H0).
Calculate the logLR statistic for each bootstrap data set.
Calculate the proportion of times that the bootstrap logLR equals or exceeds the observed logLR.
Reject the null hypothesis, H0, when this proportion is less than 0.05 and conclude non-embeddability for the edge of the alignment tested.
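A schematic of the bootstrap loop is given below. The three callables are placeholders for the PyCogent model fitting and simulation actually used; only the control flow is intended to match the steps above, and the conventional factor of two in the statistic does not affect the bootstrap p-value.

```python
def parametric_bootstrap(aln, fit_discrete, fit_mixed, simulate_under_null,
                         n_reps=1000, alpha=0.05):
    """Return (p_value, reject_H0) for one tested edge of one alignment.

    fit_discrete(aln) and fit_mixed(aln) are assumed to return maximised log
    likelihoods; simulate_under_null(aln) is assumed to return an alignment
    simulated under the fitted mixed (H0) model.
    """
    observed_lr = 2 * (fit_discrete(aln) - fit_mixed(aln))

    exceed = 0
    for _ in range(n_reps):
        sim = simulate_under_null(aln)             # data generated under H0
        boot_lr = 2 * (fit_discrete(sim) - fit_mixed(sim))
        exceed += boot_lr >= observed_lr           # step 4: count exceedances

    p_value = exceed / n_reps
    return p_value, p_value < alpha                # step 5: reject H0 if small
```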
In addition to the negative control described above, a positive control was also implemented. The parameter estimates from a randomly selected alignment found to have a non-embeddable edge were used to generate 1000 bootstrapped samples (i.e. under the alternative hypothesis). Each sample was 1000 base pairs in length. These were then tested for evidence of non-embeddability by examining the matrix characteristics and using the parametric bootstrap. The number of simulated alignments identified as non-embeddable was then calculated to determine the power of the procedure to correctly classify alignments generated by a non-embeddable process.
Phylogenetic Reconstruction
One important aim when modelling sequence evolution is to establish the correct relationship between the sequences and construct an accurate phylogenetic tree. Evidence that a model fits the data better than an alternative model does not always translate into different results when constructing the most probable trees [31]. To assess whether incorrectly modelling a process as time-homogeneous (and therefore embeddable) has an effect on phylogenetic reconstruction, a fully general continuous model assuming local time-homogeneity for all edges and the discrete BH model were used to find the most probable tree using maximum likelihood. The two models were used as implemented in the PyCogent toolkit, with the continuous model (“General” model in PyCogent) having Q matrices for all edges set as independent to allow the assumption of local time-homogeneity (the default setting is global time-homogeneity). Datasets D1 and D4 were used. In the mammalian dataset (D1), 8005 alignments with sequences for the tetrad of mouse, rat, human and opossum were separated into codon positions. The second dataset contained the 4845 tetrads formed from dataset D4 for the 20 species and gene IF2. Each alignment at all codon positions had a minimum length of 300 bp, and the number of variable sites was required to be at least ten percent of the total number of sites. This was to limit the possibility of incorrectly finding differences between models caused by a lack of information. For each tetrad the most probable tree was predicted using each model. A difference in the predicted most probable trees would indicate that a violation of the assumption of local homogeneity (and therefore non-embeddability of the process) can cause biases in phylogenetic reconstruction.
The most probable tree was first estimated using the ML trex method [38] as implemented in PyCogent. In cases where the models predicted a different tree for the same tetrad, the optimisation of the models was checked by fitting the complete models for the two most probable trees in PyCogent. The total number of inconsistencies between the predicted phylogenies was then calculated for each codon position.
Results
Evidence of non-embeddability was found in all four datasets analysed. The number of non-embeddable matrices and the number of non-embeddable processes (where the null hypothesis was rejected in the parametric bootstrap) for each alignment and triad examined are shown in Tables 4–9. The assessment of the number and magnitude of negative off-diagonal elements of Q when testing for non-embeddable matrices revealed an extremely high number of very small negative elements. These were most likely due to precision and sampling, and thus a threshold of −0.1 for off-diagonal elements was used to declare non-embeddability for a matrix. This was an arbitrary threshold based on inspection of the results.
Table 4. Non-Embeddability – D1 Human, Rat, Mouse Triad (8394 Alignments).
STEPS a | ||||||||
Edge | Codon position | 1 | 2 | 3 | 4 | 5 | NE Matrices b | NE Processes c
Human | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0d | |
3 | 16 | 16 | 0 | 3 | 91 | 107 | 6 (5.6) | |
Mouse | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
3 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | |
Rat | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
3 | 0 | 0 | 0 | 0 | 2 | 2 | 0 |
a – Steps used to identify non-embeddability: 1. det(P) ≤ 0; 2. negative eigenvalues of P have odd algebraic multiplicity; 3. complex eigenvalues of P occur in non-conjugate pairs; 4. the eigenvalues of P lie outside the permitted region in the complex plane; 5. Q has negative off-diagonals (threshold −0.1). b – NE = non-embeddable. c – number of rejections of H0 from the parametric bootstrap scheme with a p-value < 0.05 (percentage of total tests). d – 1 alignment failed to find stable estimates.
Table 9. Non-Embeddability- D4 Microbial Protein Coding Gene (1140 Triads).
STEPS a | |||||||
Codon position | 1 | 2 | 3 | 4 | 5 | NE Matrices b | NE Processes c
1 | 0 | 0 | 0 | 0 | 2 | 2 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 574 | 591 | 0 | 470 | 1052 | 1122 | 27d |
a – Steps used to identify non-embeddability: 1. det(P) ≤ 0; 2. negative eigenvalues of P have odd algebraic multiplicity; 3. complex eigenvalues of P occur in non-conjugate pairs; 4. the eigenvalues of P lie outside the permitted region in the complex plane; 5. Q has negative off-diagonals (threshold −0.1). b – NE = non-embeddable. c – number of rejections of H0 from the parametric bootstrap scheme with a p-value < 0.05. d – 584 alignments failed to find stable estimates.
Table 5. Non-Embeddability – D1 Opossum, Rat, Mouse Triad (8014 Alignments).
STEPS a | ||||||||
Edge | Codon position | 1 | 2 | 3 | 4 | 5 | NE Matrices b | NE Processes c
Opossum | 1 | 0 | 0 | 0 | 0 | 4 | 4 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0d | |
3 | 117 | 119 | 0 | 26 | 638 | 777 | 43 (5.5) | |
Mouse | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
3 | 1 | 1 | 0 | 1 | 2 | 2 | 0 | |
Rat | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
3 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
a – Steps used to identify non-embeddability: 1. det(P) ≤ 0; 2. negative eigenvalues of P have odd algebraic multiplicity; 3. complex eigenvalues of P occur in non-conjugate pairs; 4. the eigenvalues of P lie outside the permitted region in the complex plane; 5. Q has negative off-diagonals (threshold −0.1). b – NE = non-embeddable. c – number of rejections of H0 from the parametric bootstrap scheme with a p-value < 0.05 (percentage of total tests). d – 1 alignment failed to find stable estimates.
Vertebrates
The results for the three mammalian triads from the nuclear protein-coding dataset (D1) are presented in Tables 4–6. The number of non-embeddable processes varied across the three triads, although non-embeddability almost always occurred at the third codon position. For the Eutherian triad, only a small number of non-embeddable matrices and processes were identified, on the human (outgroup) edge at the third codon position. For the mouse, rat and opossum triad, non-embeddability was again identified only on the outgroup edge (opossum) at the third position. The final triad of mouse, human and opossum had non-embeddable processes identified on all edges at the third codon position, and a single non-embeddable process was identified at the first position on the outgroup (opossum) edge. In the mammalian mitochondrial protein-coding data (D2), the mouse edge showed evidence of non-embeddability for a single alignment (Table 7).
Table 6. Non-Embeddability – D1 Opossum, Mouse, Human Triad (8194 Alignments).
STEPS a | |||||||||
Edge | Codon position | 1 | 2 | 3 | 4 | 5 | NE Matrices b | NE Processes c |
Opossum | 1 | 3 | 3 | 0 | 3 | 4 | 7 | 1d | |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
3 | 73 | 76 | 0 | 40 | 478 | 547 | 40 (7.3) | ||
Human | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
3 | 24 | 24 | 0 | 14 | 75 | 99 | 12 (12.1) | ||
Mouse | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
3 | 20 | 20 | 0 | 19 | 87 | 108 | 5 (4.6) |
a – Steps used to identify non-embeddability: 1. det(P) ≤ 0; 2. negative eigenvalues of P have odd algebraic multiplicity; 3. complex eigenvalues of P occur in non-conjugate pairs; 4. the eigenvalues of P lie outside the permitted region in the complex plane; 5. Q has negative off-diagonals (threshold −0.1). b – NE = non-embeddable. c – number of rejections of H0 from the parametric bootstrap scheme with a p-value < 0.05 (percentage of total tests). d – 1 alignment failed to find stable estimates.
Table 7. Non-Embeddability-D2 (Opossum) Mitochondrial Protein coding genes (11 Alignments).
STEPS a | ||||||||
Edge | Codon position | 1 | 2 | 3 | 4 | 5 | NE Matrices b | NE Processes c
Opossum | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
3 | 1 | 1 | 0 | 1 | 8 | 9 | 0d | |
Mouse | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
3 | 2 | 2 | 0 | 0 | 5 | 7 | 1d | |
Human | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
3 | 1 | 1 | 0 | 0 | 8 | 9 | 0d |
a – Steps used to identify non-embeddability: 1. det(P) ≤ 0; 2. negative eigenvalues of P have odd algebraic multiplicity; 3. complex eigenvalues of P occur in non-conjugate pairs; 4. the eigenvalues of P lie outside the permitted region in the complex plane; 5. Q has negative off-diagonals (threshold −0.1). b – NE = non-embeddable. c – number of rejections of H0 from the parametric bootstrap scheme with a p-value < 0.05 (percentage of total tests). d – 3 alignments failed to find stable estimates.
As found in the vertebrate protein-coding datasets, non-embeddability was identified in the primate intron dataset (D3). Non-embeddability was indicated on the macaque (outgroup) edge by inspecting the Q matrices for 16 of the 62 alignments analysed (see Table 8). Due to the time constraints and computational demands caused by use of the dinucleotide model, only 100 parametric bootstrap replicates were run. Consequently, significant evidence for non-embeddability was declared in the parametric bootstrap when there were fewer than ten replicates with a logLR greater than the original logLR (p-value < 0.1), as 1000 bootstraps are considered the minimum required to declare significance with a p-value of 0.05 [39]. The use of this slightly less stringent threshold allows for the fact that 100 samples are not fully representative of the true null distribution. Of the 16 alignments with non-embeddable matrices, 5 were also found to have nominally significant evidence of non-embeddability from the parametric bootstrap using the 0.1 p-value threshold. However, should a p-value of 0.05 be used, two alignments would still indicate non-embeddability with this limited number of samples. No evidence of non-embeddability was indicated on either of the other two edges in any alignments.
Table 8. Non-Embeddability – D3: Primate Introns Dinucleotide Model (62 alignments).
STEPS a | |||||||
Edge | 1 | 2 | 3 | 4 | 5 | NE Matrices b | NE Processes c
Macaque | 0 | 0 | 0 | 0 | 16 | 16 | 5 (31.3) |
Human | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Chimpanzee | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
a – Steps used to identify non-embeddability: 1. det(P) ≤ 0; 2. negative eigenvalues of P have odd algebraic multiplicity; 3. complex eigenvalues of P occur in non-conjugate pairs; 4. the eigenvalues of P lie outside the permitted region in the complex plane; 5. Q has negative off-diagonals (threshold −0.1). b – NE = non-embeddable. c – number of rejections of H0 from the parametric bootstrap scheme with a p-value < 0.1 (percentage of total tests).
Microbes
The microbial data provided the highest number of non-embeddable matrices, with 716 triads having all three edges showing evidence of non-embeddability (Table 9). A further 311 triads had two edges showing evidence of non-embeddability, and 95 triads had a single edge where a non-embeddable matrix was identified. However, only a total of 27 edges were found to have non-embeddable processes in the parametric bootstrap. Note that there were significant convergence issues during the parametric bootstrap, with the mixed model failing to find stable estimates for 584 triads. While the mixed model appeared to converge to a set of parameter estimates within the set number of iterations (100 K), repeating the model fit revealed different parameter estimates, indicating that the model may have been converging to different local maxima. In contrast, despite being given different starting parameters, the discrete model converged to the same parameter estimates. This failure of the mixed model to converge meant that the parametric bootstrap was unable to determine whether assuming a discrete process resulted in an improved model fit over the mixed model.
As this data set contained triads constructed for a single gene across multiple microbial species, a second gene, nusA (N utilization substance protein A), was similarly analysed to ensure that the results were not an artifact of this particular alignment. The non-embeddability analysis revealed almost identical results (779, 318 and 95 triads had non-embeddable matrices established on 3, 2 and 1 edges, respectively), indicating that the high number of non-embeddable matrices was not confined to the initial gene tested. The parametric bootstrap was not carried out for the alignments of this second gene due to the computational and time demands.
Phylogenetic Reconstruction
After establishing the existence of non-embeddable processes in sequence evolution, the focus changed to identifying possible consequences of modelling a process (or multiple processes) as embeddable where this assumption may not be valid. The results of the phylogenetic reconstruction analysis reveal that incorrectly modelling non-embeddable processes may have an impact on correct phylogenetic reconstruction. The results for each codon position and gene (D4) or species (D1) are shown in Table 10. Inconsistencies between the two models are again most prevalent at codon position 3. For tetrads formed using sequences in dataset D4, almost twenty percent of the tetrads tested had differing most probable trees at the third codon position. To test whether these high counts were an artifact of the gene and possibly atypical, tetrads formed from alignments of a second gene, nusA (N utilization substance protein A), for the same 20 microbial sequences were tested for differing most probable trees using the same approach. The results for nusA returned much lower differences but showed the same pattern, with the third codon position providing the most inconsistencies. The tetrads formed using the mammalian dataset (D1) also revealed this pattern, but at an extremely low level.
Table 10. Phylogenetic reconstruction results.
Codon Position | ||||
Species or Gene | 1 | 2 | 3 | Total Possible Tetrads or Alignments |
IF2a | 135 | 69 | 910 | 4845b |
nusAa | 85 | 67 | 135 | 4845b
Mammalian | 3 | 1 | 8 | 8005c |
a – Microbial tetrads for 20 species (nusA: N utilization substance protein A; IF2: translation initiation factor IF-2); b – total number of tetrads; c – total number of alignments.
Discussion
The purpose of identifying non-embeddability was to examine the validity of the common assumption of time-homogeneity by establishing cases where the assumption was violated. Violations were identified by establishing cases where non-embeddable matrices occurred and where a discrete model, making no assumptions about homogeneity, had a significantly better fit than a model assuming time-homogeneity. However, recently [40] demonstrated that under specific conditions there are instances where time-inhomogeneity can be accurately modelled by a time-homogeneous model.
For models that are multiplicatively closed, it has been demonstrated that it is possible for an inhomogeneous process to be precisely modelled as homogeneous [44]. They show that if a given Markov model (such as the continuous model used here) forms a Lie algebra, and if the process is time-inhomogeneous, e.g. described by two valid rate matrices Q1 and Q2 that are in the model (with parameters allowed to be in the complex field), then there exists a matrix Q̂ that can accurately describe the net process, such that exp(Q1 t1) exp(Q2 t2) = exp(Q̂ t). However, it is possible that Q̂ is not even stochastic if the rates have changed dramatically between Q1 and Q2 (as may be the case when modelling backward and forward time on the outgroup edge), or that Q̂ has parameters in the complex field (J. Sumner, personal communication, November 2012), meaning that in these instances the process will be non-embeddable. It is possible that there are alignments that have been generated by a time-inhomogeneous process but that, due to the multiplicative closure of the continuous model used, are embeddable. However, should a similar study be carried out using the General Time Reversible (GTR) model, a model which is not multiplicatively closed, then all occurrences of time-inhomogeneity would cause non-embeddability. This may result in increased occurrences of non-embeddability and increase the chances of incorrect inference under such a model.
Non-embeddability
Non-embeddability was identified for a number of the alignments examined, indicating that for these alignments a time-homogeneous continuous model could not accurately model the underlying time-inhomogeneous process. Non-embeddable processes were generally found to occur on the outgroup edge or on edges with a large time depth and, for protein-coding regions, at the third codon position. There was a clear difference between the number of non-embeddable matrices identified and the number of these subsequently classified as non-embeddable processes using the parametric bootstrap. Two possible causes are precision issues during the calculations, and that the true process is actually embeddable or can be modelled accurately by an embeddable process.
Precision issues could cause the declaration of a non-embeddable P when in fact the true process is embeddable. Precision issues were identified when examining Q for negative off-diagonals; many Q matrices with extremely small negative entries were found. It was determined that these were artificial and most likely caused by precision during calculation. Consequently, a Q was only declared to have a negative off-diagonal element if the off-diagonal had a magnitude greater than 0.1. This was an arbitrary threshold based only on the presumption that it would exclude the majority of negative elements caused by precision issues. However, precision issues are unlikely to be the major cause of the difference between the number of non-embeddable matrices and the number of non-embeddable processes identified by the parametric bootstrap. The positive control revealed that 97% of the alignments simulated using a non-embeddable matrix were correctly identified as being generated by a non-embeddable process. This demonstrates that the approach has considerable power (for the generating conditions) to correctly identify processes generated by non-embeddable matrices using the parametric bootstrap.
Examination of the differences between the P matrices produced assuming a discrete process (labelled here P_d) and assuming a continuous homogeneous process (P_c), for edges found to have a non-embeddable matrix, revealed two distinct features. The first was that the closer together the two matrices were in matrix space, the less likely the parametric bootstrap was to find significant evidence of a difference in model fit indicating non-embeddability. The Frobenius norm (see Appendix) was used to measure the distance between the matrices (i.e. ||P_d − P_c||_F). This distance was on average double for alignments where the parametric bootstrap found significant evidence of non-embeddability compared to those where no evidence was identified. For the opossum, mouse and human triad in the mammalian data set (D1), the average distance between P_d and P_c, when P_d was non-embeddable, was 0.222 for alignments found to have significant evidence of non-embeddability in the parametric bootstrap, compared to 0.117 for those without significant evidence. This suggests that if an embeddable P is close enough to the non-embeddable P in matrix space, then the embeddable P may be able to accurately model the process, explaining why the parametric bootstrap finds no significant difference in model fit. However, quantifying how close is close enough requires more investigation. This feature is also evident in Figure 3, which displays, for the mammalian data set (D1), the average difference P_d − P_c, where P_d is non-embeddable, for processes found to have significant evidence of being non-embeddable and for those without such evidence. White represents a larger transition probability in P_d and black a larger transition probability in P_c. The alignments found to be embeddable using the parametric bootstrap can be seen to have smaller magnitude differences between P_d and P_c.
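For reference, the Frobenius norm used to measure this distance is the entrywise Euclidean norm of the difference matrix (standard definition):

```latex
\lVert P_d - P_c \rVert_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} \left( [P_d]_{ij} - [P_c]_{ij} \right)^{2}}
```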
The second distinct feature is that for an alignment with a non-embeddable P and significant evidence of a difference in model fit, there appears to be an increased overall probability that a nucleotide will change to a different nucleotide. Examining the difference between the matrices (P_d − P_c) for non-embeddable processes shows that the P_c matrices are more diagonally dominant (see Figure 3): the P_d has larger off-diagonals (shown by larger white squares). As the rows of P must add to one, a non-embeddable P matrix implies, on average, an increased rate of nucleotide change.
For the observed non-embeddable matrices, we asked whether they are likely to have arisen from non-embeddable processes via the parametric bootstrap. (Note that for embeddable matrices the question could not be evaluated, even via parametric bootstrap, because the likelihood from the general continuous-time Markov model equals, within precision, that from the BH model.) Across our datasets, the percentage of non-embeddable matrices that were indicative of non-embeddable processes at 5% significance varied from 0 to 12.1%. The most notable excesses are shown in Tables 6 and 8. For example, in Table 6, of the 99 non-embeddable matrices on the human branch there was evidence of non-embeddable processes in 12 cases. This is in excess of the roughly 5% of cases in which we would expect to see significant results by chance alone. We conclude that non-embeddable processes exist amongst the cases where we observe non-embeddable matrices.
Time depth is a determining factor for identifying Non-Embeddability
Time depth was expected to be a factor in identifying non-embeddability. Biologically, the greater the time depth on an edge, the more likely a change in selection or mutation occurred, requiring multiple Q to model the process. Theoretically, when examining the matrix characteristics, for a continuous homogeneous process it is known that if P = e^{Qt} then det(P) = e^{t·tr(Q)}, where tr(Q) is the trace of Q (the sum of its diagonal entries, or of its eigenvalues). When t or |tr(Q)| becomes large, det(P) tends toward zero, which will result in a non-embeddable P. Hence, when t is large, non-embeddable P are more likely to be found. However, the number of matrices indicated as non-embeddable due to the determinant condition was significantly less than the number indicated by negative off-diagonals in Q (Tables 4–9).
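The determinant argument follows from Jacobi's formula for the determinant of a matrix exponential; because the diagonal entries of a valid Q are non-positive, det P(t) decays towards zero as elapsed time grows:

```latex
\det P(t) = \det\!\left(e^{Qt}\right) = e^{\,t\,\mathrm{tr}(Q)} \longrightarrow 0
\quad \text{as } t \to \infty, \qquad \text{since } \mathrm{tr}(Q) = \sum_i q_{ii} < 0 \text{ for any non-zero } Q .
```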
The impact of time depth on the occurrence of non-embeddability was primarily examined using the three mammalian triads from the vertebrate data set (D1). Time depth is shown to be a determining factor for identifying non-embeddability on both the ingroup and outgroup edges. This is clearly captured by examining the total number of non-embeddable processes for the three triads. An extremely low number of non-embeddable processes was identified for the Eutherian triad, which has the shortest time depth between both the ingroup edges and the outgroup, and the only evidence of non-embeddable processes was found on the outgroup edge. However, when opossum was used as the outgroup instead of human, the increased time depth between the ingroup edges and the outgroup resulted in an increased number of non-embeddable processes on the outgroup edge. The effect of deepening the time depth between ingroup taxa was examined using the opossum, human and mouse triad; this triad showed evidence of non-embeddable processes for a number of alignments on the ingroup edges.
The assertion that non-embeddability would be found on the outgroup edge is supported by datasets D1 and D3 (it was not examined in D4). Analysis of the primate intron data set (D3) showed evidence of non-embeddability only on the macaque (outgroup) edge, with non-embeddability indicated for 5 of the 62 alignments analysed. However, in the mitochondrial dataset (D2), non-embeddability was indicated at the third codon position for mouse, an ingroup taxon. The results for the mitochondrial dataset must be interpreted with caution due to the small number of alignments and the possibility that the results may be confounded by the heterogeneous mutation rate found across the mitochondrial genome and genes (e.g. [41]). This difference across sites, rather than across time, may confound the inferences and result in findings of non-embeddability on ingroup edges. However, if rate heterogeneity were causing false findings of non-embeddability across all datasets, then more evidence of non-embeddability would be expected at the second codon position, which has been reported to exhibit the most rate heterogeneity [26]. In all datasets considered there was no evidence of non-embeddability at the second position, suggesting that rate heterogeneity is unlikely to be causing false findings of violations of the time-homogeneity assumption in the other datasets.
It is worth noting that the transition matrices implied by existing approaches to rate heterogeneity are in fact non-embeddable. Several approaches to modelling rate heterogeneity have in common the specification of a transition matrix that is a weighted sum of other transition matrices. For instance, for the covarion model of Penny et al. [42], one could construct a transition matrix by marginalising the expanded (hidden-state) transition matrix. This matrix is non-embeddable in general because it is a rather complicated combination of the two sets of underlying transition matrices. For discrete rates-across-sites models (e.g. discrete Γ, [43]), the transition matrix is a weighted average of matrices e^{Qt}, where the Q's are scaled versions of each other. For the continuous version (the Γ distribution, e.g. [44]), the weighted sum is replaced by an integral. In both cases, the resulting transition matrices are not embeddable. These three classes of models are robust to non-embeddability of the same kind as that implied by the models, i.e. when the true process and the specified model are the same.
Mutation And Not Selection Drives Non-Embeddability
Evidence from the protein-coding datasets clearly shows that violations of the time-homogeneity assumption are codon position sensitive. Non-embeddability principally affects the third codon position across the protein-coding vertebrate and microbial datasets. If non-embeddability had been identified at codon position 2, this would likely indicate a change in the influence of selection, as all substitutions at codon position 2 cause amino-acid changes (non-synonymous substitutions). However, evidence of non-embeddability was found predominantly at the third codon position, the position that is least constrained by selection for amino acids. Consequently, the most likely cause is a change in the processes of mutagenesis. Additionally, evidence that non-embeddability also affects long (>50,000 bp) intron sequences (Table 8) further supports the notion that mutagenesis is the cause of the violation of the time-homogeneous assumption, resulting in evidence of non-embeddable processes.
Examination of the features of the mammalian alignments (D1) found that non-embeddable alignments showed a strong GC bias at the third codon position. The alignments whose matrices were indicated as non-embeddable had a significantly higher GC content at the third position than those indicated as embeddable (significant p-values when comparing the GC content of the two groups). This feature is displayed for the mouse, human and opossum triad in Figure 4. The reason for this finding is not completely understood, but it is likely that it is caused by biased gene conversion [45]. In addition, the GC content of alignments could be a candidate for driving mutagenesis. Observations that GC content is positively correlated with substitution rate [46], [47] suggest a link between regions of high GC and mutation rate.
Non-embeddability affects Phylogenetic Reconstruction
Violations of the assumption of time-homogeneity are shown to impact phylogenetic reconstruction. All codon positions appear to be affected; however, the third codon position appears the most problematic. These results, coupled with the triad non-embeddability outcomes, indicate that the third codon position is more susceptible to violations of the time-homogeneous assumption than either the first or second positions. To avoid possible violations of assumptions, the results indicate that using the second codon position may provide the best option. However, the degree of violation is dependent on the dataset; for example, in the mammalian data set there were very few inconsistencies. This low number of inconsistencies may be a result of strong signals within the data indicating the phylogenetic structure for this tetrad, meaning there is enough information for the models (despite assumption violations) to correctly estimate the phylogenetic relationships. This is not always the case, as demonstrated by the microbial dataset.
There has been considerable debate about the use of the third position versus the first and second positions for phylogenetic reconstruction, because the first and second codon positions are considered to show less homoplasy (similarity due to convergent evolution). It was initially accepted that slowly evolving nucleotide sites were phylogenetically more informative than more rapidly evolving ones, especially for recognising more ancient groupings; for this reason third codon positions are often regarded as less reliable. However, [48] reported that “contrary to earlier expectations, increasing saturation and frequency of change actually improve the ability to recognize well-supported phylogenetic groups”, and concluded that eliminating third positions from phylogenetic analysis is detrimental. [49] then reanalysed the same data and determined that, while using the first and second positions was a conservative approach, the phylogenetic groups supported by the first and second positions, even if fewer in number, were compatible with those supported by third positions. The results presented here suggest that sampling the second position, while conservative, would also avoid possible violations of the time-homogeneity assumption.
Computing Issues
Optimisation can be problematic with such richly parameterised models and so received specific attention during this study. A staged process beginning with a simpler model (e.g. a globally time-homogeneous model) was found to increase the likelihood of convergence within a set number of maximum evaluations (normally 100 K). However, as found in the microbial data set at the third codon position, convergence to a maximum within the set number of evaluations did not ensure that the estimates found were stable. The mixed model was found at times to converge to different parameter estimates despite identical starting parameters provided by the global continuous model. Although this changed the parameter estimates provided to the discrete model, the discrete model was always able to converge to the same stable estimates. The reason for this may be that for these alignments, especially at the third codon position, there was an extremely high amount of variation across the sequences (the proportion of variable sites was much higher than at the third position in the mammalian data set (D1)). The high number of variable sites appears to have resulted in the mixed model finding multiple local maxima, allowing the algorithm to exit before it reached the maximum number of evaluations. The discrete model was demonstrated to be more robust under these conditions.
Similarly, a lack of information in alignments also affected optimisation. Short alignments and alignments with few variable nucleotides were found to be more likely to have unstable estimates, as well as to indicate non-embeddability and discrepancies between models (e.g. in the mitochondrial data set (D2) when the sequence length was less than 150 bp). Alignments of reasonable length but with a low number of variable sites between species were found primarily at the first and second positions and also had unstable estimates.
Conclusion
Violations of the local time-homogeneity assumption, evident through findings of non-embeddability, have been shown to exist when modelling sequence evolution with Markov models. Low levels of non-embeddability were detected when examining individual edges of triads across a diverse set of alignments. A deeper time depth between taxa increased the probability of a process being non-embeddable, while the outgroup edge was also shown to be the most likely to require multiple instantaneous rate matrices (Q) to describe the underlying process. Subsequent phylogenetic reconstruction analyses demonstrated that non-embeddability could affect the correct inference of phylogenies. However, the occurrence of inconsistencies was low.
While violations of the time homogeneity assumption appear to have minimal impact in some datasets, the existence of non-embeddability and possibility of any violations should be considered when modelling any evolutionary process.
Supporting Information
Acknowledgments
We thank Jesse Zaneveld, Greg Caporaso and Rob Knight for assistance with sampling the microbial sequences. In addition, we thank Ben Kaehler, Brian Parker, Ben Murrell and an anonymous reviewer for comments on the manuscript.
Funding Statement
Research was funded by an ARC grant (Australian Research Council – http://www.arc.gov.au/) awarded to GAH and VBY. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Galtier N, Gouy M (1995) Inferring phylogenies from DNA sequences of unequal base compositions. Proceedings of the National Academy of Sciences 92: 11317–11321.
- 2. Lockhart PJ, Steel MA, Hendy MD, Penny D (1994) Recovering evolutionary trees under a more realistic model of sequence evolution. Molecular Biology and Evolution 11: 605–612.
- 3. Xia XH (1998) The rate heterogeneity of nonsynonymous substitutions in mammalian mitochondrial genes. Molecular Biology and Evolution 15: 336–344.
- 4. Foster PG, Hickey DA (1999) Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. Journal of Molecular Evolution 48: 283–290.
- 5. Foster PG (2004) Modeling compositional heterogeneity. Systematic Biology 53: 485–495.
- 6. Tamura K, Kumar S (2002) Evolutionary distance estimation under heterogeneous substitution pattern among lineages. Molecular Biology and Evolution 19: 1727–1736.
- 7. Lartillot N, Philippe H (2004) A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Molecular Biology and Evolution 21: 1095–1109.
- 8. Jermiin LS, Ho SYW, Ababneh F, Robinson J, Larkum AWD (2004) The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Systematic Biology 53: 638–643.
- 9. Bevan R, Bryant D, Lang B (2007) Accounting for gene rate heterogeneity in phylogenetic inference. Systematic Biology 56: 194–205.
- 10. Sheffield NC, Song H, Cameron SL, Whiting MF (2009) Nonstationary evolution and compositional heterogeneity in beetle mitochondrial phylogenomics. Systematic Biology 58: 381–394.
- 11. Song H, Sheffield NC, Cameron SL, Miller KB, Whiting MF (2010) When phylogenetic assumptions are violated: base compositional heterogeneity and among-site rate variation in beetle mitochondrial phylogenomics. Systematic Entomology 35: 429–448.
- 12. Barry D, Hartigan JA (1987) Statistical analysis of hominoid molecular evolution. Statistical Science 2: 191–210.
- 13. Schranz HW, Yap VB, Easteal S, Knight R, Huttley GA (2008) Pathological rate matrices: from primates to pathogens. BMC Bioinformatics 9: 550.
- 14. Elfving G (1937) Zur Theorie der Markoffschen Ketten. Acta Societatis Scientiarum Fennicae, n. Ser. A2 8: 1–17.
- 15. Johansen S (1973) Bang-bang problem for stochastic matrices. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 26: 191–195.
- 16. Johansen S (1974) Some results on the imbedding problem for finite Markov chains. Journal of the London Mathematical Society, Second Series 8: 345–351.
- 17. Frydman H, Singer B (1979) Total positivity and the embedding problem for Markov chains. Mathematical Proceedings of the Cambridge Philosophical Society 86: 339–344.
- 18. Johansen S, Ramsey FL (1979) Bang-bang representation for 3 × 3 embeddable stochastic matrices. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 47: 107–118.
- 19. Frydman H (1980) The embedding problem for Markov chains with three states. Mathematical Proceedings of the Cambridge Philosophical Society 87: 285–294.
- 20. Carette P (1998) Compatibility of multi-wave panel data and the continuous-time homogeneous Markov chain – an analysis of a continuous-time process by means of discrete-time longitudinal observations. Applied Stochastic Models and Data Analysis 14: 219–228.
- 21. Carette P (1999) Modelling hierarchical systems by a continuous-time homogeneous Markov chain using two-wave panel data. Journal of Applied Probability 36: 644–653.
- 22. Singer B, Spilerman S (1976) Representation of social processes by Markov models. American Journal of Sociology 82: 1–54.
- 23. Chang JT (1996) Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Mathematical Biosciences 137: 51–73.
- 24. Knight R, Maxwell P, Birmingham A, Carnes J, Caporaso JG, et al. (2007) PyCogent: a toolkit for making sense from sequence. Genome Biology 8: R171.
- 25. Ho SYW, Shapiro B, Phillips MJ, Cooper A, Drummond AJ (1999) Evidence for time dependency of molecular rate estimates. Systematic Biology 48: 86–93.
- 26. Bofkin L, Goldman N (2007) Variation in evolutionary processes at different codon positions. Molecular Biology and Evolution 24: 513–521.
- 27. Lindsay H, Yap V, Ying H, Huttley GA (2008) Pitfalls of the most commonly used models of context dependent substitution. Biology Direct 3: 1–17.
- 28. Kanehisa M, Goto S (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28: 27–30.
- 29. Powell MJD (1964) An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal 7: 155–162.
- 30. Goffe WL, Ferrier GD, Rogers J (1994) Global optimization of statistical functions with simulated annealing. Journal of Econometrics 60: 65–99.
- 31. Jayaswal V, Jermiin LS, Robinson J (2005) Estimation of phylogeny using a general Markov model. Evolutionary Bioinformatics 1: 62–80.
- 32. Jayaswal V, Jermiin LS, Poladian L, Robinson J (2011) Two stationary nonhomogeneous Markov models of nucleotide sequence evolution. Systematic Biology 60: 74–86.
- 33. Oscamou M, McDonald D, Yap VB, Huttley GA, Lladser ME, et al. (2008) Comparison of methods for estimating the nucleotide substitution matrix. BMC Bioinformatics 9: 511.
- 34. Kingman JFC (1962) The imbedding problem for finite Markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 1: 14–24.
- 35. Goodman GS (1970) An intrinsic time for non-stationary finite Markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 16: 165–180.
- 36. Runnenberg JT (1962) On Elfving's problem of imbedding a time-discrete Markov chain in a continuous time one for finitely many states. Proceedings Koninklijke Nederlandse Akademie van Wetenschappen, Ser. A Mathematical Sciences 65: 536–541.
- 37. Geweke J, Marshall RC, Zarkin GA (1986) Mobility indexes in continuous-time Markov chains. Econometrica 54: 1407–1423.
- 38. Wolf M, Easteal S, Kahn M, McKay BD, Jermiin LS (2000) TrExML: a maximum-likelihood approach for extensive tree-space exploration. Bioinformatics 16: 383–394.
- 39. Churchill GA, Doerge RW (1994) Empirical threshold values for quantitative trait mapping. Genetics 138: 963–971.
- 40. Sumner J, Fernandez-Sanchez J, Jarvis P (2012) Lie Markov models. Journal of Theoretical Biology 298: 16–31.
- 41. Pakendorf B, Stoneking M (2005) Mitochondrial DNA and human evolution. Annual Review of Genomics and Human Genetics 6: 165–183.
- 42. Penny D, McComish BJ, Charleston MA, Hendy MD (2001) Mathematical elegance with biochemical realism: the covarion model of molecular evolution. Journal of Molecular Evolution 53: 711–723.
- 43. Yang Z (1994) Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. Journal of Molecular Evolution 39: 306–314.
- 44. Tamura K, Nei M (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution 10: 512–526.
- 45. Galtier N, Piganeau G, Mouchiroud D, Duret L (2001) GC-content evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics 159: 907–911.
- 46. Smith NGC, Webster MT, Ellegren H (2002) Deterministic mutation rate variation in the human genome. Genome Research 12: 1350–1356.
- 47. Hardison R, Roskin K, Yang S, Diekhans M, Kent W, et al. (2003) Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Research 13: 13–26.
- 48. Källersjö M, Albert V, Farris JS (1999) Homoplasy increases phylogenetic structure. Cladistics 15: 91–93.
- 49. Goloboff PA, Carpenter JM, Arias JS, Esquivel DRM (2008) Weighting against homoplasy improves phylogenetic analysis of morphological data sets. Cladistics 24: 758–773.