Algorithms for Molecular Biology. 2019 Nov 18;14:23. doi: 10.1186/s13015-019-0158-3

TMRS: an algorithm for computing the time to the most recent substitution event from a multiple alignment column

Hisanori Kiryu 1, Yuto Ichikawa 2, Yasuhiro Kojima 1
PMCID: PMC6859643  PMID: 31832082

Abstract

Background 

As the number of sequenced genomes grows, researchers have access to an increasingly rich source for discovering detailed evolutionary information. However, the computational technologies for inferring biologically important evolutionary events are not sufficiently developed.

Results 

We present algorithms to estimate the evolutionary time (tMRS) to the most recent substitution event from a multiple alignment column by using a probabilistic model of sequence evolution. As the confidence in estimated tMRS values varies depending on gap fractions and nucleotide patterns of alignment columns, we also compute the standard deviation σ of tMRS by using a dynamic programming algorithm. We identified a number of human genomic sites at which the last substitutions occurred between two speciation events in the human lineage with confidence. A large fraction of such sites have substitutions that occurred between the concestor nodes of Hominoidea and Euarchontoglires. We investigated the correlation between tissue-specific transcribed enhancers and the distribution of the sites with specific substitution time intervals, and found that brain-specific transcribed enhancers are threefold enriched in the density of substitutions in the human lineage relative to expectations.

Conclusions 

We have presented algorithms to estimate the evolutionary time (tMRS) to the most recent substitution event from a multiple alignment column by using a probabilistic model of sequence evolution. Our algorithms will be useful for Evo-Devo studies, as they facilitate screening potential genomic sites that have played an important role in the acquisition of unique biological features by target species.

Electronic supplementary material

The online version of this article (10.1186/s13015-019-0158-3) contains supplementary material, which is available to authorized users.

Keywords: Phylogenetic trees, Comparative genomics, Probabilistic models

Background

As sequenced genomes continue to accumulate, they form an increasingly rich source for discovering detailed evolutionary information. The UCSC genome browser provides multiple genome alignments for 100 vertebrate species, including humans (the multiz100way track) [1–3].

In previous decades, multiple DNA alignments were often used to reconstruct species trees and ancestral nucleotide states [4], and many algorithms and software packages were developed for such purposes. Some of the most widely used algorithms include the Neighbor-Joining algorithm [5], the maximum likelihood method [4], and the Bayesian Markov chain Monte Carlo method [6]. These algorithms usually assume evolutionary models in which each nucleotide mutates stochastically over evolutionary time, and output the most consistent phylogenetic tree among the (2n-3)!! possible rooted or (2n-5)!! possible unrooted trees for n species. On the other hand, since the species tree of the 100 vertebrates in multiz100way has largely been resolved by previous studies [7], finding functional genomic sites, rather than determining the phylogenetic tree, has become a more important application of multiple genome alignments in recent years.
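To give a feel for how quickly the tree counts quoted above grow, the following minimal sketch evaluates (2n-3)!! and (2n-5)!! for labelled bifurcating trees. The function names are illustrative and not part of any published software.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * (k - 4) * ...; values k <= 1 give 1 by convention."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees with n labelled leaves, (2n - 3)!!."""
    if n < 2:
        raise ValueError("need at least two species")
    return double_factorial(2 * n - 3)


def count_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees with n labelled leaves, (2n - 5)!!."""
    if n < 3:
        raise ValueError("need at least three species")
    return double_factorial(2 * n - 5)


# Four species: 15 rooted and 3 unrooted topologies.
rooted_4 = count_rooted_trees(4)
unrooted_4 = count_unrooted_trees(4)
# The count for 100 species is an integer with well over 100 decimal digits.
digits_100 = len(str(count_rooted_trees(100)))
```

Python integers have arbitrary precision, so the count for 100 species can be computed exactly; its size illustrates why exhaustive tree search is infeasible.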

As it is difficult to identify functional regions by visual inspection of 100-species alignments, computing genome-wide summary statistics is very important. Measuring the strength of negative or positive selection is among the most popular analyses for screening functional regions of genomes [8–14]. These statistics are computed using probabilistic models of the stochastic process of DNA mutation along phylogenetic species trees, the same kind of models used in tree reconstruction [4–6], and they detect genomic regions with unusually low or high mutation rates using likelihood ratio tests or similar probabilistic computations.

Such statistics have advantages over simpler statistics that do not assume a particular evolutionary model, such as the nucleotide frequencies of alignment columns and pairwise mismatch rates. By using a phylogenetic tree, we can appropriately count the number of ancestral mutations that are widespread within extant species. Further, stochastic processes can account for multiple nucleotide mutations at the same site, whose effects are not negligible when we study evolutionarily distant species. However, conservation/divergence measures alone are not sufficient to extract all evolutionarily important events from the 4^100 potential nucleotide patterns of a 100-species alignment column.

In this study, we develop algorithms to compute three statistics, tMRS, σ, and q, for each column of a multiple genome alignment based on an evolutionary model that is similar to those described above. tMRS is the evolutionary time to the most recent substitution event that occurred along the lineage of a given target species in the phylogenetic tree. Since the confidence in estimated tMRS values varies markedly among alignment columns depending on gap fractions and the complexity of nucleotide patterns (see Fig. 1 for an explanation), we also compute the standard deviation σ of tMRS. Further, we compute the probability q that there is no mutation in the target lineage, because the estimated tMRS value has no meaning in such cases. By using q to filter out sites with a non-negligible probability of nucleotide conservation over the entire target lineage, we can remove highly conserved sites. By comparing tMRS with speciation time points, we can categorize sites by the groups of species that share mutation effects with the target species. Such detailed information is difficult to obtain from conservation measures. Our algorithms can be a very useful tool for screening the genomic sites that may have been involved in the acquisition of unique biological features by target species.

Fig. 1.

Time to most recent substitution tMRS. The panels schematically show situations that may affect the confidence of inferred tMRS values. The leaf nodes corresponding to the target species are indicated by rectangles. In the left panel, we expect that the last substitution occurred between nodes x and y, and tMRS will be between t1 and t1+t2. In the middle panel, the pattern of the alignment column is not simple, and the state of node y can be either A or G. Therefore, the inferred tMRS will have a large variance, ranging between t1 and t1+t2+t3. In the right panel, there is an ambiguous nucleotide in the column. In such cases, the inferred tMRS value is the same as that inferred from only three species, and the confidence will accordingly be lower than when all four nucleotides are known

In the next section, we describe our algorithms for computing tMRS and our data processing procedures. We first explain the tMRS algorithm on a single edge of a phylogenetic tree, and then generalize it to account for the entire tree. The algorithms for computing σ and q are described in Additional file 1, as they are very similar to that for tMRS. In the Results section, we empirically show the correctness of our algorithms by posterior sampling of mutation histories. We also show that our algorithm is fast enough to be applied to the entire human genome, and that the tMRS statistic is very different from other statistics for detecting evolutionary conservation/divergence of genomic sites. We then apply our algorithms to the multiz100way dataset and investigate the distributions of tMRS in different genomic contexts. In particular, we investigate the correlation between the tMRS distribution on bidirectionally transcribed enhancers and the tissue specificity of enhancer activities, and find that brain-specific transcribed enhancers are threefold enriched in the density of sites whose tMRS is located in the human lineage.

Method

We first derive formulas for tMRS and other variables for an edge of a phylogenetic tree, and then describe how to generalize them into statistics for the entire phylogenetic tree.

Single edge case

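A continuous-time Markov model for nucleotide sequence evolution can be defined by a differential equation that determines the time evolution of the probability of observing each nucleotide:

$$\frac{\partial}{\partial t}\, p(a \mid b, t) = \sum_{i \in \mathrm{Nuc}} R_{ai}\, p(i \mid b, t), \qquad p(a \mid b, 0) = \delta_{ab},$$

where $p(a \mid b, t)$ represents the probability of observing base $a$ at time $t$ conditioned on base $b$ being observed at time zero; $\mathrm{Nuc} = \{A, C, G, T\} = \{1, 2, 3, 4\}$ represents the set of nucleotides; $\delta_{ij}$ represents the Kronecker delta, which is 1 if $i = j$ and 0 otherwise; and $R = \{R_{ij}\}$ represents the substitution rate matrix. The solution is given by a matrix exponential, which can be numerically computed by using the eigenvalue decomposition of the rate matrix, $R = U \Lambda U^{-1}$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_4)$, as follows [15],

$$p(a \mid b, t) = \left[\exp(tR)\right]_{ab}, \qquad \exp(A) = 1 + \frac{A}{1!} + \frac{A^2}{2!} + \cdots, \qquad \exp(tR) = U e^{t\Lambda} U^{-1}, \qquad e^{t\Lambda} = \mathrm{diag}(e^{t\lambda_1}, \ldots, e^{t\lambda_4}).$$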
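As an illustration of the transition probabilities defined above, the sketch below computes exp(tR) for a 4×4 rate matrix in pure Python. Note the assumptions: it uses scaling and squaring with a truncated Taylor series rather than the eigenvalue decomposition described in the text, and it checks the result on a Jukes-Cantor rate matrix (all off-diagonal rates equal to a hypothetical value mu), which is chosen only because its transition probabilities have a simple closed form.

```python
import math


def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(A, squarings=10, terms=20):
    """Matrix exponential via exp(A) = (exp(A / 2**s)) ** (2**s) and a Taylor series."""
    n = len(A)
    scale = 2.0 ** squarings
    small = [[x / scale for x in row] for row in A]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds small**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_probabilities(R, t):
    """P[a][b] = p(a | b, t) = [exp(tR)]_{ab}."""
    return expm([[t * x for x in row] for row in R])


def jukes_cantor(mu):
    """Illustrative rate matrix: every substitution has rate mu, columns sum to zero."""
    return [[-3.0 * mu if i == j else mu for j in range(4)] for i in range(4)]


mu, t = 0.3, 0.8
P = transition_probabilities(jukes_cantor(mu), t)
# Closed-form Jukes-Cantor values used as a reference.
expected_same = 0.25 + 0.75 * math.exp(-4.0 * mu * t)
expected_diff = 0.25 - 0.25 * math.exp(-4.0 * mu * t)
```

Each column of P sums to one because the columns of R sum to zero, which matches the Markov condition used for the rate matrices in this article.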

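Similar to the scalar exponential function, a matrix exponential has an infinite product representation,

$$\left[\exp(tR)\right]_{ab} = \lim_{N \to \infty} \left[Q^N\right]_{ab} = \lim_{N \to \infty} \sum_{X \in \Omega_N(a,b)} Q_{X_N X_{N-1}} \cdots Q_{X_1 X_0},$$

where $Q = I + tR/N$. The matrix $Q$ satisfies the conditions of a transition matrix of a discrete Markov process for sufficiently large $N$, and our formula for tMRS can be derived via this connection to the discrete model. In the second equality, $\Omega_N(a,b)$ is the set of all paths $X$ along the discrete time points $0, \ldots, N$ such that $X = \{X_k \in \mathrm{Nuc} \mid k = 0, \ldots, N,\ X_N = a,\ X_0 = b\}$. Then, the summand of the second equality can be interpreted as the probability of the substitution history, $P(X_N, X_{N-1}, \ldots, X_1 \mid X_0)$. In the discrete model, the random variable $T_{\mathrm{MRS}}$ that represents the time to the most recent substitution is given by

$$\begin{aligned}
T_{\mathrm{MRS}} &= \sum_{l=1}^{N-1} \frac{lt}{N}\, I(X_N = \cdots = X_{N-l})\, I(X_{N-l} \neq X_{N-l-1}) + \frac{Nt}{N}\, I(X_N = \cdots = X_0) \\
&= \sum_{l=1}^{N} \frac{lt}{N}\, I(X_N = \cdots = X_{N-l}) - \sum_{l=1}^{N-1} \frac{lt}{N}\, I(X_N = \cdots = X_{N-l-1}) \\
&= \frac{t}{N} \sum_{l=1}^{N} I(X_N = \cdots = X_{N-l}),
\end{aligned}$$

where $I(\cdot)$ is the indicator function. Note that in the first line, we define $T_{\mathrm{MRS}} = t$ if path $X$ has no substitution at all. In the second line, we used $I(a \neq b) = 1 - I(a = b)$, and the two sums in the second line mostly cancel out to give the third line. Then, the expected value $t_{\mathrm{MRS}}$ of $T_{\mathrm{MRS}}$ is given by

$$t_{\mathrm{MRS}}(a,b,t) = E\left[T_{\mathrm{MRS}} \mid a,b,t\right] = \sum_{l=1}^{N} \frac{t}{N}\, P(X_N = \cdots = X_{N-l} \mid a,b,t) = \sum_{l=1}^{N} \frac{t}{N}\, \frac{\left[Q_D^{\,l}\, Q^{N-l}\right]_{ab}}{\left[Q^N\right]_{ab}},$$

where $Q_D$ is the diagonal part of $Q$.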
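The discrete formula can be evaluated directly for a finite N. The following minimal sketch does so for an arbitrary rate matrix whose columns sum to zero; the function names, the default N, and the Jukes-Cantor test matrix with a very small hypothetical rate are illustrative assumptions. With a very small rate, two limits are easy to anticipate: if a = b the path almost surely has no substitution, so the value approaches t, and if a ≠ b there is almost surely exactly one substitution at a uniformly distributed time, so the value approaches t/2.

```python
def tmrs_discrete(R, a, b, t, N=2000):
    """Finite-N value of sum_l (t/N) [Q_D^l Q^(N-l)]_ab / [Q^N]_ab with Q = I + tR/N."""
    n = len(R)
    Q = [[(1.0 if i == j else 0.0) + t * R[i][j] / N for j in range(n)]
         for i in range(n)]
    # powers[m] = [Q^m]_{ab}, obtained by repeatedly applying Q to the unit vector e_b.
    column = [1.0 if i == b else 0.0 for i in range(n)]
    powers = [column[a]]
    for _ in range(N):
        column = [sum(Q[i][k] * column[k] for k in range(n)) for i in range(n)]
        powers.append(column[a])
    z = powers[N]
    total = 0.0
    stay = 1.0
    for l in range(1, N + 1):
        stay *= Q[a][a]                 # [Q_D^l]_{aa}: no substitution in the last l steps
        total += stay * powers[N - l]   # arbitrary history in the first N - l steps
    return (t / N) * total / z


def jukes_cantor(mu):
    """Illustrative rate matrix with equal rates mu and zero column sums."""
    return [[-3.0 * mu if i == j else mu for j in range(4)] for i in range(4)]


R = jukes_cantor(1e-4)
same_base = tmrs_discrete(R, 0, 0, 1.0)   # close to t = 1
diff_base = tmrs_discrete(R, 0, 1, 1.0)   # close to t / 2
```

Because Q_D is diagonal, only the scalar Q_aa needs to be raised to the power l, which keeps the cost linear in N.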

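In order to take the continuum limit ($N \to \infty$), we use formulas such as

$$\sum_{l=1}^{N} \frac{1}{N}\, f\!\left(\frac{l}{N}\right) \to \int_0^1 f(s)\, ds, \qquad Q_D^{\,l},\ Q^{N-l} \to \exp(stR_D),\ \exp((1-s)tR), \qquad (s = l/N),$$

where $R_D$ is the diagonal part of rate matrix $R$. By using these formulas, tMRS can be computed as follows:

$$\begin{aligned}
t_{\mathrm{MRS}}(a,b,t) &= \frac{t}{Z} \left[\int_0^1 e^{stR_D}\, e^{(1-s)tR}\, ds\right]_{ab} = \frac{t}{Z} \sum_{i=1}^{4} U_{ai}\, U^{-1}_{ib}\, K\!\left(t (R_D)_{aa},\, t\lambda_i\right) \\
Z &= \left[e^{tR}\right]_{ab} = \sum_{i=1}^{4} U_{ai}\, e^{t\lambda_i}\, U^{-1}_{ib} \\
K(x,y) &= \int_0^1 e^{sx}\, e^{(1-s)y}\, ds =
\begin{cases}
\dfrac{e^x - e^y}{x - y} & \text{if } x \neq y \\
e^x & \text{if } x = y,
\end{cases}
\end{aligned} \tag{1}$$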

where $R = U \Lambda U^{-1}$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_4)$ is an eigenvalue decomposition of rate matrix $R$. The formulas for the standard deviation σ of $T_{\mathrm{MRS}}$ and the probability q of no substitution can be derived in a similar manner and are given by

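$$\begin{aligned}
\sigma(a,b,t) &= \sqrt{E\left[T_{\mathrm{MRS}}^2 \mid a,b,t\right] - t_{\mathrm{MRS}}^2(a,b,t)} \\
E\left[T_{\mathrm{MRS}}^2 \mid a,b,t\right] &= \frac{2t^2}{Z} \sum_{i=1}^{4} U_{ai}\, U^{-1}_{ib}\, K'\!\left(t (R_D)_{aa},\, t\lambda_i\right), \qquad K'(x,y) = \frac{\partial K(x,y)}{\partial x} \\
q(a,b,t) &= \frac{1}{Z}\, e^{t (R_D)_{aa}}\, \delta_{ab}.
\end{aligned}$$

The derivations of the above formulas are described in Additional file 1.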
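The closed-form expressions for tMRS, σ, and q on a single edge can be sketched in a few lines once the spectral sum over eigenvalues is available. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses a Jukes-Cantor rate matrix with a hypothetical rate mu, for which the sum over i of U_ai U^-1_ib f(λ_i) reduces to f(0)/4 + (δ_ab − 1/4) f(−4 mu). The second moment is computed as 2t² times the integral of s e^{sx} e^{(1−s)y} over s, which follows from E[T²] = ∫ 2τ P(T > τ) dτ and gives E[T²] = t² in the limit of no substitutions.

```python
import math


def K(x, y):
    """K(x, y) = integral over s in [0, 1] of exp(s x) exp((1 - s) y)."""
    d = x - y
    if d == 0.0:
        return math.exp(x)
    # exp(y) * expm1(d) / d equals (e^x - e^y) / (x - y) and is stable for small d.
    return math.exp(y) * math.expm1(d) / d


def K_dx(x, y):
    """Partial derivative of K with respect to x: integral of s exp(s x) exp((1 - s) y)."""
    d = x - y
    if abs(d) < 1e-3:
        # Series exp(y) * sum_k d^k / (k! (k + 2)) avoids cancellation near d = 0.
        series = 0.5 + d / 3.0 + d * d / 8.0 + d ** 3 / 30.0 + d ** 4 / 144.0
        return math.exp(y) * series
    return (math.exp(x) * (d - 1.0) + math.exp(y)) / (d * d)


def jukes_cantor_statistics(a, b, t, mu):
    """Return (t_MRS, sigma, q) on one edge of length t under a Jukes-Cantor model."""
    same = 1.0 if a == b else 0.0
    eigenvalues = (0.0, -4.0 * mu)      # the non-zero eigenvalue has multiplicity 3
    weights = (0.25, same - 0.25)       # summed U_ai U^-1_ib for each distinct eigenvalue
    x = -3.0 * mu * t                   # t times the diagonal rate R_aa

    def spectral_sum(f):
        return sum(w * f(t * lam) for w, lam in zip(weights, eigenvalues))

    z = spectral_sum(math.exp)                      # Z = [exp(tR)]_ab
    t_mrs = t * spectral_sum(lambda y: K(x, y)) / z
    second = 2.0 * t * t * spectral_sum(lambda y: K_dx(x, y)) / z
    sigma = math.sqrt(max(second - t_mrs ** 2, 0.0))
    q = same * math.exp(x) / z                      # probability of no substitution
    return t_mrs, sigma, q


stats_same = jukes_cantor_statistics(0, 0, 0.8, 0.3)
stats_diff = jukes_cantor_statistics(0, 1, 0.8, 0.3)
```

For a general rate matrix the two-term spectral sum would be replaced by the full sum over the four eigenvalues and eigenvectors of R; the functions K and K_dx stay the same.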

Strand symmetric rate matrix

Let $a^c$ be the complementary nucleotide of nucleotide $a$. A rate matrix $R$ is strand symmetric if it satisfies $R_{a^c b^c} = R_{ab}$ for all $a, b \in \mathrm{Nuc}$ [16]. Strand non-symmetric rate matrices, such as that of the general time reversible (GTR) model, generally produce different posterior expectation values if we take the complement of an alignment column. Since there is no specific strand direction in intergenic regions, and the existence of two different expectation values for a single genomic site complicates downstream analyses, we use the most general, 6-parameter strand symmetric rate matrix. Table 1 shows the parametrization of the rate matrices of the strand symmetric and GTR models. The parameters are optimized together with the edge lengths of the phylogenetic tree using the maximum likelihood method. We optimize the parameters using an L-BFGS-B gradient-based optimization package [17], where we compute the gradient of the likelihood function exactly using an inside-outside algorithm as described in Refs. [4, 18, 19].

Table 1.

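Rate parameters

$$R_{\mathrm{Symmetric}} =
\begin{pmatrix}
\cdot & \alpha & \beta & \gamma \\
\eta & \cdot & \delta & \epsilon \\
\epsilon & \delta & \cdot & \eta \\
\gamma & \beta & \alpha & \cdot
\end{pmatrix},
\qquad
R_{\mathrm{GTR}} =
\begin{pmatrix}
\cdot & \pi_A \alpha & \pi_A \beta & \pi_A \gamma \\
\pi_C \alpha & \cdot & \pi_C \delta & \pi_C \epsilon \\
\pi_G \beta & \pi_G \delta & \cdot & \pi_G \eta \\
\pi_T \gamma & \pi_T \epsilon & \pi_T \eta & \cdot
\end{pmatrix}$$

$R_{\mathrm{Symmetric}}$ and $R_{\mathrm{GTR}}$ represent the rate parameters of the strand symmetric and general time reversible (GTR) models, respectively. Matrix indices are ordered such that $i, j \in \{1, 2, 3, 4\} = \{A, C, G, T\}$. $\pi$ is the equilibrium distribution of the GTR model. Diagonal elements (shown as dots) are determined by the Markov condition $\sum_i R_{ij} = 0$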

Phylogenetic tree case

To extend our algorithm to phylogenetic trees, we specify a target species that corresponds to a leaf node of the tree and consider the path from that leaf node to the root node. Each internal node along the path corresponds to the last common ancestor (concestor) [20] of the target species and some extant species. Let $C = \{c_0, \ldots, c_M\}$ be the set of concestors, with $c_M$ being the root node and, for convenience, $c_0$ being the leaf node of the target. Further, let $s_k$ be the fraction of the path length that lies between the leaf and $c_k$, let $s_{kl} = s_l - s_k$, and let $\bar{t}$ be the total path length from the target leaf to the root. Then, the formula corresponding to Eq. 1 is obtained by dividing the integration range into sub-intervals between neighboring concestors and inserting the probabilities $\{\gamma_k\}$ of emitting the partial alignment columns that descend from the sister branch of each concestor (see Fig. 2),

$$\begin{aligned}
t_{\mathrm{MRS}} &= \frac{\bar{t}}{Z(Y)} \sum_{b} \sum_{k=1}^{M} \int_{s_{k-1}}^{s_k} \left[ e^{s_{01}\bar{t}R_D}\, \gamma_1 \cdots e^{(s - s_{k-1})\bar{t}R_D}\, e^{(s_k - s)\bar{t}R}\, \gamma_k \cdots e^{s_{M-1,M}\bar{t}R}\, \gamma_M \right]_{ab} ds\; \pi_b \\
Z(Y) &= P(Y) \\
\gamma_k &= \mathrm{diag}\left(\gamma(b_k, 1), \ldots, \gamma(b_k, 4)\right) \\
\gamma(b_k, i) &= \sum_j \alpha(b_k, j)\, p(j \mid i, t_{b_k}) \\
\alpha(n, i) &= P\left(Y(L(n)) \mid X_n = i\right),
\end{aligned} \tag{2}$$

where $Z(Y)$ represents the likelihood of alignment column $Y$; $\pi$ represents the equilibrium distribution for rate matrix $R$; $Y(L)$ represents the partial alignment column for a subset of leaf nodes $L \subseteq \mathcal{L}$ ($\mathcal{L}$: the set of all leaf nodes); $a = Y(c_0)$; $b_k$ represents the sibling node of $c_{k-1}$ with parent node $c_k$; $t_n$ represents the edge length between node $n$ and its parent node; $L(n)$ represents the descendant leaves of node $n$; and $X_n$ is the random variable that represents the nucleotide type at node $n$. The inside variable $\alpha(n, i) = P(Y(L(n)) \mid X_n = i)$ represents the probability of emitting the partial alignment column $Y(L(n))$ given that the state at node $n$ is fixed to $i$. See Fig. 2 for the relations between tree nodes and dynamic programming variables. Because the range of integration is localized in the $k$-th edge in the above equation, we can compute tMRS using a dynamic programming algorithm (Algorithm 1). In Algorithm 1, $p_D(j \mid i, t_{c_{k-1}}) = \left[\exp(tR_D)\right]_{ji}$ represents the probability of the transition $i \to j$ after time $t$ without any substitution. $\kappa(i,j)$ is defined by

$$\kappa(i,j) = \bar{t} \sum_{l} U_{il}\, U^{-1}_{lj}\, K\!\left(s_{k-1,k}\, \bar{t}\, (R_D)_{ii},\ s_{k-1,k}\, \bar{t}\, \lambda_l\right).$$

$\beta(n, i) = P\left(Y(\mathcal{L} \setminus L(n)),\, X_{\mathrm{Pa}(n)} = i\right)$ is called an outside variable and represents the probability of emitting the alignment nucleotides other than those of the descendants $L(n)$ of node $n$, with the constraint that the state of the parent node $\mathrm{Pa}(n)$ is fixed to $i$. The inside and outside variables are computed by using the inside and outside algorithms [4, 19], which resemble the forward-backward algorithms of linear hidden Markov models. $\alpha_D(c_k, i)$ is the probability of emitting the partial alignment column $Y(L(c_k))$ with no substitution along the target lineage up to concestor node $c_k$, given that the state of node $c_k$ is fixed to $i$.

Fig. 2.

Inside and outside variables. $c_k$ denotes the concestor nodes on the target lineage. $b_k$ denotes the sibling node of $c_{k-1}$. $\alpha(b_k, \cdot)$ represents the inside variable, while $\beta(c_k, \cdot)$ represents the outside variable. $\gamma(b_k, \cdot)$ represents a dynamic programming variable in Eq. 2 in the main text

Similar algorithms can be derived for the standard deviation σ and the probability of no mutation q as described in Additional file 1.

Alignment gaps and ambiguous characters

We treat gap and ambiguous nucleotide characters of non-target leaves as missing characters; that is, we sum the probabilities of all possible nucleotide patterns in the computation. Because the transition probabilities out of any state sum to one, the estimated values are then the same as those computed from the reduced phylogenetic tree and alignment column obtained by removing the gaps and ambiguous characters together with the corresponding edges of the tree. This increases the standard deviation σ of the tMRS estimates. On the other hand, we do not consider a site if the character of the target species is a gap or an ambiguous character.
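The equivalence between summing over a missing character and pruning the corresponding edge can be checked on a toy example with the inside (pruning) recursion. The sketch below is a minimal illustration under stated assumptions: the three-species tree, its branch lengths, and the Jukes-Cantor transition probabilities are all hypothetical and chosen only to keep the code short.

```python
import math

NUC = "ACGT"


def jc_prob(i, j, t, mu=1.0):
    """Jukes-Cantor probability of state j after time t given state i."""
    decay = math.exp(-4.0 * mu * t)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay


def inside(node, column):
    """Inside variables alpha(node, i) for i over A, C, G, T.

    A node is either a leaf name (str) or a list of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        base = column.get(node, "-")
        if base not in NUC:
            # Gap or ambiguous character: sum over all nucleotides (missing data).
            return [1.0] * 4
        return [1.0 if NUC[i] == base else 0.0 for i in range(4)]
    alpha = [1.0] * 4
    for child, length in node:
        child_alpha = inside(child, column)
        for i in range(4):
            alpha[i] *= sum(child_alpha[j] * jc_prob(i, j, length) for j in range(4))
    return alpha


def likelihood(root, column):
    """Column likelihood with the uniform Jukes-Cantor equilibrium distribution."""
    return sum(0.25 * a for a in inside(root, column))


# Hypothetical tree ((human:0.1, chimp:0.1):0.2, mouse:0.5) with a gap in chimp.
full_tree = [([("human", 0.1), ("chimp", 0.1)], 0.2), ("mouse", 0.5)]
# Reduced tree after removing chimp and merging the two edges above human.
reduced_tree = [("human", 0.3), ("mouse", 0.5)]

with_gap = likelihood(full_tree, {"human": "A", "chimp": "-", "mouse": "G"})
reduced = likelihood(reduced_tree, {"human": "A", "mouse": "G"})
```

The two likelihoods agree up to floating-point rounding, because the gapped leaf contributes a factor of one and the two consecutive edges above the remaining leaf compose into a single edge of the summed length.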

Software availability

We implemented our algorithms in the C++ language. The resulting software (‘TMRS’) is available at our website [21].

Dataset and data processing

We downloaded the MAF-formatted Multiz100way multiple alignment files from the UCSC genome browser site; these files consist of multiple genome alignments of 100 vertebrate species, including the human genome version hg38. We also downloaded the phylogenetic tree data from the PhyloP track, whose edge lengths were trained using fourfold degenerate (4d) sites of RefSeq genes under the general time reversible model.

We used the topology of the PhyloP phylogenetic tree as it is, and trained only the edge lengths of the tree and the rate parameters of the strand symmetric model. For this, we collected alignment columns at human 4d sites based on gene annotations of the RefGene track from the UCSC site, following Siepel et al. [8] and Pollard et al. [9]. The reason for using 4d sites is the higher quality of alignments and the higher coverage of distant species in the alignments [8, 9], though these sites may be subject to various evolutionary constraints. In order to investigate the uncertainty of the trained parameters, we randomly sampled 100 sets of 4d sites from about three million 4d sites in the human genome such that each set has a given number of sites, ranging from 1 to 10^5. For each set, we generated an alignment by concatenating the sampled genomic alignment columns, and trained the parameters by the maximum likelihood method [22], using the L-BFGS-B gradient-based optimization package [17].

For studying differences in tMRS distributions among genes, we sampled 100,000 alignment columns from intergenic, CDS, 3′UTR, and 5′UTR sequences based on ‘Gencode v24 Basic’ track gene models from the UCSC site [3].

Andersson et al. [23] identified genomic elements called transcribed enhancers in human and other genomes, where short RNAs are produced by bidirectional transcription as a result of chromatin opening. From the FANTOM5 enhancer atlas site [24], we downloaded the coordinates of transcribed enhancers and the list of tissue- and cell-specific enhancers, at which bidirectional transcription occurs in a tissue- and/or cell-specific manner.

Results and discussions

Parameter optimization and performance tests

We trained the rate matrix $\{R_{ij}\}$ and the tree edge lengths $\{t_k\}$ on genomic multiple alignment columns sampled from 4d sites. We trained 100 sets of parameters with random initial points on 100 sets of randomly sampled alignment columns. Figure 3 (top left) shows the distributions of the pairwise relative differences of the trained parameters $\theta = (\{R_{ij}\}, \{t_k\})$ for each number of alignment columns. Here, the relative difference between two parameter vectors $\theta_1$ and $\theta_2$ is defined by $|\theta_1 - \theta_2| / \max(|\theta_1|, |\theta_2|)$, with $|v|$ being the Euclidean norm. The trained parameters converge very well as the number of alignment columns increases. Figure 3 (top right) shows the Pearson correlation coefficient with the tree edge lengths provided in the PhyloP track of the UCSC genome browser, which were computed using the general time reversible model [9]. Concordant tree edge lengths (correlation coefficient >0.9) are learned despite the differences in rate matrix models. Figure 3 (bottom) shows the distributions of the tree path lengths from the human leaf node to its concestor nodes using parameters trained with 100,000 alignment columns. As the variance among training sets is very small, we use their mean values as the times to concestors before the present and do not consider the widths of the distributions. Table 2 shows the mean rate matrix and equilibrium distribution. The average transition-transversion rate ratio is about 2.7 in this model (see Section 6 in Additional file 1 for the computation). In the following results, we use the 100 sets of parameters that were trained on 100,000 alignment columns and take averages of the tMRS, σ, and q values computed for each parameter set.
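The relative difference used to assess convergence is simple to compute once each trained parameter set is flattened into a single vector. The following minimal sketch assumes that flattening (rate parameters followed by edge lengths) has already been done; the function names and the toy vectors are illustrative only.

```python
import math
from itertools import combinations


def euclidean_norm(v):
    """Euclidean norm |v| of a sequence of numbers."""
    return math.sqrt(sum(x * x for x in v))


def relative_difference(theta1, theta2):
    """|theta1 - theta2| / max(|theta1|, |theta2|) for two equal-length vectors."""
    if len(theta1) != len(theta2):
        raise ValueError("parameter vectors must have the same length")
    denominator = max(euclidean_norm(theta1), euclidean_norm(theta2))
    if denominator == 0.0:
        return 0.0
    difference = [x - y for x, y in zip(theta1, theta2)]
    return euclidean_norm(difference) / denominator


def pairwise_relative_differences(parameter_sets):
    """Relative differences for all unordered pairs of parameter sets."""
    return [relative_difference(p, q) for p, q in combinations(parameter_sets, 2)]


# Made-up parameter vectors standing in for three independent training runs.
runs = [[0.16, 0.57, 0.20], [0.17, 0.56, 0.21], [0.16, 0.58, 0.19]]
differences = pairwise_relative_differences(runs)
```

With 100 training runs this yields 4,950 pairwise values per sample size, which is the kind of distribution summarized in the convergence plot.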

Fig. 3.

Convergence of optimized parameters. The upper left panel shows the distributions of the pairwise relative differences of inferred parameters. The x-axis represents the number of alignment columns used to train the parameters. The upper right panel represents the distribution of the correlation coefficients of tree edge lengths between the PhyloP model of the UCSC genome browser site and the inferred parameters. The x-axis is the same as that shown in the upper left panel. The bottom panel represents the distributions of inferred time to each concestor from the present. The unit is the number of substitutions per site. Each parameter set is trained using 100,000 alignment columns sampled from 4d sites

Table 2.

Trained rate matrix and equilibrium distribution

Substitution type Parameter Rate
AC, TG α 0.16
AG, TC β 0.57
AT, TA γ 0.20
CG, GC δ 0.24
CT, GA ϵ 0.59
GT, CA η 0.25
Nucleotide Equilibrium frequency π
A, T 0.23
C, G 0.27

The elements of rate matrix RSymmetric and its equilibrium frequency π are shown. Parameter variables correspond to matrix RSymmetric in Table 1. We averaged the parameters optimized using 100,000 sampled alignment columns in the 4d sites. Due to the symmetry of rate matrix, complementary nucleotides have the same equilibrium frequency
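The tabulated values can be cross-checked against each other with a short calculation. The sketch below builds the strand symmetric rate matrix from the six rates in Table 2, verifies the strand symmetry condition, and estimates the equilibrium distribution. It assumes the matrix layout of Table 1 with the index order A, C, G, T and diagonal elements fixed by zero column sums; the step size and iteration count of the fixed-point iteration are arbitrary illustrative choices.

```python
def strand_symmetric_rate_matrix(alpha, beta, gamma, delta, epsilon, eta):
    """Six-parameter strand symmetric matrix (layout assumed from Table 1)."""
    R = [
        [0.0, alpha, beta, gamma],
        [eta, 0.0, delta, epsilon],
        [epsilon, delta, 0.0, eta],
        [gamma, beta, alpha, 0.0],
    ]
    for j in range(4):
        # Markov condition: each column sums to zero.
        R[j][j] = -sum(R[i][j] for i in range(4) if i != j)
    return R


def is_strand_symmetric(R, tol=1e-12):
    """Check R[a^c][b^c] == R[a][b] with complements A<->T and C<->G."""
    complement = [3, 2, 1, 0]
    return all(abs(R[complement[a]][complement[b]] - R[a][b]) <= tol
               for a in range(4) for b in range(4))


def equilibrium_distribution(R, step=0.1, iterations=5000):
    """Iterate pi <- (I + step * R) pi, which converges to the solution of R pi = 0."""
    pi = [0.25] * 4
    for _ in range(iterations):
        pi = [pi[i] + step * sum(R[i][k] * pi[k] for k in range(4)) for i in range(4)]
    total = sum(pi)
    return [p / total for p in pi]


# Rates alpha, beta, gamma, delta, epsilon, eta as listed in Table 2.
R = strand_symmetric_rate_matrix(0.16, 0.57, 0.20, 0.24, 0.59, 0.25)
symmetric = is_strand_symmetric(R)
pi = equilibrium_distribution(R)
```

Under these assumptions the balance condition reduces to (η + ϵ) π_A = (α + β) π_C, so the computed frequencies round to 0.23 for A and T and 0.27 for C and G, in agreement with the tabulated equilibrium distribution.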

In Fig. 4, we compared (tMRS, σ, q) computed by our algorithms with the corresponding values obtained from posterior sampling of mutation histories along the phylogenetic tree in order to numerically check the correctness of our algorithms. The relative errors between the two sets of values decrease monotonically as the sample size and the fineness of the discretization increase.

Fig. 4.

Numerical tests of our algorithms. The statistics tMRS, σ, and q computed by exact algorithms were compared with those estimated using sampled histories of nucleotide substitutions. The y-axes represent the relative difference between the values from the exact algorithms and those obtained by approximate sampling algorithms. The x-axes show the dependency on the number of sampled histories and the number of discrete points in the phylogenetic tree from which the states were sampled

Table 3 shows the runtimes of our C++ implementation. We used a single E5-2670 v3 2.3 GHz core as the computational platform. As the tMRS, σ, and q values of each alignment column can be computed independently, our algorithms can process the entire human genome in reasonable time using a compute cluster.

Table 3.

Runtime of our implementation

Computation Data size Runtime
Train, gradient (1 iteration) 100 K columns 4.6 min
Train, total (300 iterations) 100 K columns 23 h
tMRS, σ, q 1 column 7.3×10^-4 s
tMRS, σ, q 1 G columns 204 h

Runtimes of our implementation. We used 100-species vertebrate multiple alignments for the measurements. For training, we used a sampled alignment with 100 K columns from 4d sites. For the computation of tMRS, σ, and q, we used the sampled alignments from 3′UTR sequences, which have 2,034,681 alignment columns in total, and scaled the runtime to each data size

Comparison with other statistical measures

To show the significance of our algorithms, we compared their accuracy with that of two alternative methods of estimating tMRS and q. The first method (termed ‘reconstruction’) uses ancestral reconstruction. In this method, we first set the nucleotide state of each concestor node $c_k$ to the base $a_{c_k}$ with the maximal posterior probability:

$$a_{c_k} = \operatorname*{argmax}_{i}\, P(X_{c_k} = i \mid Y), \qquad P(X_{c_k} = i \mid Y) = \frac{1}{Z(Y)} \sum_j \alpha(c_k, i)\, p(i \mid j, t_{c_k})\, \beta(\mathrm{Pa}(c_k), j)$$

Then, we return the midpoint of the edge between nodes $c_{k-1}$ and $c_k$ as tMRS, where $c_k$ is the most recent concestor whose reconstructed nucleotide differs from that of the target species, $a_{c_k} \neq Y(c_0)$. We set q=1 if there is no such $c_k$, and q=0 otherwise. The second method (termed ‘alignment’) considers only the nucleotides of extant species: we return the midpoint of the edge between nodes $c_{k-1}$ and $c_k$ as tMRS, where $c_k$ is the most recent concestor such that the partial alignment column $Y(L(c_k))$ of its descendants contains a nucleotide different from that of the target, i.e., there exists $a \in Y(L(c_k))$ with $a \neq Y(c_0)$. As in the ‘reconstruction’ method, we set q=1 if there is no such $c_k$, and q=0 otherwise.

To compare the accuracy of our algorithm with that of these approximate algorithms, we generated substitution histories and the resulting alignment columns by forward simulation under the phylogenetic model of the previous section. We masked nucleotide positions at which there are gaps or ambiguous characters in sampled multiz100way alignments in order to imitate the gap patterns of real alignments. Details of the simulation algorithm are described in Section 6 of Additional file 1. As a result, we obtained 100,000 alignment columns of 100 species with ‘true’ annotations of tMRS and $q \in \{0, 1\}$.

Figure 5a shows the accuracy of predicting the absence of mutation along the target lineage. The x-axis is the fraction of positives in the dataset, which was controlled by varying the threshold of q. Since the ‘reconstruction’ and ‘alignment’ methods assign only binary q values, only a single point is plotted for each. The y-axis represents the ratio of false positives among all positive predictions (i.e., the False Discovery Rate, FDR). The FDR decreases monotonically with decreasing q threshold, indicating the correctness of our algorithm for q. The figure also indicates that the accuracies of the absence calls of the reconstruction and alignment methods are similar to those of our algorithm at positive fractions of 0.5 and 1.0, respectively. Figure 5b shows the mean errors of the predicted tMRS relative to the total length of the target lineage for each positive fraction. The error mostly decreases with stricter thresholds for our method, while the reconstruction and alignment methods show errors of more than 10% on average. Table 4 shows numerical values of the FDR and mean error for several q thresholds. Since the mean error of tMRS is less than 5% of the total length of the target lineage at threshold q=0.01, we use this threshold in the analyses in the following sections.

Fig. 5.

Effect of filtering. We investigated the effect of filtering by the q threshold on the accuracy of tMRS estimates using the simulation dataset. The x-axis represents the fraction of alignment columns remaining after filtering with a varying threshold. a Fraction of alignment columns in the positive set that have no mutation throughout the target lineage. b Mean % error of tMRS values in the dataset after filtering. The blue and green points represent the approximate tMRS and q computed from the reconstruction of ancestral states, and from the closest extant species whose base is different from that of the target species, respectively

Table 4.

Effects of filtering by probability q of no mutation

q threshold Positive fraction FDR for no mutation % error of tMRS
0.01 0.24 0.0013 4.4
0.1 0.36 0.015 7.9
0.5 0.68 0.16 15
1.0 1 0.33 15

We computed a few statistical measures for the simulation dataset obtained by forward sampling of base mutation histories. The first column represents the threshold values $q_{\mathrm{threshold}}$. ‘Positive fraction’ represents the fraction of alignment columns with $q < q_{\mathrm{threshold}}$. ‘FDR for no mutation’ represents the fraction of the alignment columns satisfying $q < q_{\mathrm{threshold}}$ that have no mutation along the target lineage. ‘% error of tMRS’ represents the mean % error of the estimated tMRS relative to the total edge length of the target lineage
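The two filtering statistics in the table can be computed from per-column estimates of q and the simulated ground truth as in the following minimal sketch. The function name and the toy input values are made up for illustration and are not taken from the simulation dataset.

```python
def filter_statistics(q_values, no_mutation, threshold):
    """Return (positive fraction, FDR for no mutation) for columns with q < threshold.

    q_values: estimated probability of no mutation for each column.
    no_mutation: True where the simulated history has no mutation on the target lineage.
    """
    if len(q_values) != len(no_mutation):
        raise ValueError("inputs must have the same length")
    if not q_values:
        raise ValueError("need at least one column")
    selected = [truth for q, truth in zip(q_values, no_mutation) if q < threshold]
    positive_fraction = len(selected) / len(q_values)
    # Among selected columns, those with no mutation at all are false positives.
    fdr = sum(selected) / len(selected) if selected else 0.0
    return positive_fraction, fdr


# Toy data: four columns, two of which pass a strict threshold.
toy_q = [0.001, 0.005, 0.2, 0.9]
toy_truth = [False, True, False, True]
strict = filter_statistics(toy_q, toy_truth, 0.01)
loose = filter_statistics(toy_q, toy_truth, 1.0)
```

Lowering the threshold trades coverage (the positive fraction) for a lower rate of columns whose tMRS estimate is meaningless because no substitution occurred.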

Table 5 shows the comparison of tMRS with other statistical measures computed from genomic alignments. We used the same alignment columns as in the previous paragraph, but restricted to those whose true q value satisfies q<1. For this dataset, we computed Spearman’s correlation coefficients between the true tMRS and the following indicators. tMRS(q<0.01,0.1,1) represents our algorithms with several filtering criteria on q. ‘reconstruction’ and ‘alignment’ are the approximate methods described above, with filtering based on the q values computed by the respective method. ‘entropy’ represents the information entropy of the base frequencies of the alignment column. ‘pairwise’ represents the fraction of identical base pairs among the n(n-1)/2 possible pairs of the n bases in the alignment column. ‘phastcons’ represents the posterior probability of a conserved region computed by PhastCons [8]. ‘phylop’ represents the negative p-value of conservation computed by PhyloP [9]. ‘gerp’ represents the estimated number of ‘rejected mutations’ computed by Gerp++ [10]. The table shows that the conservation measures (entropy, pairwise, phastcons, phylop, gerp) correlate only weakly with the true tMRS, whereas the estimated tMRS with the strict filtering criterion q<0.01 correlates very strongly. This indicates that our algorithms can accurately extract distinct evolutionary information that is difficult to extract with previous conservation measures.

Table 5.

Correlation with other conservation measures

Significance measure Spearman’s correlation with true tMRS
tMRS(q<0.01) 0.965
tMRS(q<0.1) 0.938
tMRS(q<1) 0.858
Reconstruction (q<1) 0.905
Alignment (q<1) 0.338
Entropy 0.301
Pairwise 0.344
Phastcons 0.112
Phylop 0.108
Gerp 0.129

We show Spearman’s correlation coefficients between the true tMRS obtained from simulation and several measures of nucleotide conservation. The first three rows represent tMRS computed by our exact algorithm with filtering by q values. ‘reconstruction’ represents approximate tMRS values estimated from the reconstruction of ancestral states. ‘alignment’ represents approximate tMRS values estimated from the closest extant species whose nucleotide differs from that of the target species. ‘entropy’ represents the negative information entropy of the base frequencies of the alignment column. ‘pairwise’ represents the pairwise alignment similarity of the alignment column. ‘phastcons’ represents the posterior probability of conservation at the alignment column. ‘phylop’ represents the p-values of negative selection. ‘gerp’ represents the ‘rejected substitution’ values
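The two model-free column statistics used in the comparison, ‘entropy’ and ‘pairwise’, can be sketched as follows. The details are assumptions made for illustration: gaps and ambiguous characters are ignored, the entropy is measured in bits, and no sign convention is applied to the entropy.

```python
import math
from collections import Counter


def known_bases(column):
    """Keep only unambiguous nucleotides; gaps and ambiguity codes are dropped."""
    return [c for c in column.upper() if c in "ACGT"]


def column_entropy(column):
    """Shannon entropy (in bits) of the base frequencies of an alignment column."""
    bases = known_bases(column)
    if not bases:
        raise ValueError("column has no unambiguous nucleotide")
    n = len(bases)
    return -sum((k / n) * math.log2(k / n) for k in Counter(bases).values())


def pairwise_identity(column):
    """Fraction of identical pairs among the n(n-1)/2 pairs of bases in the column."""
    bases = known_bases(column)
    n = len(bases)
    if n < 2:
        raise ValueError("need at least two unambiguous nucleotides")
    identical = sum(k * (k - 1) // 2 for k in Counter(bases).values())
    return identical / (n * (n - 1) // 2)


conserved = "AAAA"
mixed = "AAAC-N"
entropy_mixed = column_entropy(mixed)        # three A and one C remain after filtering
identity_mixed = pairwise_identity(mixed)    # 3 identical pairs out of 6
```

Both statistics depend only on the base counts of the column, so they cannot distinguish on which branch of the tree a substitution occurred, which is consistent with their weak correlation with the true tMRS.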

Genomic distribution of tMRS

We computed the time to the most recent substitution tMRS, its standard deviation σ, and the probability q that there is no substitution for alignment columns uniformly sampled from the human genome. The scatter plot of tMRS and q values (Fig. 6 (top left)) shows that the probability of no substitution q tends to increase with increasing tMRS. However, the distribution is broad, depending on the nucleotide patterns of the alignment columns, and a non-zero fraction of sites have deep ancestral substitutions (i.e., large tMRS and small q) within the Homo–Vertebrate lineage. The scatter plot of σ and q (Fig. 6 (top right)) shows that the probability of no substitution q is very small if σ<0.1. The high peaks (the red regions) of these two panels show that a large number of alignment columns have tMRS≈0.7, σ≈0.4, and q≈0.3. For these sites, it is difficult to determine whether there are substitutions within the interval of the Homo–Vertebrate lineage.

Fig. 6.

Distributions of tMRS, σ, and q in the human genome. The top panels show the sampling distributions of the statistics tMRS-q (left) and σ-q (right) in the human genome. In these panels, a total of 2,063,207 alignment columns were sampled from the human genome excluding repeat regions. The bottom panels show the densities of q (left) and tMRS (with q<0.01) (right) for several types of genomic region: CDS, 5′UTR, 3′UTR, Intron, and Intergenic

Figure 6 (bottom left) shows the density of q for each annotated genomic region. Compared to Intergenic, Intron, 3′UTR, and 5′UTR regions, CDS regions have a large fraction of sites with a high probability of no mutation, indicating many ancestral nucleotides that were fixed before the appearance of the vertebrate concestor. Since computed tMRS values have less meaning if q is large, we filtered out sites with q>0.01 and plotted the distributions of tMRS values for the remaining sites (Fig. 6 (bottom right)). There are several peaks because some sites are guaranteed to have experienced the last substitution within a specific interval between concestors. All regions have the highest peak around tMRS≈0.1, which lies between the Simiiformes and Primate concestors. CDS regions have a large peak around tMRS≈0.36, which lies between the Eutheria and Theria concestors.

Concestor interval of the last substitution event

We are generally interested in the substitutions that are associated with the evolution of unique features in the species that inherited them. In this respect, we want to know in which interval between two speciation events (i.e., between two concestor nodes) each tMRS is located. To simplify the presentation, in the following analyses of concestor intervals we reduced the concestor nodes from the full 19 concestors of the PhyloP tree to eight, as shown in Table 6 and Fig. 7. Since the estimated tMRS values can have a large standard deviation σ, we consider intervals between all pairs of concestor nodes: Homo–Hominoidea, Homo–Mammalia, Mammalia–Vertebrata, etc. Then, we assign a concestor interval I to a site if q<0.01 and if I is the smallest interval that contains the confidence interval [tMRS − 2σ, tMRS + 2σ]. Only about 4% of sites were assigned to any concestor interval by this method. Figure 8 (top) shows the frequency distribution of genomic sites that are assigned to some concestor interval, which shows that many sites are assigned to concestor intervals Hominoidea–Euarchontoglires, Hominoidea–Eutheria, or Homo–Euarchontoglires. Figure 8 (bottom) shows the same frequency distributions for each category of annotated genomic regions. The distributions, except that of CDS, are similar to each other. On the other hand, CDS regions have many deep ancestral intervals.
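The assignment rule above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name assign_interval, its parameters, and the clip option are hypothetical, and the concestor times are copied from Table 6. The text does not say how confidence intervals that extend beyond the present (time 0) or beyond the Vertebrata concestor are treated, so the sketch leaves such sites unassigned by default and offers clipping only as an optional assumption.

```python
from bisect import bisect_left, bisect_right

# Reduced concestors and their times (substitutions per site), copied from Table 6.
CONCESTORS = [
    ("Homo", 0.0),
    ("Hominoidea", 0.026),
    ("Euarchontoglires", 0.17),
    ("Eutheria", 0.22),
    ("Mammalia", 0.55),
    ("Amniota", 0.69),
    ("Tetrapoda", 0.80),
    ("Vertebrata", 1.1),
]
NAMES = [name for name, _ in CONCESTORS]
TIMES = [time for _, time in CONCESTORS]


def assign_interval(t_mrs, sigma, q, q_max=0.01, clip=False):
    """Return (late, early) concestor names of the smallest interval that
    contains [t_mrs - 2*sigma, t_mrs + 2*sigma], or None if unassigned.

    Assumption (not stated in the text): with clip=False, a confidence
    interval reaching outside [0, root time] is left unassigned; with
    clip=True it is truncated to that range first.
    """
    if q >= q_max:  # keep only sites with q < q_max
        return None
    lo, hi = t_mrs - 2.0 * sigma, t_mrs + 2.0 * sigma
    if clip:
        lo, hi = max(lo, TIMES[0]), min(hi, TIMES[-1])
    i = bisect_right(TIMES, lo) - 1  # latest concestor at or before lo
    j = bisect_left(TIMES, hi)       # earliest concestor at or after hi
    if i < 0 or j >= len(TIMES) or i == j:
        return None
    return NAMES[i], NAMES[j]


if __name__ == "__main__":
    print(assign_interval(0.1, 0.03, 0.001))
    print(assign_interval(0.7, 0.4, 0.001, clip=True))
```

Because the concestor times are sorted, the smallest containing interval is found by two binary searches rather than by enumerating all pairs of concestors.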

Table 6.

Evolutionary time of reduced concestors

Concestor         Time   Sibling        Descendants
Homo              0      -              Human
Hominoidea        0.026  Hylobatidae    Gibbon
Euarchontoglires  0.17   Glires         Mouse, rabbit
Eutheria          0.22   Atlantogenata  Elephant, armadillo
Mammalia          0.55   Prototheria    Platypus
Amniota           0.69   Sauropsida     Bird, reptile
Tetrapoda         0.80   Amphibia       Frog
Vertebrata        1.1    Cyclostomata   Lamprey

We used the following set of reduced concestors in the analyses of concestor intervals. Each concestor was named based on the corresponding taxonomic class of descendants. ‘Time’ represents phylogenetic time from the present in units of substitutions per site. ‘Sibling’ represents the sibling clade that departed from the human lineage at each concestor. ‘Descendants’ represent some examples of extant species in the sibling clade. Topological relationships are shown in Fig. 7

Fig. 7.

Topological relationship of reduced concestors. We show the topology of the simplified phylogenetic tree of 100 vertebrate species used in the analyses of concestor intervals. See Table 6 for the numerical values of evolutionary time

Fig. 8.

Frequency of concestor intervals. The two panels show the frequencies of genomic sites categorized by the concestor intervals where their most recent substitutions occurred. The left panel shows the genomic distribution. The axes represent late (x-axis) and early (y-axis) ends of intervals. The right panel shows distributions for several types of genomic region: CDS, 5′UTR, 3′UTR, Intron, Intergenic, and Transcribed Enhancer. Only intervals with non-zero counts are shown in this panel

Tissue-concestor interval correlations for transcribed enhancers

Andersson et al. [23] identified genomic elements called transcribed enhancers in the human genome and other genomes where short RNAs are produced by bidirectional transcription as a result of chromatin opening. They showed that transcribed enhancers often overlap with protein-binding marks such as ChIP-seq peaks or protein-binding motifs. They are also enriched in disease-associated single nucleotide polymorphisms (SNPs). Many transcribed enhancers are tissue-specific in that bidirectional transcription of short RNAs occurs frequently in specific tissues. They showed that the expression of a number of genes is well explained by that of a few transcribed enhancers upstream of the genes. Thus, we can see tissue-specific enhancer activities for these transcribed enhancers. In the FANTOM5 enhancer atlas site [24], tissue-specific enhancers are annotated by using the UBERON tissue anatomy ontology and Cell Ontology [25, 26]. For example, 41 diverse tissues were assigned to 10–1335 differentially expressed enhancers (see Additional file 1: Table S2) [24]. Using these data, we studied the association between tissue and the concestor interval of the last substitution event as an example of screening evolutionarily important events that affected life designs of extant organisms. We computed (tMRS, σ, q) for each site of the transcribed enhancer regions, filtered out the sites with q>0.01, and associated concestor intervals as described above. For each concestor interval, we list the enhancers that contain sites associated with the interval. We used the hypergeometric test to determine if the sites corresponding to a specific concestor interval are significantly enriched for the enhancers transcribed in a specific tissue type. Table 7 shows tissues that have the top five most significant p-values for some concestor interval (more details are discussed in Section 7 of Additional file 1).
We find that the association between brain and the Homo–Vertebrata interval has the most significant p-value, and Homo–Vertebrata sites are enriched threefold in brain-associated enhancers relative to expectations. The second tissue was meninx, which is also associated with the nervous system (Table 7). Figure 9 shows a few sampled alignment columns in a brain-specific enhancer, which are assigned to the Homo–Vertebrata interval. Alignment columns that have three or more nucleotides suggest there are some substitutions along the Homo–Vertebrata lineage, but the patterns of nucleotide types and the number of gaps make it difficult to determine at what time point the substitution occurred. Thus, the assigned intervals are the most ambiguous for these alignment columns.
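As an illustration of the enrichment statistics reported in Table 7, the following Python sketch computes an upper-tail hypergeometric p-value, a fold enrichment, and a Z-score from four counts. It is a sketch under stated assumptions, not the authors' code: the paper used the phyper() function in R, whereas this stand-in uses only the Python standard library. The function name, the mapping of counts to enhancers, and the Z-score definition (observed minus hypergeometric mean, divided by the hypergeometric standard deviation) are all assumptions; the Z-scores actually used are described in Section 8 of Additional file 1.

```python
from math import comb, log10, sqrt


def hypergeom_enrichment(N, K, n, k):
    """Upper-tail hypergeometric test for overlap enrichment.

    Assumed mapping (hypothetical, for illustration only):
      N: total number of transcribed enhancers considered
      K: enhancers specific to the tissue
      n: enhancers containing sites assigned to the concestor interval
      k: enhancers with both attributes (the 'Observed' column)
    """
    # P(X >= k); the analogous R call would be
    # phyper(k - 1, K, N - K, n, lower.tail = FALSE).
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    p_value = tail / comb(N, n)

    mean = n * K / N  # expected overlap under random sampling
    var = n * (K / N) * (1 - K / N) * (N - n) / (N - 1)
    return {
        "p_value": p_value,
        "neg_log10_p": -log10(p_value) if p_value > 0 else float("inf"),
        "fold": k / mean,               # fold enrichment relative to expectation
        "z": (k - mean) / sqrt(var),    # assumed Z-score definition
    }


if __name__ == "__main__":
    # Toy counts, not data from the paper.
    print(hypergeom_enrichment(N=10, K=5, n=5, k=5))
```

The exact integer arithmetic of math.comb avoids rounding error in the tail sum, at the cost of speed for very large counts; for genome-scale inputs a library routine working in log space would be the more practical choice.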

Table 7.

Tissue-concestor interval correlation for transcribed enhancers

Tissue  Interval               Z-score  -log10(p-value)  Enrichment  Observed
Brain   Homo–Vertebrata        14.2     32.0             2.97        140
Brain   Hominoidea–Tetrapoda   13.3     29.2             2.69        148
Brain   Hominoidea–Vertebrata  11.5     22.1             2.17        115
Meninx  Hominoidea–Tetrapoda   8.58     11.7             3.84        32
Meninx  Hominoidea–Vertebrata  6.68     7.72             3.52        23
Meninx  Hominoidea–Amniota     6.06     6.96             3.01        25
Eye     Hominoidea–Vertebrata  8.43     11.5             3.50        37
Eye     Eutheria–Tetrapoda     8.42     7.44             9.69        9
Eye     Eutheria–Vertebrata    8.06     8.98             5.13        19

The top three tissues that have transcribed enhancers with sites corresponding to specific concestor intervals are shown. Within each tissue, the top three concestor intervals are shown. The sorting order is based on Z-scores that are based on the hypergeometric test and indicate the significance of enrichment of specific concestor intervals in tissue-specific enhancers. ‘-log10(p-value)’ is the minus log10 p-value of the test computed using the phyper() function in the R programming language. ‘Enrichment’ is the fold enrichment within the concestor interval relative to the expected occurrence by random sampling. ‘Observed’ is the number of transcribed enhancers that have both attributes of the Tissue and Interval columns. See Section 8 in Additional file 1 for the use of Z-scores for ranking tissues

Fig. 9.

Examples of alignment columns. The figure shows examples of alignment columns that are assigned to the concestor interval Homo–Vertebrata in the transcribed enhancer regions and show brain-specific RNA transcription. The y-axis represents nine example alignment columns and the x-axis represents the nucleotides of each column, in which gaps, ambiguous nucleotides, and unaligned regions are shown as blank. The species are ordered to conform to the phylogenetic tree and sorted such that species more evolutionarily distant from humans are placed on the right

Tissue-concestor interval correlations for genes

We studied the correlation between tissue specificity and concestor intervals for genes in a similar manner as for transcribed enhancers. See Section 9 in Additional file 1 for a detailed description of the method. Table 8 shows the top three tissues that have genes with sites corresponding to specific concestor intervals. Within each tissue, the top three concestor intervals are shown. Compared to the corresponding Table 7 for transcribed enhancers, deeply ancestral intervals appear in the table, indicating the high level of conservation of exonic sequences. On the other hand, the fold enrichments of concestor intervals are smaller than in transcribed enhancers, which makes it more difficult to infer the impact of the most recent mutations on the life design of extant species than in the case of transcribed enhancers.

Table 8.

Tissue-concestor interval correlation for genes

Tissue        Interval                   Z-score  -log10(p-value)  Enrichment  Observed
Muscle        Eutheria–Tetrapoda         6.44     11.3             1.26        250
Muscle        Eutheria–Amniota           6.23     11.9             1.18        287
Muscle        Eutheria–Mammalia          5.22     10.8             1.09        307
Artery aorta  Eutheria–Tetrapoda         5.73     9.76             1.36        122
Artery aorta  Eutheria–Amniota           4.22     6.20             1.18        131
Artery aorta  Euarchontoglires–Eutheria  4.16     5.11             1.33        99
Pineal gland  Euarchontoglires–Eutheria  5.50     8.05             1.28        217
Pineal gland  Eutheria–Amniota           5.45     9.04             1.15        290
Pineal gland  Eutheria–Tetrapoda         5.21     7.60             1.21        247

The top three tissues that have genes with sites corresponding to specific concestor intervals are shown. Within each tissue, the top three concestor intervals are shown. The sorting order is based on Z-scores that are based on the hypergeometric test and indicate the significance of enrichment of specific concestor intervals in tissue-specific genes. ‘-log10(p-value)’ is the minus log10 p-value of the test computed using the phyper() function in the R programming language. ‘Enrichment’ is the fold enrichment within the concestor interval relative to the expected occurrence by random sampling. ‘Observed’ is the number of Entrez genes that have both attributes of the tissue and interval columns

Conclusions

We have developed algorithms to infer the time tMRS to the most recent substitution in the lineage from a given target species to the root of a phylogenetic tree. In order to filter out highly conserved sites and ambiguous sites where the confidence of estimated tMRS is low, we also compute the probability q of no mutation and the standard deviation σ of tMRS. We computed these variables efficiently using dynamic programming algorithms on the phylogenetic tree such that the algorithms can be applied to multiple genomic alignments with 100 species. We have empirically checked the correctness of our algorithms by posterior sampling of mutation histories on the tree. Our algorithms are exact under the assumptions of the model: genome evolution follows a site-independent continuous-time Markov process along the phylogenetic tree. Our results also depend on the quality of the Multiz alignment, which was debated previously [27]. Although alignment errors can be less influential if the corresponding leaf nodes are far from the target lineage, the incomplete coverage of sequenced genomes directly affects the number of sites whose tMRS can be determined with confidence. We expect that the number of sites with confident tMRS values will increase as the coverage of genome sequences improves in the future.

We have applied our tool to 100-species multiple genome alignments with human as the target and obtained a frequency spectrum of concestor intervals that categorizes the time points at which the last substitutions occurred. Furthermore, we studied the correlation between the frequency of concestor intervals and the tissue specificity of transcribed enhancers and found that brain-specific transcribed enhancers are highly enriched among the sites with mutations in the human lineage. It would be interesting to combine our method with genome editing experiments to see if nucleotide changes at the screened sites affect tissue functions.

Supplementary information

13015_2019_158_MOESM1_ESM.pdf (78.1KB, pdf)

Additional file 1. Detailed description of TMRS algorithms.

Acknowledgements

This work was supported by JSPS KAKENHI [Grant Numbers 16H01532, 17K00398] (H.K.).

Authors' contributions

HK designed the project, developed the algorithms, and wrote the manuscript. YI and YK contributed to the development of algorithms and their implementation and computational experiments in the early stages of the study. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Hisanori Kiryu, Email: kiryu-h@k.u-tokyo.ac.jp.

Yasuhiro Kojima, Email: kojima_yasuhiro_15@stu-cbms.k.u-tokyo.ac.jp.

Supplementary information

Supplementary information accompanies this paper at 10.1186/s13015-019-0158-3.

References

  • 1. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14(4):708–715. doi: 10.1101/gr.1933104.
  • 2. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res. 2002;12(6):996–1006. doi: 10.1101/gr.229102.
  • 3. UCSC Genome Browser. http://genome.ucsc.edu/. Accessed 15 Jun 2018.
  • 4. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981;17(6):368–376. doi: 10.1007/BF01734359.
  • 5. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4(4):406–425. doi: 10.1093/oxfordjournals.molbev.a040454.
  • 6. Huelsenbeck JP, Ronquist F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. 2001;17(8):754–755. doi: 10.1093/bioinformatics/17.8.754.
  • 7. Murphy WJ, Eizirik E, O’Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E, Ryder OA, Stanhope MJ, de Jong WW, Springer MS. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science. 2001;294(5550):2348–2351. doi: 10.1126/science.1067179.
  • 8. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15(8):1034–1050. doi: 10.1101/gr.3715005.
  • 9. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010;20(1):110–121. doi: 10.1101/gr.097857.109.
  • 10. Cooper GM, Stone EA, Asimenos G, Green ED, Batzoglou S, Sidow A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005;15(7):901–913. doi: 10.1101/gr.3577405.
  • 11. Garber M, Guttman M, Clamp M, Zody MC, Friedman N, Xie X. Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics. 2009;25(12):54–62. doi: 10.1093/bioinformatics/btp190.
  • 12. Gu X, Fu YX, Li WH. Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol Biol Evol. 1995;12(4):546–557. doi: 10.1093/oxfordjournals.molbev.a040235.
  • 13. Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 1994;39(3):306–314. doi: 10.1007/BF00160154.
  • 14. Siepel A, Pollard KS, Haussler D. New methods for detecting lineage-specific selection. In: Apostolico A, Guerra C, Istrail S, Pevzner PA, Waterman M, editors. Research in computational molecular biology. Berlin: Springer; 2006.
  • 15. Yang Z. Computational molecular evolution. Oxford: Oxford University; 2006.
  • 16. Karro JE, Peifer M, Hardison RC, Kollmann M, von Grunberg HH. Exponential decay of GC content detected by strand-symmetric substitution rates influences the evolution of isochore structure. Mol Biol Evol. 2008;25(2):362–374. doi: 10.1093/molbev/msm261.
  • 17. Zhu C, Byrd RH, Nocedal J. L-BFGS-B: Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization. ACM Trans Math Softw. 1997;23(4):550–560.
  • 18. Siepel A, Haussler D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. 2004;21(3):468–488. doi: 10.1093/molbev/msh039.
  • 19. Kiryu H. Sufficient statistics and expectation maximization algorithms in phylogenetic tree models. Bioinformatics. 2011;27(17):2346–2353. doi: 10.1093/bioinformatics/btr420.
  • 20. Dawkins R. The Ancestor’s tale. London: Weidenfeld and Nicolson; 2004.
  • 21. TMRS Software. https://github.com/hmatsu1226/SCODE. Accessed 15 Jun 2018.
  • 22. Fisher R. On the mathematical foundations of theoretical statistics. Philos Trans R Soc Lond Ser A. 1922;222:309–368.
  • 23. Andersson R, et al. An atlas of active enhancers across human cell types and tissues. Nature. 2014;507(7493):455–461. doi: 10.1038/nature12787.
  • 24. FANTOM5 human enhancer tracks. http://slidebase.binf.ku.dk/human_enhancers/. Accessed 15 Jun 2018.
  • 25. Mungall CJ, Torniai C, Gkoutos GV, Lewis SE, Haendel MA. Uberon, an integrative multi-species anatomy ontology. Genome Biol. 2012;13(1):5. doi: 10.1186/gb-2012-13-1-r5.
  • 26. Bard J, Rhee SY, Ashburner M. An ontology for cell types. Genome Biol. 2005;6(2):21. doi: 10.1186/gb-2005-6-2-r21.
  • 27. Frith MC, Park Y, Sheetlin SL, Spouge JL. The whole alignment and nothing but the alignment: the problem of spurious alignment flanks. Nucleic Acids Res. 2008;36(18):5863–5871. doi: 10.1093/nar/gkn579.

Articles from Algorithms for Molecular Biology : AMB are provided here courtesy of BMC
