Detecting Positively Selected Sites From Amino Acid Sequences: An Implicit Codon Model

Zheng Ouyang; Jie Liang

doi:10.1109/IEMBS.2007.4353538

Author manuscript; available in PMC: 2012 Jul 27.

Published in final edited form as: Conf Proc IEEE Eng Med Biol Soc. 2007;2007:5302–5306. doi: 10.1109/IEMBS.2007.4353538


PMCID: PMC3406752 NIHMSID: NIHMS391030 PMID: 18003204

Abstract

Fixation of advantageous mutations is an important evolutionary force driving accelerated protein diversification. However, the standard phylogenetic approach to inferring positive selection is based on the relative rate of nonsynonymous to synonymous substitutions and requires knowledge of the DNA sequences, which precludes its application to families of remotely related sequences where saturated substitutions occur. In this study, we develop a new method to detect positive selection directly from amino acid sequences by treating codon usage as hidden parameters.

For a given amino acid sequence set and a phylogenetic tree, we use a reversible continuous time Markov process as our evolutionary model. This model has fewer parameters than a standard amino acid evolutionary model: only the transition/transversion rate ratio, the nonsynonymous/synonymous rate ratio (ω = d_N/d_S), and codon usage. Similar to earlier work, we assume that ω is a random variable that takes a set of discrete values with different probabilities. Values with ω > 1 model sites under positive selection. We use the Bayesian Monte Carlo method to estimate model parameters, as it allows implementation of complex models of sequence evolution. Here unobserved DNA sequences are sampled from protein sequences according to distributions parametrized by codon usage, based on the fact that the protein sequences and the native protein-encoding DNA sequences share the same phylogenetic tree. The objective is that the sampled DNA sequences should fit this phylogenetic tree as well as the native DNA sequences do. A data set of β-globin sequences from vertebrates is used to verify our model. We are able to detect all eight positively selected sites that were originally reported using native nucleotide sequences. Our work shows that although the nonsynonymous/synonymous rate ratio is defined at the codon level, it can be used to detect selective pressures on amino acid sequences through our implicit codon-based model.

I. INTRODUCTION

Inferring selection pressure at individual amino acid sites provides an important approach for studying the mechanisms of protein evolution and function [22]. Several methods have been developed that can detect functionally important sites based on evolutionary conservation [2, 7, 12, 20]. However, high levels of variability can also signify functional importance [1, 9–11, 14, 17, 24]. The presence of such residues experiencing positive selection can be inferred from the observation that the rate of nonsynonymous nucleotide substitution d_N is higher than that of synonymous substitution d_S in protein-coding genes [6, 23]. One expects ω = d_N/d_S = 1 if no Darwinian selection is acting on the DNA sequences; ω < 1 (selection against new mutations) if there is negative selection; and ω > 1 (selection for new mutations) if there is positive selection. Since the d_N/d_S ratio is a proxy for the strength of selection, it can be used to search for regions of functional importance. For example, Sawyer et al. identified a small segment of the primate TRIM5α protein that experiences positive selection [18], including those involved in mutagenesis, confirming the importance of the segment in species-specific retroviral inhibition. In fact, many proteins have been found to be under positive selection, including proteins that may be involved in immunity against viral attacks, reproduction, and the acquisition of new functions after gene duplication [?].

Although several methods have been developed for detecting the presence of positive selection [11, 19], they are based on an explicit codon substitution model, and all require knowledge of the DNA sequences. This precludes the application of these methods to remotely related protein families, where saturated substitutions occur. Pupko et al. developed a method for detecting positive selection from amino acid sequences [15]. However, this counting method is based on the calculation of chemical distances between residues, and relies on definitions of conservative and radical substitutions based solely on the physicochemical properties of residues. This approach is therefore subjective and may lead to ambiguous conclusions.

In this study, we develop a new model to estimate selection pressure and to infer adaptive evolution using amino acid sequences alone as input. Taking a Markov process as the model of codon substitution, our method estimates the ω = d_N/d_S ratio at individual amino acid sites through an implicit codon model, by translating the amino acid sequences back into likely codon sequences. We use a Bayesian approach to estimate our model parameters, including the ω ratios [5] and the probabilities of usage of individual codons at an amino acid site.

II. Materials and Methods

A. Markov model of codon substitution

We use a reversible continuous time Markov process as our evolutionary model [3]. We assume that mutations occur at the three codon positions independently, and therefore allow only single-nucleotide substitutions to occur instantaneously. Mutations involving more than one position have very small probabilities of occurrence and are ignored. At the codon level, the states of the Markov process are the set C of 61 sense codons. The three nonsense (stop) codons are not considered in the model, as mutations to or from stop codons can be assumed to drastically affect the structure and function of the protein and therefore will rarely survive. We use a 61 × 61 rate matrix Q, whose entries q_ij are the substitution rates between codons over an infinitesimally small time interval. Specifically, we have Q = {q_ij}, where the diagonal element is q_ii = −Σ_{j≠i} q_ij, so that the row sums of Q equal zero. The codon substitution model specifies the relative instantaneous substitution rate from codon i to codon j (i ≠ j) at one site as:

q_{i j} = \begin{cases} 0, & \text{if } i \text{ and } j \text{ differ at 2 or 3 positions}, \\ π_{j}, & \text{for synonymous transversion}, \\ κ π_{j}, & \text{for synonymous transition}, \\ ω π_{j}, & \text{for nonsynonymous transversion}, \\ ω κ π_{j}, & \text{for nonsynonymous transition}. \end{cases}

Parameter κ is the transition/transversion rate ratio, ω is the nonsynonymous/synonymous rate ratio, and π_j is the stationary frequency of codon j. In this model, κ and π are common to all sites, and the ratio ω = d_N/d_S varies among sites.
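As an illustration of the rate definition above, the following minimal Python sketch assembles the 61 × 61 matrix Q for one value of ω under the standard genetic code. The names `rate_matrix`, `is_transition`, `SENSE` and `CODE` are hypothetical and chosen for this example only; they are not taken from the C implementation described in the Results.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c in CODE if CODE[c] != "*"]  # the 61 sense codons
PURINES = {"A", "G"}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (a in PURINES) == (b in PURINES)


def rate_matrix(kappa, omega, pi):
    """Return Q as a list of rows; pi maps each sense codon to its frequency."""
    n = len(SENSE)
    q = [[0.0] * n for _ in range(n)]
    for i, ci in enumerate(SENSE):
        for j, cj in enumerate(SENSE):
            diffs = [(a, b) for a, b in zip(ci, cj) if a != b]
            if len(diffs) != 1:
                continue  # same codon, or codons differing at 2 or 3 positions
            rate = pi[cj]
            if is_transition(*diffs[0]):
                rate *= kappa  # transition
            if CODE[ci] != CODE[cj]:
                rate *= omega  # nonsynonymous change
            q[i][j] = rate
        q[i][i] = -sum(q[i])  # diagonal chosen so that the row sums to zero
    return q
```

With equal codon frequencies, for example, `rate_matrix(2.0, 0.5, {c: 1 / 61 for c in SENSE})` gives the rate κπ_j for the synonymous transition TTT → TTC and ωπ_j for the nonsynonymous transversion TTT → TTA.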

The transition probability matrix of size 61×61 after time t is [8]:

P (t) = {p_{i j} (t)} = P (0) exp (Q \cdot t),

where P(0) = I. Here p_ij(t) represents the probability that codon i will mutate into codon j in time interval t. To ensure that the nonsymmetric rate matrix Q is diagonalizable for easy computation of P(t), we follow reference [21] and require that Q take the form Q = S · D, where D is a diagonal matrix whose entries are the codon compositions, and S is a symmetric matrix whose entries need to be estimated.
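To make the relation P(t) = exp(Q · t) concrete, the sketch below evaluates the matrix exponential for a small rate matrix. It is only an illustration: instead of the diagonalization described above, it substitutes a truncated Taylor series with scaling and squaring, and the function names and default settings (`squarings`, `terms`) are assumptions of this example.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by scaling and squaring a Taylor series."""
    n = len(q)
    scale = t / (2 ** squarings)
    small = [[x * scale for x in row] for row in q]  # Q t / 2^s
    result = identity(n)
    term = identity(n)
    for k in range(1, terms + 1):
        term = mat_mul(term, small)
        term = [[x / k for x in row] for row in term]  # (Q t / 2^s)^k / k!
        result = [[r + x for r, x in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)  # undo the scaling: exp(A) = exp(A/2)^2
    return result
```

For a reversible Q of the form S · D, the rows of the returned matrix sum to 1, π_i p_ij(t) = π_j p_ji(t) holds up to rounding, and every row approaches π as t grows.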

B. Codon usage and DNA sequence sampling

Because of the degeneracy of the genetic code, generating a reliable nucleotide sequence from the amino acid sequence of a protein is a complex problem [13]. Since the amino acid sequences have the same phylogenetic tree as the native nucleotide sequences, our aim is to find probable DNA sequences that are compatible with the amino acid sequences and fit the phylogenetic tree well. The probability of each compatible DNA sequence will be estimated. This is a more realistic goal than finding the exact native DNA sequence. For amino acid residue type k, let the number of synonymous codons be s_k. The unequal usage of the s_k codons can be modeled by assigning s_k weights w₁, ···, w_{s_k}. Since they sum to 1, there are s_k − 1 free parameters to be estimated. According to the numbers of synonymous codons listed in Table I, there are 3 × (6 − 1) + 5 × (4 − 1) + 1 × (3 − 1) + 9 × (2 − 1) + 2 × (1 − 1) = 41 free codon usage parameters. We use this set of parameters W = (w₁, ···, w₄₁) to sample putative DNA sequences that are compatible with the given amino acid sequences.

Table I.

Amino acids and the number of synonymous codons.

Amino acid	L, S, R	V, A, G, P, T	I	F, C, Y, Q, N, H, D, E, K	M, W
Number of synonymous codons	6	4	3	2	1
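The count of 41 free parameters and the back-translation step can be sketched in a few lines of Python. This is a minimal illustration under the standard genetic code; `SYNONYMS`, `free_parameters` and `sample_dna` are hypothetical names, and the weights passed to `sample_dna` stand in for the codon usage parameters W.

```python
import random
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group the 61 sense codons by the amino acid they encode.
SYNONYMS = {}
for codon, aa in CODE.items():
    if aa != "*":
        SYNONYMS.setdefault(aa, []).append(codon)


def free_parameters():
    """Each amino acid with s_k synonymous codons contributes s_k - 1 weights."""
    return sum(len(codons) - 1 for codons in SYNONYMS.values())


def sample_dna(protein, weights, rng):
    """Back-translate a protein, drawing each codon from weights[amino acid]."""
    return "".join(
        rng.choices(SYNONYMS[aa], weights=weights[aa], k=1)[0] for aa in protein
    )
```

For example, with uniform weights `{aa: [1.0] * len(c) for aa, c in SYNONYMS.items()}` and `rng = random.Random(0)`, `sample_dna("MVHLTPEEK", weights, rng)` returns one of the many 27-nucleotide sequences that translate back to the same peptide; the peptide here is an arbitrary short example.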


C. Likelihood function of a fixed phylogeny

We assume that a reasonably accurate phylogenetic tree T = (V, E) is given. Here V is the set of nodes, namely the union of the set of s observed sequences (leaf nodes) and the set I of s − 1 ancestral sequences (internal nodes), and E is the set of edges of the tree. For an alignment of s codon sequences of length n, let the vector x_h = (x₁, ···, x_s)^T represent the observed codons at position h for the s sequences, where h ranges from 1 to n. Without loss of generality, we assume that the root of the phylogenetic tree is an internal node k. For node k and node l separated by divergence time t_kl, the time-reversible probability of observing codon x_k at position h at node k and codon x_l at the same position at node l is:

π_{x_{k}} p_{x_{k} x_{l}} (t_{k l}) = π_{x_{l}} p_{x_{l} x_{k}} (t_{k l}) .

Given a set S of s DNA sequences (x₁, ···, x_n) back-translated from multiple-aligned amino acid sequences of length n based on the proposed codon usage W, the specified topology of the phylogenetic tree T, and the set of edges, the probability of observing the s codons x_h at position h is a sum over all possible codon assignments to the interior nodes of the phylogenetic tree:

f (x_{h} ∣ W, T, κ, ω_{h}, π) = π_{x_{k}} \sum_{\substack{i \in I \\ x_{i} \in C}} \prod_{(i, j) \in E} p_{x_{i} x_{j}} (t_{i j}) .

Let ω = (ω₁, ···, ω_n). Assuming independence of the substitutions among residues, after summing over the set C of all possible codon types for the internal nodes I, the probability of observing the multiple-aligned amino acid sequences is:

\begin{aligned} f (S ∣ W, T, κ, ω, π) &= f (x_{1}, \dots, x_{n} ∣ W, T, κ, ω, π) \\ &= \prod_{h = 1}^{n} f (x_{h} ∣ W, T, κ, ω_{h}, π) . \end{aligned}
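The sum over all codon assignments to the internal nodes does not have to be enumerated explicitly; it is commonly evaluated with the pruning recursion of Felsenstein [3]. The sketch below shows that recursion for a single site, with the root frequency π included in the sum over root states. The tree encoding (a leaf is an integer state index, an internal node is a list of `(child, branch_matrix)` pairs) and the function names are assumptions made for this example.

```python
def partial(node, n_states):
    """Return L[x] = P(observed leaves below node | node is in state x)."""
    if isinstance(node, int):  # leaf: the observed state has probability 1
        return [1.0 if x == node else 0.0 for x in range(n_states)]
    like = [1.0] * n_states
    for child, p in node:  # p[x][y] is the transition probability on the branch
        below = partial(child, n_states)
        for x in range(n_states):
            like[x] *= sum(p[x][y] * below[y] for y in range(n_states))
    return like


def site_likelihood(root, pi):
    """Weight the root conditional likelihoods by the stationary frequencies."""
    return sum(f * l for f, l in zip(pi, partial(root, len(pi))))
```

In the codon setting the state space would be the 61 sense codons and each branch matrix would be P(t_ij) for that branch; the work per site then grows linearly with the number of nodes rather than exponentially.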

To detect positive selection, we follow the “M3” model of Yang et al. (2000) [23]. There are three categories of the nonsynonymous/synonymous rate ratio ω at each site. Let the proportions of codon sites in the three categories be p₁, p₂, and p₃, where p₁ + p₂ + p₃ = 1, and let the corresponding ratios be ω₁ < ω₂ < ω₃. Categories with ω > 1 model sites under positive selection. Let p₁ = (p_{1,1}, ···, p_{1,n}), p₂ = (p_{2,1}, ···, p_{2,n}), and p₃ = (p_{3,1}, ···, p_{3,n}), and denote p = (p₁, p₂, p₃). The probability of observing the data S is:

f (S ∣ W, T, κ, π, ω, p) = \prod_{h = 1}^{n} \left( \sum_{K_{h} = 1}^{3} f (S_{h} ∣ W, T, κ, π, ω_{K_{h}}, K_{h}) \, p_{K_{h}} \right),

(1)

where K_h is the selection category at position h and π is the vector of stationary codon frequencies. This can be used to calculate the log-likelihood function ℓ = log f(S | W, T, κ, π, ω, p). When the tree is given and π is fixed to the empirical frequencies in the sampled DNA sequences, this model has only 47 parameters. Among these, ω₁, ω₂, ω₃, p₁, p₂ are the 5 site-specific parameters to be estimated, κ is a site-independent parameter, and W contains 41 parameters. We use θ to denote all the parameters that need to be estimated.
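Per-site likelihoods on a tree of many sequences are very small numbers, so Eqn 1 is usually evaluated on the log scale. The following sketch assumes that the per-site, per-category log-likelihoods have already been computed and combines them with the log-sum-exp trick; `mixture_log_likelihood` and its argument layout are hypothetical.

```python
import math


def mixture_log_likelihood(log_f, p):
    """Log of Eqn 1.

    log_f[h][c] is log f(S_h | ..., omega_c) for site h and category c;
    p[c] is the (strictly positive) proportion of category c.
    """
    total = 0.0
    for row in log_f:
        terms = [lf + math.log(pc) for lf, pc in zip(row, p)]
        top = max(terms)
        # log(sum(exp(terms))) computed without underflow
        total += top + math.log(sum(math.exp(v - top) for v in terms))
    return total
```

The parameter count of the model follows directly from the text: 5 (ω₁, ω₂, ω₃, p₁, p₂) + 1 (κ) + 41 (codon usage) = 47.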

D. Bayesian estimation

Our goal is to estimate the values of the free parameters. Here we adopt a Bayesian approach, where parameter estimates are obtained from samples drawn from the posterior probability distribution of the parameter.

We use a prior distribution π(θ) to encode our prior knowledge of the free parameters. We then describe θ by a posterior distribution π(θ|S), which summarizes the prior information available on θ and the information contained in the observations S of the multiple sequence alignment.

After combining the prior information π(θ) and the likelihood function f(S|θ, T) (Eqn 1), the posterior distribution π(θ|S, T) can be estimated up to a constant as:

π (θ ∣ S, T) \propto f (S ∣ θ, T) \cdot π (θ) .

Our goal is to estimate the posterior means of parameters in θ:

E_{π} (θ) = \int θ \cdot π (θ ∣ S, T) d θ .

E. Prior distribution

We choose “noninformative” priors that have the smallest effect on the results of the analysis. For the ratio parameter κ, the prior is taken as a beta distribution B(κ; α = 1.0, β = 1.0). The other ratio parameters ω₁, ω₂ and ω₃ are also drawn from the beta distribution B(ω_i; α = 1.0, β = 1.0).

The family of Dirichlet distributions is the natural choice for specifying priors of codon usage W. Dirichlet priors assign densities to groups of parameters that measure proportions (i.e., parameters that must sum to 1). For a specific amino acid type k, the Dirichlet prior of codon usage π(w₁, ···, w_{s_k}) = Dir(w₁, ···, w_{s_k} | α₁, ···, α_{s_k}) has s_k parameters, each corresponding to the relative frequency of one synonymous codon. We use uniform Dirichlet priors by setting all α_j to 1.0, which means that every combination of the parameters is assigned the same prior density. The prior distribution π(p₁, p₂, p₃) for the frequencies of the ω selection categories is similarly taken as Dir(p₁, p₂, p₃ | α₁, α₂, α₃), with α₁ = α₂ = α₃ = 1.0.

F. Positively selected sites

In this work, we follow an approach similar to that of Huelsenbeck and Dyer (2004) [5] and take advantage of a full Bayesian approach to determine the probability that each amino acid site is under positive selection. Assume that ω_{K_i=3} > 1.0. The probability that the i-th codon site is in the positive class K_i = 3 is obtained by integrating over all possible combinations of transition/transversion rate ratios, nonsynonymous/synonymous rate ratios and codon usage. The likelihood of each site is calculated under each of the three ω values, and the values are then summed to give the site likelihood. The posterior probability of the site being positively selected is the proportion of this sum originating from categories that are positively selected (ω > 1).

f (K_{i} = 3 ∣ S) = \frac{f (S ∣ K_{i} = 3) p_{3}}{\sum_{j = 1}^{3} f (S ∣ K_{i} = j) p_{j}}

Sites at which this probability is larger than a threshold value (say, 90, 95, or 99%) are identified as potentially under positive selection.
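For a single draw of the parameters, the class posterior above is a short computation. The sketch below assumes the three per-category site likelihoods are already available; the function names, the 1-based site numbering and the default threshold of 0.95 are choices made for this illustration.

```python
def positive_selection_posterior(site_like, p):
    """site_like[c] = f(S | K_i = c + 1); returns the posterior of class 3."""
    weighted = [f * pc for f, pc in zip(site_like, p)]
    return weighted[2] / sum(weighted)


def flag_sites(all_site_like, p, threshold=0.95):
    """Return the 1-based indices of sites whose class-3 posterior exceeds threshold."""
    return [
        i + 1
        for i, site_like in enumerate(all_site_like)
        if positive_selection_posterior(site_like, p) > threshold
    ]
```

In the full Bayesian procedure this quantity would be computed for every retained MCMC sample and then averaged, so that uncertainty in κ, ω, p and codon usage is integrated out.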

G. Markov Chain Monte Carlo

Since it is impossible to directly integrate the marginal distributions for posterior probability calculation, we run a Markov chain to generate samples drawn from the target distribution π(θ|S, T). Starting from θ_t at time t, we generate a new θ_t₊₁ using the proposal function T(θ_t, θ_t₊₁). The proposed new state θ_t₊₁ will be either accepted or rejected, depending on the outcome of an acceptance rule r(θ_t, θ_t₊₁). Equivalently, the actual transition function is:

A (θ_{t}, θ_{t + 1}) = T (θ_{t}, θ_{t + 1}) \cdot r (θ_{t}, θ_{t + 1}) .

To ensure that the Markov chain will reach stationary state, we need to satisfy the requirement of detailed balance, i.e.,

π (θ_{t} ∣ S, T) \cdot A (θ_{t}, θ_{t + 1}) = π (θ_{t + 1} ∣ S, T) \cdot A (θ_{t + 1}, θ_{t}) .

This is achieved by using the Metropolis-Hastings acceptance ratio r(θ_t, θ_t₊₁) to either accept or reject θ_t₊₁, depending on whether the following inequality holds:

u \leq r (θ_{t}, θ_{t + 1}) = \min \left\{ 1, \frac{π (θ_{t + 1} ∣ S, T) \cdot T (θ_{t + 1}, θ_{t})}{π (θ_{t} ∣ S, T) \cdot T (θ_{t}, θ_{t + 1})} \right\},

where u is a random number drawn from the uniform distribution U[0, 1]. With the assumption that the underlying Markov process is ergodic, irreducible, and aperiodic, a Markov chain generated following these rules will reach the stationary state [16].

We collect m correlated samples of the parameter vector θ after the Markov chain has reached its stationary state. The posterior means of the parameters are then estimated as the sample average:

E_{π} (θ) \approx \frac{1}{m} \sum_{i = 1}^{m} θ_{i} .

In this paper, the state space of this Markov chain includes the transition/transversion rate ratio κ, the codon usage parameters W, the probabilities of being in the three selection classes, and the ω values for those selection classes. The chain is constructed by randomly selecting a parameter, proposing a new state for the parameter, and deciding whether the new state is accepted or rejected.
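The accept/reject loop can be illustrated with a generic Metropolis-Hastings sampler. The sketch below is deliberately simplified: it samples a single scalar from a toy target, and it assumes a symmetric proposal so that the two T terms cancel in the acceptance ratio. None of the names come from the C implementation; the posterior mean is approximated by averaging the retained samples.

```python
import math
import random


def metropolis_hastings(log_target, theta0, propose, n_steps, rng):
    """Sample from exp(log_target) with a symmetric proposal function."""
    theta, log_p = theta0, log_target(theta0)
    samples = []
    for _ in range(n_steps):
        cand = propose(theta, rng)
        log_p_cand = log_target(cand)
        # accept when u <= min(1, pi(cand) / pi(theta)), tested on the log scale
        if math.log(1.0 - rng.random()) <= log_p_cand - log_p:
            theta, log_p = cand, log_p_cand
        samples.append(theta)  # a rejected move repeats the current state
    return samples


def log_beta_2_5(x):
    """Unnormalized log density of a Beta(2, 5) toy target on (0, 1)."""
    if not 0.0 < x < 1.0:
        return -math.inf
    return math.log(x) + 4.0 * math.log(1.0 - x)


def uniform_step(x, rng):
    return x + rng.uniform(-0.2, 0.2)
```

Running `metropolis_hastings(log_beta_2_5, 0.5, uniform_step, 20000, random.Random(1))`, discarding an initial burn-in segment and averaging the remainder should give a value close to the exact mean 2/7 of the toy target.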

H. Move set

The move set determines the proposal function, which is critical for the rapid convergence of a Markov chain. Parameters ω₁, ω₂ and ω₃ were initially set to 0.1, 1.0, and 3.0, respectively. Their current values were changed by adding or subtracting, with equal probability, a random value drawn uniformly from intervals of widths 0.1, 0.5 and 1.0, respectively. If the proposed new value $ω_{i}^{t + 1}$ for the current $ω_{i}^{t}$ is outside the range of allowed values, $∣ ω_{i}^{t + 1} - ω_{i}^{t} ∣$ is instead subtracted from or added to $ω_{i}^{t}$ so that the new value is within the valid range (10⁻⁵ < ω₁ < 0.1, 0.1 < ω₂ < 3, 1 < ω₃ < 10³). The κ parameter was initialized to 1.0, and each new value was taken as the ratio of two Dirichlet proposal values; its valid range is between 0.001 and 1000. For the probabilities of belonging to one of the three classes (p₁, p₂, p₃), the new values ($p_{1}^{t + 1}, p_{2}^{t + 1}, p_{3}^{t + 1}$) are sampled from a Dirichlet distribution Dir(p₁, p₂, p₃ | α₁, α₂, α₃), where α_i is set to $p_{i}^{t} / \sum_{i} p_{i}^{t}$. Similarly, the parameters $W^{t + 1} = (w_{1}^{t + 1}, \dots, w_{s_{k}}^{t + 1})$ for codon usage are drawn from a Dirichlet distribution Dir(w₁, ···, w_{s_k} | α₁, ···, α_{s_k}), where α_i is set to $w_{i}^{t} / \sum_{i} w_{i}^{t}$. Since the sampling space of codon usage is much larger than that of the other model parameters, we assigned 65% of the moves to changes in W and 35% of the moves to all other model parameters. Because residues in sequences belonging to the same species are likely to adopt similar codon usage, we constrain all amino acids in one sequence to share the same codon usage. This improves the mixing of the Markov process. Further improvement can be obtained by assigning initial values of the codon usage parameters W based on statistics of codon frequency in genomic sequences.
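The two kinds of proposal described above can be sketched as follows. This is one possible reading of the text, not the authors' code: the step size is assumed to be uniform on [0, width], the current value is kept when neither direction stays in range, and the Dirichlet draw is built from gamma variates. The function names are hypothetical.

```python
import random


def propose_omega(current, width, low, high, rng):
    """Uniform step of random sign; reverse the step if it leaves (low, high)."""
    step = rng.uniform(0.0, width) * rng.choice((-1.0, 1.0))
    for cand in (current + step, current - step):
        if low < cand < high:
            return cand
    return current  # assumption: keep the old value if both directions fail


def propose_dirichlet(current, rng):
    """Draw from Dir(alpha) with alpha_i = current_i / sum(current).

    All entries of current must be strictly positive.
    """
    total = sum(current)
    draws = [rng.gammavariate(x / total, 1.0) for x in current]
    norm = sum(draws)
    return [d / norm for d in draws]
```

Because the Dirichlet proposal depends on the current state, it is not symmetric, so the ratio T(θ_t₊₁, θ_t) / T(θ_t, θ_t₊₁) has to be kept in the Metropolis-Hastings acceptance ratio for these moves.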

I. Dataset

We use a dataset of 17 β-globin sequences from vertebrate species. Each sequence contains 144 amino acids. Protein sequences were translated from the dataset originally collected by Yang et al. [23, 25] from the EMBL and GenBank databases. The phylogenetic tree was built with ProML [4], which implements a maximum likelihood estimator for protein amino acid sequences. To reduce the computational complexity, we fix the phylogenetic tree during the entire Markov process. The amino acid sequences used here are available at (http://gila.bioengr.uic.edu/lab/dataset/beta-globin.fasta).

III. Results

This method was implemented in C, and part of the data structures and the functions for matrix operations and statistical distributions were adapted from the program MrBayes v3.1 [5]. The multiple sequence alignment of β-globin amino acid sequences and the phylogenetic tree were used as input, and the algorithm was run with different numbers of Markov moves ranging from 10,000 to 400,000. Samples were taken every 100th move. The first 200,000 samples were discarded as the chain is still in the “burn-in” period. In this study, we only collect samples if there are positively selected sites with posterior probability > 0. Figure 1 shows the log probability of observing the data for each sampled state at different time steps. The chain was started with a fixed tree and branch lengths. The likelihood of the initial state is poor, and the chain quickly found parameters that could explain the data better. After about 225,000 steps, a plateau in the log likelihood is reached. The probabilities that each site was under the three different types of selection pressure were calculated for each sample. Figure 2 shows the average posterior probability of each individual site being under positive selection. When the threshold for the posterior probability is set to 99.9%, the detected positively selected sites are completely in agreement with the results of Yang et al., which were derived from native nucleotide sequences (Table II).

We also tested protein sequences of the HIV-1 env V3 region. The DNA sequences were analyzed by Yang et al. (2000), and sites 28, 66, and 87 were identified as under positive selection pressure with high posterior probability. Among these, our method detected site 28 to be under significant positive selection pressure (p > 98%), and site 66 to be under weak positive selection pressure.

Figure 1. The log likelihood of the current state over the course of the MCMC analysis.

Figure 2. The average posterior probability of each site being in the positively selected class.

Table II.

Positively selected sites with high average posterior probability.

Site	Average posterior probability, MCMC (100,000 samples)	Average posterior probability, MCMC (400,000 samples)
*7	0.996870	0.999973
*42	0.999477	0.999979
*48	0.999562	0.999830
*50	0.999998	0.999990
*54	0.973350	0.999897
*67	0.999987	0.999521
70	0.977635	0.006991
74	0.897890	0.251067
*85	0.993659	0.999806
87	0.999981	0.333273
110	0.597621	0.355364
*123	0.999434	0.999060


IV. CONCLUSION

Studying the evolutionary history of proteins is an essential task. A powerful method for detecting selection pressure is to consider the ratio of nonsynonymous to synonymous substitution rates. However, the applicability of this method is limited, as it cannot be applied to remotely related proteins when saturated substitutions occur. In addition, it requires the availability of DNA sequences, which are often difficult to obtain in practice. Currently, no method is available that detects selection pressure based on this ratio using amino acid sequences alone. In the current work, we have developed an implicit codon model to infer positive selection directly from amino acid sequences at a relatively high level of accuracy. Our method generates possible underlying DNA sequences from known protein sequences. For a given amino acid sequence set and a phylogenetic tree, a reversible continuous time Markov process was used as the evolutionary model at the amino acid level. A data set of β-globin sequences from vertebrates was used to verify our method. We are able to detect all eight residue sites known to experience positive selection, which were originally derived from native nucleotide sequences by Yang et al. For this test, our model is as effective as the traditional DNA sequence based method, but with the promise of much wider applicability. This work opens a new way to detect selection pressure by directly examining protein sequences, which are far easier to obtain than DNA sequences. Although the rate ratio of nonsynonymous to synonymous substitution is defined at the codon level, our work shows that it can be generalized to detect selective pressures from amino acid sequences by applying our Bayesian Monte Carlo method to an implicit codon-based model.

Acknowledgments

This work is supported by grants from NSF (CAREER DBI0133856), NIH (GM68958 and GM079804), and ONR (N000140310329). We thank Dr. Chung-I Wu for helpful discussion.

References

1. Bielawski J, Dunn K, Sabehi G, Beja O. Darwinian adaptation of proteorhodopsin to different light intensities in the marine environment. Proc Natl Acad Sci U S A. 2004;101:14824–9. doi: 10.1073/pnas.0403999101.
2. Casari G, Sander C, Valencia A. A method to predict functional residues in proteins. Nat Struct Biol. 1995;2:171–8. doi: 10.1038/nsb0295-171.
3. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981;17:368–76. doi: 10.1007/BF01734359.
4. Felsenstein J. Maximum-likelihood estimation of evolutionary trees from continuous characters. Am J Hum Genet. 1973;25:471–92.
5. Huelsenbeck J, Dyer K. Bayesian estimation of positively selected sites. J Mol Evol. 2004;58:661–72. doi: 10.1007/s00239-004-2588-9.
6. Kimura M. The neutral theory of molecular evolution. Cambridge: Cambridge University Press; 1983.
7. Li L, Shakhnovich E, Mirny L. Amino acids determining enzyme-substrate specificity in prokaryotic and eukaryotic protein kinases. Proc Natl Acad Sci U S A. 2003;100:4463–8. doi: 10.1073/pnas.0737647100.
8. Lio P, Goldman N. Models of molecular evolution and phylogeny. Genome Res. 1998;8:1233–44. doi: 10.1101/gr.8.12.1233.
9. Massingham T, Goldman N. Detecting amino acid sites under positive selection and purifying selection. Genetics. 2005;169:1753–62. doi: 10.1534/genetics.104.032144.
10. Nielsen R, Huelsenbeck J. Detecting positively selected amino acid sites using posterior predictive P-values. Pac Symp Biocomput. 2002:576–88. doi: 10.1142/9789812799623_0054.
11. Nielsen R, Yang Z. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics. 1998;148:929–36. doi: 10.1093/genetics/148.3.929.
12. Panchenko A, Kondrashov F, Bryant S. Prediction of functional sites by analysis of sequence and structure conservation. Protein Sci. 2004;13:884–92. doi: 10.1110/ps.03465504.
13. Pesole G, Attimonelli M, Liuni S. A backtranslation method based on codon usage strategy. Nucleic Acids Res. 1988;16:1715–28. doi: 10.1093/nar/16.5.1715.
14. Podlaha O, Zhang J. Positive selection on protein-length in the evolution of a primate sperm ion channel. Proc Natl Acad Sci U S A. 2003;100:12241–6. doi: 10.1073/pnas.2033555100.
15. Pupko T, Sharan R, Hasegawa M, Shamir R, Graur D. A chemical-distance-based test for positive Darwinian selection. WABI ’01: Proceedings of the First International Workshop on Algorithms in Bioinformatics; 2001. pp. 142–155.
16. Robert CP, Casella G. Monte Carlo statistical methods. New York: Springer-Verlag Inc; 2004.
17. Sainudiin R, Wong W, Yogeeswaran K, Nasrallah J, Yang Z, Nielsen R. Detecting site-specific physicochemical selective pressures: applications to the Class I HLA of the human major histocompatibility complex and the SRK of the plant sporophytic self-incompatibility system. J Mol Evol. 2005;60:315–26. doi: 10.1007/s00239-004-0153-1.
18. Sawyer S, Wu L, Emerman M, Malik H. Positive selection of primate TRIM5alpha identifies a critical species-specific retroviral restriction domain. Proc Natl Acad Sci U S A. 2005;102:2832–7. doi: 10.1073/pnas.0409853102.
19. Suzuki Y, Gojobori T. A method for detecting positive selection at single amino acid sites. Mol Biol Evol. 1999;16:1315–28. doi: 10.1093/oxfordjournals.molbev.a026042.
20. Tseng Y, Liang J. Estimation of amino acid residue substitution rates at local spatial regions and application in protein function inference: a Bayesian Monte Carlo approach. Mol Biol Evol. 2006;23:421–36. doi: 10.1093/molbev/msj048.
21. Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 2001;18:691–9. doi: 10.1093/oxfordjournals.molbev.a003851.
22. Yang Z. The power of phylogenetic comparison in revealing protein function. Proc Natl Acad Sci U S A. 2005;102:3179–80. doi: 10.1073/pnas.0500371102.
23. Yang Z, Nielsen R, Goldman N, Pedersen A. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 2000;155:431–49. doi: 10.1093/genetics/155.1.431.
24. Yang Z, Nielsen R, Hasegawa M. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol Biol Evol. 1998;15:1600–11. doi: 10.1093/oxfordjournals.molbev.a025888.
25. Yang Z, Swanson W. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol Biol Evol. 2002;19:49–57. doi: 10.1093/oxfordjournals.molbev.a003981.
