Author manuscript; available in PMC: 2019 Feb 1.
Published in final edited form as: Proteins. 2017 Dec 12;86(2):218–228. doi: 10.1002/prot.25429

A New Parameter-Rich Structure-Aware Mechanistic Model for Amino Acid Substitution During Evolution

Peter B Chi 1,2, Dohyup Kim 3,4, Jason K Lai 3,5, Nadia Bykova 3,6, Claudia C Weber 1, Jan Kubelka 7, David A Liberles 1,3,*
PMCID: PMC5897152  NIHMSID: NIHMS923430  PMID: 29178386

Abstract

Improvements in the description of amino acid substitution are required to develop better pseudo-energy-based protein structure-aware models for use in phylogenetic studies. These models are used to characterize the probabilities of amino acid substitution and enable better simulation of protein sequences over a phylogeny. A better characterization of amino acid substitution probabilities in turn enables numerous downstream applications, such as detecting positive selection, ancestral sequence reconstruction, and evolutionarily-motivated protein engineering. Many existing Markov models for amino acid substitution in molecular evolution disregard molecular structure and describe the amino acid substitution process during longer evolutionary periods poorly. Here, we present a new model upgraded with a site-specific parameterization of pseudo-energy terms in a coarse-grained force field, which describes local heterogeneity in physical constraints on amino acid substitution better than a previous pseudo-energy-based model, at minimal cost in runtime. The importance of each weight term parameterization in characterizing underlying features of the site, including contact number, solvent accessibility, and secondary structural elements, was evaluated, returning both expected and biologically reasonable relationships between model parameters. This results in the acceptance of proposed amino acid substitutions that more closely resemble the site-specific frequencies observed in gene family alignments. The modular site-specific pseudo-energy function is made available for download through the following website: https://liberles.cst.temple.edu/Software/CASS/index.html.

Keywords: coarse grained force field, protein evolution, mathematical model, macromolecular structure, sequence analysis, SH2 domain

Introduction

Amino acid substitution models underpin inference in many areas of bioinformatics and molecular evolution, including multiple sequence alignment, phylogenetic tree construction, ancestral sequence reconstruction, the detection of positive selection, and evolutionarily-motivated protein engineering [1]. Models are used to describe the probabilities of different types of substitutions becoming fixed in a population and observed as differences between species. Here, less probable changes result in larger estimated evolutionary distances between sequences. Many commonly used methods are structured as Markov models and assume that each site evolves independently and according to the same underlying process. One common variant is to use gamma distributed rates to account for site heterogeneity [2]. However, it has long been realized that protein structure and inter-atomic interactions within the structure lead to amino acid substitution probabilities that are dependent upon the sequence context at other sites [3] as well as population genetic parameters.

It is expected that site-independent Markov models of proteins will lead to a sampling of the equilibrium frequencies of the model at every site after a sufficiently long time, which is not how proteins actually evolve. In addition, whether an amino acid is acceptable at a given site depends on the states at other sites. To better account for interactions between sites, statistical potential models like Miyazawa-Jernigan (MJ) have been used to characterize amino acid substitution [3, 4]. The MJ matrix is used to evaluate substitution probabilities from a knowledge-based sum of pairwise contact energies between amino acids. However, this does not suffice to describe protein evolution. It ignores that the distance between atoms in a protein is a continuous quantity, using a discrete contact-or-no-contact scheme instead, and it disregards the angular information that is considered in common force field models for molecular mechanics. It also uses average effects for amino acid interactions that are not specific to the particular protein context. Several studies [5–7] have therefore sought to build fuller physical descriptions of atomic contacts into models for amino acid substitution, including the generation of the CASS simulation software using one of these models [8]. These models are still simplistic, and a recent review pointed towards multiple avenues where the physical realism of protein models could be improved [9]. One particular problem addressed here is treating all sites with the same physical process model. This leads to potential problems in the inference of individual amino acid substitution probabilities, which can affect conclusions drawn from such evolutionary studies.

The use of more mechanistic models that better capture evolutionary and biophysical processes reflects an ongoing shift from fitted codon models like Muse-Gaut [10] and Goldman-Yang [11], which do not explicitly consider amino acid fixation probabilities, to a newer class of mutation-selection models [12–16] based upon the Halpern-Bruno framework [17]. These models include both a probability of introducing a mutation and a probability of fixing that introduced mutation in a population, based upon a vector of sitewise amino acid fitnesses or equilibrium frequencies. While current implementations of mutation-selection models do not consider protein structure or energetics in evaluating the probability of fixation of an amino acid, suitable structurally aware models of amino acid substitution would enable such a future implementation in this framework.

While more parameter rich models have typically been used for simulation rather than inference, there is an inherent relationship between the two. First, both simulation and inference require the evaluation of explicit probabilities of substitution. Second, approaches involving simulation can ultimately be used for inference with Approximate Bayesian Computation (ABC) [18].

In this work, a force field and corresponding fitness function that could be used for either sequence simulation or inference of amino acid substitution probabilities is used to characterize the fit of different amino acids into a given structure from energetic considerations, and sequence context is parameterized in a site-specific manner instead of a global manner. Rather than obtaining a single set of parameters for the entire protein, parameters are estimated for each site prior to the simulation such that the model remains computationally efficient. The physical model upon which it is based explicitly considers the specific conformations of the protein backbone and the amino acid side chains in assessing sequence fit within a structure. An evolutionary characterization of the properties of this model is presented here, showing that systematically generated single-mutant amino acid sequences are closer to the set of closely related SH2 domain homologs than those generated from a set of global weights. Sequences show larger pseudo-energy gaps between native and non-native structures and are also more similar to known SH2 sequences.

Methods

As in Grahnen et al. [7], a two-bead model was used to represent each amino acid (shown in Figure 1). Thus, we do not explicitly model all atoms, but rather, we use one bead to represent the central carbon atom and a second bead to represent the center of the side chain atoms. Biophysical properties based on structure and interactions with neighboring beads are modeled with terms described in Grahnen et al. [7]. Briefly, the scoring function for protein stability includes seven terms: 1) a harmonic bending potential; 2) the van der Waals interaction term, as approximated by a sum of Lennard-Jones potentials; 3) entropic and steric effects of the backbone forming an alpha-helix; 4) a beta-sheet potential, similar to the alpha-helix potential; 5) an electrostatic potential based on Coulomb’s Law; 6) a potential due to solvation of the protein; and 7) an estimate of the stability due to formation of disulfide bonds. Together, these make our scoring function as follows:

$$V(s,c) = w_{bend}V_{bend} + w_{LJ}V_{LJ} + w_{helix}V_{helix} + w_{beta}V_{beta} + w_{ion}V_{ion} + w_{solv}V_{solv} + w_{SS}V_{SS}, \tag{1}$$

where each Vx is the potential for each pseudo-energy component (x), and wx is the weight given to the corresponding term (under the restriction that the sum of all wx values is 1). The abbreviations used are a bond bending term (bend), a Lennard-Jones potential (LJ), helix and beta sheet propensities (helix and beta), ionic (Coulombic) potentials (ion), a solvation term (solv), and a disulfide bridge term (SS).
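As a minimal illustration of Equation (1), the weighted sum can be sketched in Python. The term names follow the abbreviations above; the function name, the dictionary layout, and the numeric values are hypothetical and serve only to show the arithmetic.

```python
import math

# Abbreviations of the seven pseudo-energy terms in Equation (1).
TERMS = ("bend", "LJ", "helix", "beta", "ion", "solv", "SS")

def total_score(weights, potentials, tol=1e-9):
    """Return V = sum over x of w_x * V_x for the seven terms."""
    if set(weights) != set(TERMS) or set(potentials) != set(TERMS):
        raise ValueError("weights and potentials must cover all seven terms")
    # The text restricts the weights to sum to 1.
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=tol):
        raise ValueError("weights must sum to 1")
    return sum(weights[t] * potentials[t] for t in TERMS)

# Hypothetical weights and unweighted potentials (illustration only).
example_weights = {"bend": 0.1, "LJ": 0.3, "helix": 0.2, "beta": 0.1,
                   "ion": 0.1, "solv": 0.2, "SS": 0.0}
example_potentials = {"bend": 1.0, "LJ": -2.0, "helix": -1.0, "beta": 0.0,
                      "ion": 0.5, "solv": -3.0, "SS": 0.0}
example_total = total_score(example_weights, example_potentials)
```

In a site-specific setting, one such weight dictionary would be held per residue rather than once per protein.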

Figure 1.

The two-bead model that forms the basis of this analysis in representing side chains is shown. This graphic illustrates the angles and atomic elements that are used to summarize each amino acid. In particular, side chains are reduced to a single bead in this instance.

Of these terms, the alpha-helical, beta-sheet, electrostatic, and disulfide bridge potentials are allowed to have negative values reflecting an anti-propensity for a particular secondary structural element (formally an anti-anti-potential for the secondary structural propensities), charged contacts, or a cross-linking bridge; for the others (Lennard-Jones, solvation potential, and the bending potential), any values that are estimated to be negative initially are bounded by and set to 0 as negative values were less clearly physically justified. Each term in the force field has a particular physical meaning that is meant to characterize an interaction that contributes to a biological description of a folded protein. The precise mathematical form is more or less mechanistic depending upon the term. From the perspective of what each term is meant to fit, some terms have a priori justified selective pressures to prevent their appearance, which are parameterized as anti-propensities. For other terms, it is harder to envision selective pressures existing to prevent their appearance and, in these cases, anti-propensities were disallowed in the spirit of producing a mechanistically justified model. A recent perspectives piece argues for the use of mechanistically interpretable models, even when phenomenological models can result in better discriminators [19].

In contrast to the procedure outlined in Grahnen et al. [7], here we estimate site-specific wx values in each respective scoring function, as opposed to one set of weights for the entire protein. In other words, the original implementation of CASS estimates one global wx value for each term across all amino acid sites in the protein. In this work, we enhance the biological realism of the model by estimating specific wx values for each individual amino acid position, to allow each term to have a different contribution depending on the physical location of the site within the protein conformation. The scoring function is otherwise largely the same, with a few exceptions. First, we modify the alpha helical potential. In the original implementation of CASS, the alpha helical potential is calculated as:

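$$V_{helix} = \sum_{i=3}^{N-3}\left[\tfrac{1}{2}\kappa_i^{1\text{-}3}\left(r_{i,i+2}-\rho_h\right)^2 + \tfrac{1}{2}\kappa_i^{1\text{-}4}\left(r_{i,i+3}-\rho_h\right)^2\right] \tag{2}$$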

In this equation, κi1–3 is the average helical propensity scaling factor for the residues at positions i, i+1 and i+2 (likewise, κi1–4 is the average across positions i, i+1, i+2 and i+3), adapted from Mukherjee and Bagchi [20], with values shown in Table 1. Here, r is the distance between the Cα beads in the corresponding residues and ρh is the equilibrium helix inter-bead distance (5.5 Å). While this formulation is reasonable for describing protein folding, we adapt it to make it more appropriate for assessing the effect of mutational changes in the amino acid sequences. Specifically, we replace the scaling factor with:

$$\kappa_i = \kappa_{alanine} - Hp_i\left(\kappa_{alanine} - \kappa_{glycine}\right), \tag{3}$$

where κalanine and κglycine are force constants for alanine and glycine (which are the maximum and minimum among the force constants for all amino acids, respectively) and Hpi is the helix propensity value for the ith amino acid, from Pace and Scholtz [21]. Additionally, we consider the fact that helical propensities should only be evaluated for residues that are in alpha helices. Thus, we remove the distance term (r − ρh)² because the inter-bead distances within helices are nearly constant, so this term is close to 0. With these changes, our new alpha helix potential becomes:

$$V_{helix} = \sum_{i=3}^{N-3}\left[\tfrac{1}{2}\kappa_i^{1\text{-}3} + \tfrac{1}{2}\kappa_i^{1\text{-}4}\right] \tag{4}$$

using the new κi values as formulated in Equation (3).
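The summation in Equation (4) can be sketched as a short loop. This is an illustrative reading rather than the CASS implementation: it assumes 1-based residue numbering as in the equation, takes a hypothetical list `kappa` of per-residue force constants from Equation (3), and does not model the restriction to residues that lie within alpha helices.

```python
from statistics import mean

def helix_potential(kappa):
    """Equation (4): sum over i = 3..N-3 of 0.5*k13_i + 0.5*k14_i.

    kappa[0] holds the force constant of residue 1. k13_i averages
    residues i, i+1, i+2 and k14_i averages residues i..i+3, following
    the definitions given for Equation (2).
    """
    n = len(kappa)
    total = 0.0
    for i in range(3, n - 3 + 1):           # 1-based residue index i
        k13 = mean(kappa[i - 1:i + 2])      # residues i, i+1, i+2
        k14 = mean(kappa[i - 1:i + 3])      # residues i, i+1, i+2, i+3
        total += 0.5 * k13 + 0.5 * k14
    return total
```

For a chain shorter than six residues the index range is empty and the sketch returns 0.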

Table 1.

Scale values for each amino acid used in the helical potential are shown. These are the values that are used to parameterize that aspect of the pseudo-energy function.

Amino Acid  κi (kJ mol−1 Å−2)
Ala         17.20
Glu         14.45
Leu         13.59
Met         13.07
Arg         13.59
Lys         12.73
Gln         10.49
Ile         10.15
Asp         9.80
Ser         8.60
Trp         8.77
Tyr         8.08
Phe         7.91
Val         6.71
Thr         5.85
His         7.57
Cys         5.50
Asn         6.02
Gly         0.00
Pro         −37.15
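As a consistency check between Equation (3) and Table 1, the equation can be inverted to recover the helix propensity implied by each tabulated force constant. The sketch below reads Equation (3) as κi = κalanine − Hpi(κalanine − κglycine) and reads the Gly entry as 0.00; the function names are hypothetical, and only a few table rows are copied in.

```python
KAPPA_ALA = 17.20   # Table 1 value for Ala (kJ mol^-1 A^-2)
KAPPA_GLY = 0.00    # Table 1 value for Gly, read as 0.00

def kappa_from_propensity(hp):
    """Equation (3): force constant for a residue with helix propensity hp."""
    return KAPPA_ALA - hp * (KAPPA_ALA - KAPPA_GLY)

def implied_propensity(kappa):
    """Inverse of Equation (3): propensity implied by a force constant."""
    return (KAPPA_ALA - kappa) / (KAPPA_ALA - KAPPA_GLY)

# A few rows copied from Table 1.
TABLE1_SUBSET = {"Ala": 17.20, "Leu": 13.59, "Gly": 0.00, "Pro": -37.15}

for residue, kappa in TABLE1_SUBSET.items():
    print(residue, round(implied_propensity(kappa), 2))
```

Under this reading, alanine maps to a propensity of 0 and glycine to 1, and proline, with its negative force constant, maps to a propensity well above 1.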

Additionally, we modify the beta sheet potential. In the original formulation of CASS, the beta sheet potential is given by:

$$V_{beta} = \sum_{i=2}^{N-2}\lambda_i^{1\text{-}4}\left(C_b\left(\varphi_i - \varphi_b\right)\right)^2 \tag{5}$$

where (φi − φb) is the difference in the torsion angle between the equilibrium beta sheet value (210°) and the actual torsion angle at residue i, λi1–4 is the average beta propensity of residues i-1, i, i+1 and i+2, and Cb is a scaling constant. Here, we remove the torsion angle difference since only residues within beta sheets are evaluated.

A set of closely homologous SH2 sequences that are expected to have similar selective pressures on structure was generated to assign weight terms and subsequently evaluate mutational effects. The set of non-identical SH2 sequences was generated from BLAST searches, starting with the sequence from PDB 1D4T, against GenBank [36] and Pfam [37]. The non-identical SH2 sequences with greater than 90% identity and lengths that were neither 50% longer nor 50% shorter than the starting sequence were retained. This resulted in a set of 99 homologs.
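The retention criteria described above can be written as a small predicate. This is a sketch under stated assumptions: sequence identity is taken as a precomputed fraction from the alignment tool, the length bounds are treated as inclusive, and the function and parameter names are hypothetical.

```python
def keep_homolog(identity, length, query_length, min_identity=0.90):
    """Return True if a BLAST hit passes the filters described in the text.

    identity      fraction of identical positions to the query (0..1)
    length        length of the hit sequence
    query_length  length of the starting sequence
    """
    if identity >= 1.0:            # only non-identical sequences are kept
        return False
    if identity <= min_identity:   # must exceed 90% identity
        return False
    # Neither 50% longer nor 50% shorter than the starting sequence.
    return 0.5 * query_length <= length <= 1.5 * query_length
```

For example, `keep_homolog(0.95, 100, 104)` passes, while a hit twice the query length does not.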

To estimate the weight terms (wx), the pseudo-energies (Vx) were first calculated as above for a given protein, its homologs, and its decoys. The unweighted pseudo-energies of both the protein homologs and the decoys were then normalized such that the term scores were within a comparable range. We next calculated the weights as the scaled difference between the mean pseudo-energy of the homolog distribution and the mean pseudo-energy of the decoy distribution. Sites and terms for which the pseudo-energies differ between the homolog and decoy distributions therefore carry a larger weight, whereas sites and terms for which they are similar carry a smaller weight. The set of weights for any given site is scaled to sum to 1, with a maximum absolute value of 1 for any given weight, since negative propensities are allowed for certain terms as previously discussed.
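The weight estimation for a single site might be sketched as follows. Several details here are assumptions that the text does not specify: min-max normalization over the pooled homolog and decoy values, the sign convention (decoy mean minus homolog mean), and scaling by the plain sum of the differences. The names are hypothetical, and the cap on the absolute value of a weight is not enforced.

```python
from statistics import mean

# Terms for which negative estimates are set to 0, as described in the text.
NON_NEGATIVE_TERMS = {"bend", "LJ", "solv"}

def normalize(values, lo, hi):
    """Min-max scale values into [0, 1] (assumed normalization scheme)."""
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in values]

def site_weights(homolog, decoy):
    """homolog, decoy: dict mapping term -> list of unweighted
    pseudo-energies at one site. Returns weights that sum to 1."""
    raw = {}
    for term in homolog:
        pooled = homolog[term] + decoy[term]
        lo, hi = min(pooled), max(pooled)
        h_mean = mean(normalize(homolog[term], lo, hi))
        d_mean = mean(normalize(decoy[term], lo, hi))
        diff = d_mean - h_mean              # assumed sign convention
        if term in NON_NEGATIVE_TERMS:
            diff = max(diff, 0.0)           # no anti-propensity allowed
        raw[term] = diff
    total = sum(raw.values())
    if total == 0:
        raise ValueError("differences sum to zero; weights are undefined")
    return {term: value / total for term, value in raw.items()}
```

When negative weights are permitted, the sum of the differences can be close to zero, so a production implementation would need a more careful scaling rule than this sketch uses.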

We also improved our decoy generation, which reflects the background distribution of contacts in the unfolded, misfolded, and alternatively folded states. In the original implementation of CASS, decoys are generated at random by shuffling the geometrical measurements in the native structure according to the Random Energy Model proposed by Bryngelson and Wolynes [22]. Here, we consider the distribution of the number and length of alpha helices and beta sheets, obtained from proteins with similar lengths in PDB. This information is then applied when generating decoy sets, by randomly assigning secondary structures with probabilities corresponding to their relative frequencies in similar length proteins in the Protein Data Bank [23]. Side chain positions were optimized upon mutation using SARA [24].

To examine the effects of the site-specific parameterization on the evolutionary simulation framework, acceptance probabilities were determined for every amino acid position of the input protein. These probabilities were computed from the deterministic amino acid frequencies of every residue and represent the likelihood a single mutant would be accepted during the simulation. The frequencies were obtained by calculating the pseudo-energy score change upon every possible mutation at a given site (i.e. substitution of each of the twenty amino acids at that specific position in the sequence). Under the assumption that the described pseudo-energy score change follows a Boltzmann distribution, the deterministic acceptance probabilities were calculated and displayed on a sequence logo to illustrate the favorability of the model to accept specific single substitutions.
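A Boltzmann-style conversion from pseudo-energy changes to acceptance probabilities at one site might look like the sketch below. It assumes that a lower pseudo-energy is more favorable and introduces a hypothetical scale parameter `kT`; neither the scale nor the function name comes from the text.

```python
import math

def acceptance_probabilities(delta_scores, kT=1.0):
    """Map pseudo-energy changes to Boltzmann-normalized probabilities.

    delta_scores: dict mapping amino acid -> pseudo-energy score change
                  when that amino acid is substituted at the site.
    """
    lowest = min(delta_scores.values())
    # Subtracting the minimum avoids overflow and leaves the ratios unchanged.
    factors = {aa: math.exp(-(dv - lowest) / kT)
               for aa, dv in delta_scores.items()}
    z = sum(factors.values())
    return {aa: f / z for aa, f in factors.items()}
```

The resulting per-site probabilities sum to 1 and can be passed directly to a sequence-logo tool as column frequencies.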

Statistical analyses were performed using R, version 3.2.5 [25], and the packages broom, dplyr and tidyr for data manipulation and exportation [26–28]. Variance estimates for weight terms were obtained via bootstrapping, where the unit of replication was the entire sequence and structure (homolog or decoy). Multivariate regression was performed using the package car [29], in which the collection of weights for each of the seven terms in the scoring function was used as the multivariate outcome. In this analysis, the observational unit is each site of the protein, and we investigated whether certain properties of each site were correlated with the weights.

Specifically, we investigated the SH2 domain of the signaling protein with PDB ID 1D4T and, for its native sequence, the characteristics of contact number (the number of residues within 6.5 Å), predicted secondary structure, rate factors, and solvent-accessible surface area at each site. Secondary structure was predicted using PSIPRED [30] in single sequence mode, in which each site was predicted to be in an α-helix, β-sheet, or coil. As a control, estimation of secondary structure directly from the 1D4T PDB file using DSSP [31, 32] yielded very similar site-specific characterizations. To estimate rate factors (relative amino acid substitution rate) at each site, we used BppML [33] to first estimate the shape parameter of the discrete gamma distribution with four categories under the empirical amino acid model of Le and Gascuel [34]. Using this, we then calculated the likelihood of the tree for each column with each of the four discrete rate factors (treating gaps as ambiguous). On the basis of these likelihoods, each column was assigned a normalized probability of belonging to each rate factor bin, and these probabilities were used to compute the weighted mean rate across all bins. This gives estimated rate factors at each site that follow a continuous distribution of values bounded at approximately the lowest and highest discrete rate factors. To estimate solvent-accessible surface area, we used DSSP [31, 32] to estimate raw scores, which were then normalized according to the procedure and normalization values described in Tien et al. [35].
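The per-column rate factor calculation can be sketched as a likelihood-weighted mean. The sketch assumes equal prior probability for each of the discrete gamma categories, and the function name and inputs are hypothetical; the per-category log-likelihoods would come from the phylogenetic software.

```python
import math

def mean_rate(log_likelihoods, rates):
    """Weighted mean rate factor for one alignment column.

    log_likelihoods: log-likelihood of the column under each rate category
    rates:           the discrete rate factor of each category
    """
    if len(log_likelihoods) != len(rates):
        raise ValueError("one log-likelihood is needed per rate category")
    top = max(log_likelihoods)
    # Normalize in a numerically stable way (log-sum-exp shift).
    weights = [math.exp(ll - top) for ll in log_likelihoods]
    z = sum(weights)
    return sum(w / z * r for w, r in zip(weights, rates))
```

Because the result is a convex combination of the category rates, it always lies between the lowest and highest discrete rate factors.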

Results

To parameterize a new site-specific substitution model and address its ability to predict observed amino acid substitutions, a structure that is the target of selection was differentiated from a set of decoy structures using sequences known to fold into the target structure. In this particular case, the SH2 domain of the signaling protein with PDB ID 1D4T was used considering both its native sequence and a collection of close homolog sequences as the target structure of selection for a functional protein. We investigated the distribution of putative site-specific weights across the length of the protein, as estimated by our method described above, to create an approach that better describes amino acid substitution than an approach that uses the same set of weights for all sites. With a background distribution estimated from 9,900 decoy states, the set of site-specific weights is chosen to proportionally represent the difference in the energy function between the native vs. the decoy states based upon the underlying term differences, generating a physical description of amino acid sequences from the model that fold stably into the native state as compared with the decoy states. It should be noted that the sequence itself is not being fit, but rather parameters that describe the physical interactions at each site. With the x-axes representing location (i.e. residue number), the profile of the putative weight at each site for each term is shown in Figure 2. We compare the profile under an implementation where no anti-propensities were allowed, and the current regime where some anti-propensities were allowed as described in the Methods section. For both alpha-helix and beta-sheet propensities, the putative weights tended to be higher for residues that are in those respective secondary structures. 
A similar trend is observed for the charge propensity, where sites in which conserved amino acids have charged side chains give rise to larger weight values in our estimation scheme. When no anti-propensities were allowed, the disulfide bridge term was estimated to be zero for the entire protein. This is due to the original native conformation having no disulfide bonds; likewise, the homolog sequences have few cysteine residues, making the probability of any decoy state having novel disulfide interactions very small. Indeed, in our 9,900 decoy structures, not a single disulfide bond existed. When some anti-propensities were allowed, we estimated some non-zero weights for the disulfide bridge term, but none above 1e-03. Altogether, considerable variation in weights across sites was observed, leading to a very different pseudo-energy function than that produced with a common set of weights. While the values for the disulfide bridge term are small, this term was retained in the function to provide a mechanistic approach that is broadly applicable to gene families other than SH2 domains.

Figure 2.

The optimal weight of each term is shown across each site. When allowed to vary, the weights show a high level of variability across sites. There are two parameterizations shown. In one, only positive weight values at each site are allowed, while in the other, some terms are allowed to have negative weight values or anti-propensities, when such a parameterization is a priori physically justified.

In the investigation of the SH2 family, in which we obtained 99 homologous sequences from GenBank [36] and Pfam [37], we compared the resulting specificity of each weighting scheme with respect to the pseudo-energy gap between native sequences and decoy sequences. In the top-left panel of Figure 3, the distributions of the overall energy score from the 99 homologs are shown, both for site-specific weights and whole protein weights; in the bottom-left panel, the background distributions for each method are shown. We observe that the pseudo-energy gap is much larger when using site-specific weights than when using whole protein weights, as shown by comparing the medians of each distribution in the boxplots shown in the right panel. The boxplots show the distribution (obtained via bootstrapping of the pseudo-energy scores themselves) of the difference between the median pseudo-energy score of the homologs and the median pseudo-energy score of the decoys. The much larger pseudo-energy gap indicates a model that better discriminates between the native state and alternative conformations. In general, the more distant the homologs are that still fold into the same structure with the same folding constraints, the more widely applicable the approach is expected to be.
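The bootstrap of the median pseudo-energy gap can be sketched with the standard library. The sign convention (decoy median minus homolog median), the number of replicates, and the fixed seed are assumptions made for illustration; the analysis in the text was performed in R.

```python
import random
from statistics import median

def bootstrap_median_gap(homolog_scores, decoy_scores, n_boot=1000, seed=0):
    """Resample both score sets with replacement and return the list of
    bootstrap differences between decoy and homolog medians."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        h = rng.choices(homolog_scores, k=len(homolog_scores))
        d = rng.choices(decoy_scores, k=len(decoy_scores))
        gaps.append(median(d) - median(h))   # assumed sign convention
    return gaps
```

Running this once with scores from global weights and once with scores from site-specific weights gives the two distributions that a boxplot comparison would summarize.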

Figure 3.

The distribution of pseudo-energy scores using global weights vs. site-specific weights is shown for comparison. Boxplots show the five-number summaries of the distributions of the difference in medians obtained by bootstrapping. With the addition of site-specific weight parameters, there is much larger discrimination between the scores of sequences known to fold into SH2 domains in that fold and their scores in alternative samplings of conformational space (decoys).

In Figure 4, we examine an additional measure of the discrimination attained by the use of site-specific weights in evaluating the fit of mutants generated systematically across the sequence length. Random samples of single-mutant sequences from the original native sequences of an SH2 domain protein are generated, and the relative frequencies of each mutant are represented by the size of the corresponding letter in the WebLogo plot of Figure 4. Figure 4A shows the distribution of sequences after selection based on parameterizations of the scoring function with whole protein weights, while Figure 4B shows the same using site-specific weights. The amino acid frequencies of both were compared to Figure 4C, which shows a sample of curated homologs to the SH2 domain. A Chi-square test statistic, summed across all sites, was used as a measure of how different the distributions of amino acids are. The Chi-square value comparing the sample using site-specific weights to the sample of SH2 homologs is 59717.19, and the Chi-square value comparing the sample using global weights to the sample of SH2 homologs is 60362.96. Although the difference between these two values is not statistically significant, the fact that the Chi-square value is smaller under site-specific weights suggests that evolution under site-specific weights tends to give rise to a distribution of amino acids that is more similar to the distribution of amino acids in actual proteins. Overall, this suggests that modeling protein fitness using site-specific weights gives a closer approximation to the evolutionary process that created the homologs shown in Figure 4C.
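A Chi-square statistic summed across sites can be sketched as below. Which sample plays the role of the observed counts and which supplies the expected frequencies is an assumption here, as is the choice to skip amino acids with zero expected frequency; the function name and data layout are hypothetical.

```python
def summed_chi_square(observed_counts, expected_freqs):
    """Sum the Pearson chi-square statistic over all sites.

    observed_counts: list (one entry per site) of dicts, amino acid -> count
    expected_freqs:  list (one entry per site) of dicts, amino acid -> frequency
    """
    total = 0.0
    for obs, freq in zip(observed_counts, expected_freqs):
        n = sum(obs.values())
        for aa, f in freq.items():
            expected = n * f
            if expected > 0:    # skip categories with no expected mass
                total += (obs.get(aa, 0) - expected) ** 2 / expected
    return total
```

A smaller summed value indicates that the two per-site amino acid distributions are closer to each other.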

Figure 4.

A WebLogo plot shows that sequences evolved under site-specific weights are more similar to the homologs than sequences evolved under global weights. (A) A logo plot obtained from exhaustively sampling proposed single mutants with the global weight set. (B) A logo plot obtained from exhaustively sampling proposed single mutants with the site-specific weight set. (C) A logo plot obtained from the set of phylogenetically filtered homologs in the SAP SH2 domain subfamily, reflecting the sequences that were used to generate the weight sets.

Figure 4B shows more sites under constraint, and stronger constraint at the sites that are constrained. In particular, the strongest constraint in Figure 4A is mostly in selecting hydrophobic residues, including aromatic amino acids, with similar distributions at the sites with the most constraint. In Figure 4B, this is more varied across sites, with clear preferences for charged, aliphatic, and aromatic residues at specific sites. Some differences in amino acid preferences are noted at hydrophobic sites, and some of the sites most strongly constrained to positively charged residues in Figure 4B do not show strong constraint in Figure 4A.

To explore this in more detail, given the lack of statistical significance from the chi-squared distribution, we asked how many positions were well described for different amino acid categories. These putative categories were: 1) aliphatic (for amino acids A, V, L, I, C, M, G); 2) aromatic (for amino acids Y, W, F); 3) charged polar (for amino acids H, K, R, D, E); 4) non-charged polar (S, T, N, Q, P). This categorization was based on Table 2–4 from [38]. As can be seen in Figure 7, the site specific weights improved the description of both aromatic and charged polar terms, while not improving the description of aliphatic amino acid positions. This analysis shows where the model is providing fold-specific information on amino acid substitution and where it needs further improvement, particularly for non-charged polar residues.
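The category-level comparison can be sketched as follows. The four amino acid sets are taken from the text; the criterion used here (a position counts as well described when the model's dominant category matches the dominant category in the alignment) is an assumption, since the exact rule is not stated, and the function names are hypothetical.

```python
CATEGORIES = {
    "aliphatic": set("AVLICMG"),
    "aromatic": set("YWF"),
    "charged polar": set("HKRDE"),
    "non-charged polar": set("STNQP"),
}

def dominant_category(freqs):
    """Return the category with the largest summed frequency at a site."""
    totals = {name: sum(freqs.get(aa, 0.0) for aa in members)
              for name, members in CATEGORIES.items()}
    return max(totals, key=totals.get)

def count_matches(model_sites, alignment_sites):
    """Count, per category, the sites where the model's dominant category
    agrees with the dominant category in the homolog alignment."""
    counts = {name: 0 for name in CATEGORIES}
    for model, aligned in zip(model_sites, alignment_sites):
        category = dominant_category(aligned)
        if dominant_category(model) == category:
            counts[category] += 1
    return counts
```

Calling `count_matches` once with frequencies from global weights and once with frequencies from site-specific weights gives per-category tallies of the kind compared in Figure 7.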

Figure 7.

The comparison of global and site-specific weights in correctly describing positions that were dominated by specific amino acid types is shown. As described in the methods, the set of aliphatic amino acids is [A, V, L, I, C, M, G], the set of aromatic amino acids is [Y, W, F], the set of charged polar amino acids is [H, K, R, D, E], and the set of non-charged polar amino acids is [S, T, N, Q, P]. The third column in each set reflects the number of amino acid sites in each category in the alignment that a model performing well would characterize.

To investigate what drives the parameterization of our site-specific terms, we performed a multivariate regression with the collection of weights as the outcome variables, using data from the SH2 domain on rate factors, contact number, predicted secondary structure, and solvent-accessible surface area at each site as the predictor variables. As seen in Table 2, statistically significant associations are found between the overall collection of weights and contact number (p=0.0203), secondary structure (p=0.0054) and solvent-accessible surface area (p=6.289e-5). As post-hoc analyses, we analyzed the relationship between the same predictor variables and each weight individually, shown in Table 3. In these regressions, one of our interests is in determining whether any of the weights differentiate fast vs. slowly evolving sites, as represented by the rate factor term. At the nominal α=0.05 level, the rate factor term was not significant for any of the weights, but came closest for the Lennard-Jones term (p=0.0701) and the α-helix term (p=0.0581). This suggested that, as expected, residues that were constrained by van der Waals contact energies tended to be more slowly evolving and that alpha helical residues tended to be faster evolving. Residues with beta potentials were negatively correlated with both SASA (the residue solvent accessible surface area) and contact number terms for the residues. The expected relationship between solvent accessible surface area and contact number [39, 40] was observed, given that buried residues have more contacts with other amino acids. Because the weights can work in both directions, as a correlation or an anti-correlation, the sign on the correlation reflects the use of the weight (wsolv) rather than the actual SASA at the site. Charge and SASA were also observed to be correlated, as expected.

Table 2.

Regression analysis with the collection of all weights for each term as the multivariate outcome variable is shown. This analysis relates various structural components listed with the weight values that were parameterized.

TERM         F-STATISTIC  NUM DF  DENOM DF  P-VALUE
RATE FACTOR  1.4123       6       93        0.2182
CONTACT #    2.6502       6       93        0.0203
2° STRUCT    2.4534       12      188       0.0054
SASA         5.5208       6       93        6.289e-5

Table 3.

Post-hoc regression analyses with each weight as the outcome variable. The terms ssE and ssH are for the predicted secondary structure, which was a categorical variable with three values in our data: H (α-helix), E (β-sheet), and C (coil). Thus, coil was taken to be the baseline value; the p-value is for the overall categorical variable, obtained by comparing the regression model with the secondary structure terms and without. Estimates represent the expected difference in the respective outcome variable corresponding to a 1-unit increase in each term, while holding each other term constant. Negative estimates indicate an inverse relationship between the outcome variable and that term.

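| MODEL | TERM | ESTIMATE | STD.ERROR | STATISTIC | P-VALUE |
| --- | --- | --- | --- | --- | --- |
| BENDING | SASA | 0.0617 | 0.0447 | 1.38 | 0.171 |
| BENDING | ratefactor | 0.0175 | 0.0116 | 1.52 | 0.133 |
| BENDING | contacts | 0.000697 | 0.00552 | 0.126 | 0.9 |
| BENDING | ssE | 0.0242 | 0.0161 | 1.51 | 0.294 |
| BENDING | ssH | 0.00516 | 0.0195 | 0.265 | |
| LJ | SASA | −0.0519 | 0.107 | −0.485 | 0.629 |
| LJ | ratefactor | −0.0506 | 0.0276 | −1.83 | 0.0701 |
| LJ | contacts | 0.0342 | 0.0132 | 2.59 | 0.011 |
| LJ | ssE | 0.0107 | 0.0385 | 0.278 | 0.89 |
| LJ | ssH | −0.0107 | 0.0466 | −0.229 | |
| ALPHA | SASA | −0.0968 | 0.147 | −0.659 | 0.512 |
| ALPHA | ratefactor | 0.0728 | 0.038 | 1.92 | 0.0581 |
| ALPHA | contacts | −0.021 | 0.0181 | −1.16 | 0.251 |
| ALPHA | ssE | −0.0806 | 0.0529 | −1.52 | 0.00808 |
| ALPHA | ssH | 0.116 | 0.064 | 1.81 | |
| BETA | SASA | −0.488 | 0.128 | −3.81 | 0.000247 |
| BETA | ratefactor | 0.000515 | 0.0331 | 0.0155 | 0.988 |
| BETA | contacts | −0.0454 | 0.0158 | −2.87 | 0.00502 |
| BETA | ssE | 0.0684 | 0.0462 | 1.48 | 0.00725 |
| BETA | ssH | −0.105 | 0.0559 | −1.89 | |
| CHARGE | SASA | 0.0309 | 0.0148 | 2.1 | 0.0387 |
| CHARGE | ratefactor | 7e-04 | 0.00381 | 0.184 | 0.855 |
| CHARGE | contacts | 0.00199 | 0.00182 | 1.09 | 0.277 |
| CHARGE | ssE | −0.0138 | 0.00531 | −2.59 | 0.0386 |
| CHARGE | ssH | −0.00785 | 0.00643 | −1.22 | |
| SASA | SASA | 0.544 | 0.111 | 4.92 | 3.46e-06 |
| SASA | ratefactor | −0.0409 | 0.0286 | −1.43 | 0.155 |
| SASA | contacts | 0.0295 | 0.0136 | 2.16 | 0.0331 |
| SASA | ssE | −0.00907 | 0.0398 | −0.228 | 0.959 |
| SASA | ssH | 0.00279 | 0.0482 | 0.0579 | |
| DISULFIDE | SASA | −6.53e-05 | 6.64e-05 | −0.984 | 0.328 |
| DISULFIDE | ratefactor | 4.5e-06 | 1.72e-05 | 0.262 | 0.794 |
| DISULFIDE | contacts | 5.38e-06 | 8.19e-06 | 0.657 | 0.513 |
| DISULFIDE | ssE | 7.83e-06 | 2.39e-05 | 0.327 | 0.587 |
| DISULFIDE | ssH | −2.11e-05 | 2.89e-05 | −0.73 | |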
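
The Table 3 caption states that the p-value for secondary structure is obtained by comparing the regression model with and without the secondary structure terms. The following is a minimal Python sketch of that kind of nested-model comparison for a single weight. It is illustrative only: the data are synthetic, the function names (`ols_rss`, `nested_f_test`) are hypothetical, and it is not the R code used for the analysis. It returns the F-statistic and its degrees of freedom; the p-value would then come from the upper tail of the corresponding F distribution.

```python
import random


def ols_rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    n, p = len(X), len(X[0])
    # Normal equations (X'X) beta = X'y, solved by Gaussian elimination.
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)] for i in range(p)]
    b = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    beta = [0.0] * p
    for i in reversed(range(p)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, p))) / A[i][i]
    return sum((y[k] - sum(X[k][j] * beta[j] for j in range(p))) ** 2 for k in range(n))


def nested_f_test(X_full, X_reduced, y):
    """F-statistic comparing a full model against a nested reduced model."""
    n, p_full = len(y), len(X_full[0])
    num_df = p_full - len(X_reduced[0])   # number of terms dropped
    denom_df = n - p_full                 # residual df of the full model
    rss_full, rss_reduced = ols_rss(X_full, y), ols_rss(X_reduced, y)
    f_stat = ((rss_reduced - rss_full) / num_df) / (rss_full / denom_df)
    return f_stat, num_df, denom_df


# Synthetic per-site data (assumption: 100 sites, coil as the baseline category).
rng = random.Random(0)
rows_full, rows_reduced, weight = [], [], []
for i in range(100):
    sasa, rate, contacts = rng.random(), rng.uniform(0.1, 3.0), float(rng.randint(2, 12))
    ss = "CEH"[i % 3]
    ss_e, ss_h = float(ss == "E"), float(ss == "H")
    base = [1.0, sasa, rate, contacts]            # intercept + continuous predictors
    rows_reduced.append(base)                     # model without secondary structure
    rows_full.append(base + [ss_e, ss_h])         # model with ssE and ssH dummies
    weight.append(0.2 - 0.1 * sasa + 0.15 * ss_h + rng.gauss(0.0, 0.05))

f_stat, num_df, denom_df = nested_f_test(rows_full, rows_reduced, weight)
print(f"F = {f_stat:.2f} on ({num_df}, {denom_df}) df")
```

Because the two dummy variables are tested jointly, a single p-value covers both the ssE and ssH rows, which is why Table 3 reports one value per model for secondary structure.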

To quantify the precision of our weight estimates, we estimated the variance of their sampling distributions by bootstrapping the homolog and decoy sequences. The weights for any given site are anti-correlated (i.e., they must sum to 1), and bootstrapping preserves this correlation structure. Figure 5 shows the estimated variances of the sampling distribution of the weight estimate for each term across all sites. All of the distributions of the variances are right-skewed with a high proportion of extremely small variances, indicating that in general the weights are estimated with high precision. This is further summarized in Figure 6, which shows, for each site, the average variance across all terms. Again, most of the sites have extremely small variances and therefore small means, but a few sites show slightly higher means.

Figure 5.

Histograms of the variances of the weights for each term and at each site show right-skewed distributions with high proportions of the variances being extremely small, indicating that the precision of weight estimates is generally high. Variances were obtained via bootstrapping with the unit of replication being the entire sequence and structure of the homologs and decoy structures.

Figure 6.

The landscape of means of the variances across terms is shown across all sites of the protein. The variance by alignment position indicates that no particular region of the protein is particularly poorly described.
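
The bootstrap described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation distributed with CASS: the per-site, per-term scores are synthetic, and `fit_site_weights` is a hypothetical placeholder estimator that only shares one property with the real procedure, namely that the weights at a site sum to 1. The point of the sketch is the resampling scheme, in which whole sequences (homologs and decoys) are the unit of replication, so all sites of a resampled sequence are drawn together.

```python
import random
import statistics

TERMS = ["bending", "LJ", "alpha", "beta", "charge", "SASA", "disulfide"]


def fit_site_weights(homolog_scores, decoy_scores):
    """Placeholder estimator (NOT the paper's optimizer): weight each term by how
    far the mean decoy score exceeds the mean homolog score, normalized to sum to 1."""
    gaps = []
    for t in range(len(TERMS)):
        decoy_mean = statistics.mean([d[t] for d in decoy_scores])
        homolog_mean = statistics.mean([h[t] for h in homolog_scores])
        gaps.append(max(decoy_mean - homolog_mean, 0.0) + 1e-9)
    total = sum(gaps)
    return [g / total for g in gaps]


def bootstrap_weight_variances(homologs, decoys, n_sites, n_boot=200, seed=1):
    """Variance of each weight estimate at each site across bootstrap replicates."""
    rng = random.Random(seed)
    replicates = [[[] for _ in TERMS] for _ in range(n_sites)]
    for _ in range(n_boot):
        # Resample entire sequences with replacement; sites are never shuffled
        # independently, so within-site correlation among weights is retained.
        h_sample = rng.choices(homologs, k=len(homologs))
        d_sample = rng.choices(decoys, k=len(decoys))
        for site in range(n_sites):
            w = fit_site_weights([h[site] for h in h_sample],
                                 [d[site] for d in d_sample])
            for t, value in enumerate(w):
                replicates[site][t].append(value)
    return [[statistics.variance(vals) for vals in site_reps] for site_reps in replicates]


# Synthetic data: each sequence is a list of sites, each site a list of term scores.
data_rng = random.Random(42)
n_sites = 5


def fake_scores(mean):
    return [[data_rng.gauss(mean, 1.0) for _ in TERMS] for _ in range(n_sites)]


homologs = [fake_scores(-1.0) for _ in range(30)]
decoys = [fake_scores(0.0) for _ in range(30)]

variances = bootstrap_weight_variances(homologs, decoys, n_sites)   # Figure 5 analogue
site_means = [statistics.mean(v) for v in variances]                # Figure 6 analogue
print(site_means)
```

Here `variances[site][term]` plays the role of the quantities summarized in Figure 5, and `site_means` the per-site averages summarized in Figure 6.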

Discussion

While our analyses here only consider protein folding, future studies will investigate protein-protein binding as well. Under an analogous site-specific scoring function with terms specific to protein-protein binding interactions, we will be able to better describe amino acid substitution patterns and their contributions to fitness with respect to these interactions. As currently existing common evolutionary substitution models do not contain any energy-based structural information, they are fundamentally unable to account for loss or retention of interactions due to any amino acid substitutions. We expect that our model will thus be a better predictor of substitution patterns for amino acids that reside in the interaction interfaces.

There is some concern regarding the high number of additional parameters that our site-specific regime introduces into the model, and it is reasonable to wonder whether the model is over-fit. While an ideal statistical justification of the number of parameters may be possible by formulating our model in a likelihood framework, formulating such a likelihood is not trivial in our setting. Future work will address this issue, but until then, we are unable to use traditional model-fit assessments that rely on likelihood comparisons. The parameters are added at the level of weight determination, which operates on changes in pseudo-energies rather than likelihoods. They do not directly contribute to the description of sequence changes, where likelihoods are well established.

The manner in which we select decoys is an important consideration as well. Currently, decoys are generated by randomizing contact maps. Although the distributions of the number and length of α-helices and β-sheets are set to follow the respective distributions in proteins of similar length in the PDB, which is an improvement over the unweighted randomization scheme mentioned previously, this procedure results in implicit decoys. An alternative approach would be to use explicit decoys, although these are more limited in their representation of structural alternatives than implicit decoys.

Additionally, the optimal approach for choosing the set of weights for each site is an issue that we aim to address in future work. Instead of using a background distribution based on decoys (either implicit or explicit), another approach would be to estimate the weights based on differences at the sequence level. Here, we would consider collections of known sequences, selected such that their phylogenetic relationship resembles a star tree, and that fold into a given conformation. We would then compare these with random sequences of the same length that are not expected to fold into that conformation, by threading these sequences through that conformation, and evaluating the scoring function for each group of sequences. Weights would then be optimized to maximize the difference between these two groups of sequences. It may be possible that this manner of estimating weights will result in better performance of the scoring function. It should be noted that the terms used are unweighted, resulting in potentially equal contributions to the global pseudo-energy from each site. This attribute of the model can be relaxed in future work.
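
As a rough illustration of the sequence-level alternative described above, the sketch below searches for weights that maximize the separation between native and random sequences threaded through the same structure. Everything specific here is an assumption made for illustration: the per-term scores are synthetic, the separation measure (a standardized gap between the two groups) is one possible choice rather than a method proposed in this work, and a simple random search over the weight simplex stands in for whatever optimizer would actually be used.

```python
import random
import statistics


def separation(weights, native_terms, random_terms):
    """Standardized gap between weighted scores of random and native sequences
    (assumed objective; lower pseudo-energy means a better fit)."""
    native = [sum(w * e for w, e in zip(weights, seq)) for seq in native_terms]
    rand = [sum(w * e for w, e in zip(weights, seq)) for seq in random_terms]
    return (statistics.mean(rand) - statistics.mean(native)) / statistics.stdev(rand)


def random_simplex_point(rng, k):
    """Draw k non-negative weights that sum to 1."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def optimize_weights(native_terms, random_terms, n_trials=2000, seed=0):
    """Random search for the weights that best separate the two groups."""
    rng = random.Random(seed)
    k = len(native_terms[0])
    best_w = [1.0 / k] * k                      # start from equal weights
    best_gap = separation(best_w, native_terms, random_terms)
    for _ in range(n_trials):
        w = random_simplex_point(rng, k)
        gap = separation(w, native_terms, random_terms)
        if gap > best_gap:
            best_w, best_gap = w, gap
    return best_w, best_gap


# Synthetic per-term scores at one site: only the first term discriminates.
data_rng = random.Random(7)
native_terms = [[data_rng.gauss(-5.0, 1.0), data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)]
                for _ in range(40)]
random_terms = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(40)]

uniform_gap = separation([1 / 3, 1 / 3, 1 / 3], native_terms, random_terms)
best_w, best_gap = optimize_weights(native_terms, random_terms)
print(best_w, best_gap)
```

In this toy setting the search shifts weight toward the one term that distinguishes native from random sequences, which is the behavior such an objective is meant to produce.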

Using structural decoys for a set of sequences has a basis in the protein folding problem: it seeks physical differentiators, for sequences known to fold into SH2 domains, between their native fold and structures they are known not to fold into. Here, the native sequences are threaded through both native and non-native structures to evaluate the relative fit of a sequence in different structures. Using sequence decoys for the set of homologs evaluated in the SH2 domain structure has a basis in the inverse folding problem: it seeks physical differentiators, in the context of the folded structure, between sequences known to fold into an SH2 structure and those not expected to. Here, native and random sequences are threaded through the native structure to evaluate the relative fit of different sequences in a single structure. While these considerations are distinct, both are logically justifiable, and future work will be needed to establish which works better with an evolutionary objective function.

Lastly, this work provides modular code for download that enables generation of these weights. While this code is not implemented in a phylogenetic framework for either simulation or inference, future work is envisioned that will enable this functionality. There are few evolutionary models for sequence change that consider protein structure in evaluating the probabilities of amino acid change, and this work reflects an improvement to the CASS software for sequence simulation, one of the few evolutionary sequence simulators with available software.

Conclusion

Many models for amino acid substitution that are intended to characterize the probabilities of mutation between amino acids and subsequent fixation do not incorporate information about protein structure and position context, and therefore do not predict substitution probabilities particularly well. It has been a goal to generate better, protein structure-aware models for amino acid substitution. The first generation of structure-aware models did not perform particularly well. Here, the performance of a simple structural model was improved by introducing a site-specific parameterization that fits the underlying pseudo-energies that generate the substitutions rather than the substitutions themselves, and by modifying the terms of the pseudo-energy calculation. This work reflects an important step that underpins the development of better protein structure-aware models and applications for use in molecular evolutionary and comparative genomic (phylogenetic) studies, towards predicting changes in protein structure and function over short sequence divergences under selective pressure.

Acknowledgments

This work was partly supported by NSF grant DBI-1355846 to DAL. The project described was also partly supported by an Institutional Development Award (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health under Grant # 2P20GM103432 to University of Wyoming.

References

  • 1. Anisimova M, Liberles DA, Philippe H, Provan J, Pupko T, von Haeseler A. State-of the art methodologies dictate new standards for phylogenetic analysis. BMC Evol. Biol. 2013;13:161. doi: 10.1186/1471-2148-13-161.
  • 2. Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 1994;39:306–314. doi: 10.1007/BF00160154.
  • 3. Pollock DD, Thiltgen G, Goldstein RA. Amino acid coevolution induces an evolutionary Stokes shift. Proc. Natl. Acad. Sci. USA. 2012;109:E1352–9. doi: 10.1073/pnas.1120084109.
  • 4. Miyazawa S, Jernigan RL. Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules. 1985;18:534–552.
  • 5. Bastolla U, Vendruscolo M, Knapp E. A statistical mechanical method to optimize energy functions for protein folding. Proc. Natl. Acad. Sci. USA. 2000;97:3977–3981. doi: 10.1073/pnas.97.8.3977.
  • 6. Kleinman CL, Rodrigue N, Lartillot N, Philippe H. Statistical potentials for improved structurally constrained evolutionary models. Mol. Biol. Evol. 2010;27:1546–1560. doi: 10.1093/molbev/msq047.
  • 7. Grahnen JA, Nandakumar P, Kubelka J, Liberles DA. Biophysical and structural considerations for protein sequence evolution. BMC Evol. Biol. 2011;11:361. doi: 10.1186/1471-2148-11-361.
  • 8. Grahnen JA, Liberles DA. CASS: Protein sequence simulation with explicit genotype-phenotype mapping. Trends Evol. Biol. 2012;4:e9.
  • 9. Chi PB, Liberles DA. Selection on protein structure, interaction, and sequence. Protein Sci. 2016;25:1168–1178. doi: 10.1002/pro.2886.
  • 10. Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 1994;11:715–724. doi: 10.1093/oxfordjournals.molbev.a040152.
  • 11. Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 1994;11:725–736. doi: 10.1093/oxfordjournals.molbev.a040153.
  • 12. Rodrigue N, Philippe H, Lartillot N. Mutation-selection models of coding sequence evolution with site-heterogeneous amino acid fitness profiles. Proc. Natl. Acad. Sci. USA. 2010;107:4629–4634. doi: 10.1073/pnas.0910915107.
  • 13. Tamuri AU, dos Reis M, Goldstein RA. Estimating the distribution of selection coefficients from phylogenetic data using sitewise mutation-selection models. Genetics. 2012;190:1101–1115. doi: 10.1534/genetics.111.136432.
  • 14. Meyer AG, Wilke CO. Integrating sequence variation and protein structure to identify sites under selection. Mol. Biol. Evol. 2013;30:36–44. doi: 10.1093/molbev/mss217.
  • 15. Tamuri AU, Goldman N, dos Reis M. A penalized-likelihood method to estimate the distribution of selection coefficients from phylogenetic data. Genetics. 2014;197:257–271. doi: 10.1534/genetics.114.162263.
  • 16. Rodrigue N, Lartillot N. Detecting adaptation in protein-coding genes using a Bayesian site-heterogeneous mutation-selection codon substitution model. Mol. Biol. Evol. 2017;34:204–214. doi: 10.1093/molbev/msw220.
  • 17. Halpern AL, Bruno WJ. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol. Biol. Evol. 1998;15:910–917. doi: 10.1093/oxfordjournals.molbev.a025995.
  • 18. Tavaré S, Balding DJ, Griffiths RC, Donnelly P. Inferring coalescence times from DNA sequence data. Genetics. 1997;145:505–518. doi: 10.1093/genetics/145.2.505.
  • 19. Liberles DA, Teufel AI, Liu L, Stadler T. On the need for mechanistic models in computational genomics and metagenomics. Genome Biol Evol. 2013;5:2008–2018. doi: 10.1093/gbe/evt151.
  • 20. Mukherjee A, Bagchi B. Correlation between rate of folding, energy landscape, and topology in the folding of a model protein HP-36. J. Chem. Phys. 2003;118:4733–4747.
  • 21. Pace CN, Scholtz JM. A helix propensity scale based on experimental studies of peptides and proteins. Biophys. J. 1998;75:422–427. doi: 10.1016/s0006-3495(98)77529-0.
  • 22. Bryngelson JD, Wolynes PG. Spin glasses and the statistical mechanics of protein folding. Proc. Natl. Acad. Sci. USA. 1987;84:7524–7528. doi: 10.1073/pnas.84.21.7524.
  • 23. Rose PW, Prlić A, Altunkaya A, Bi C, Bradley AR, Christie CH, Di Costanzo L, Duarte JM, Dutta S, Feng Z, Kramber Green R, Goodsell DS, Hudson B, Kalro T, Lowe R, Peisach E, Randle C, Rose AS, Shao C, Tao Y-P, Valasatava Y, Voigt M, Westbrook JD, Woo J, Yang H, Young JY, Zardecki C, Berman HM, Burley SK. The RCSB protein data bank: integrative view of protein, gene and 3D structural information. Nucl. Acids. Res. 2017;45:D271–D281. doi: 10.1093/nar/gkw1000.
  • 24. Grahnen JA, Kubelka J, Liberles DA. Fast side chain replacement in proteins using a coarse-grained approach for evaluating the effects of mutation during evolution. J. Mol. Evol. 2011;73:23–33. doi: 10.1007/s00239-011-9454-3.
  • 25. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2016 https://www.R-project.org/
  • 26. Robinson D. broom: Convert Statistical Analysis Objects into Tidy Data Frames. R package version 0.4.2. 2017 https://CRAN.R-project.org/package=broom.
  • 27. Wickham H, Francois R. dplyr: A Grammar of Data Manipulation. R package version 0.5.0. 2016 https://CRAN.R-project.org/package=dplyr.
  • 28. Wickham H. tidyr: Easily Tidy Data with 'spread()' and 'gather()' Functions. R package version 0.6.1. 2017 https://CRAN.R-project.org/package=tidyr.
  • 29. Fox J, Weisberg S. An R Companion to Applied Regression. Second edition. Thousand Oaks CA: Sage; 2011. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion.
  • 30. Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 1999;292:195–202. doi: 10.1006/jmbi.1999.3091.
  • 31. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22:2577–2637. doi: 10.1002/bip.360221211.
  • 32. Touw WG, Baakman C, Black J, te Beek TAH, Krieger E, Joosten RP, Vriend G. A series of PDB-related databanks for everyday needs. Nucl. Acids. Res. 2015;43:D364–D368. doi: 10.1093/nar/gku1028.
  • 33. Dutheil J, Boussau B. Non-homogeneous models of sequence evolution in the Bio++ suite of libraries and programs. BMC Evol. Biol. 2008;8:255. doi: 10.1186/1471-2148-8-255.
  • 34. Le SQ, Gascuel O. An improved general amino acid replacement matrix. Mol. Biol. Evol. 2008;25:1307–1320. doi: 10.1093/molbev/msn067.
  • 35. Tien MZ, Meyer AG, Sydykova DK, Spielman SJ, Wilke CO. Maximum allowed solvent accessibilities of residues in proteins. PLoS ONE. 2013;8:e80635. doi: 10.1371/journal.pone.0080635.
  • 36. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucl. Acids. Res. 2017;45:D37–D42. doi: 10.1093/nar/gkw1070.
  • 37. Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A. The Pfam protein families database: towards a more sustainable future. Nucl. Acids. Res. 2016;44:D279–85. doi: 10.1093/nar/gkv1344.
  • 38. Cantor CR, Schimmel PR. Biophysical Chemistry. New York: W.H. Freeman and Company; 1980.
  • 39. Echave J, Wilke CO. Biophysical Models of Protein Evolution: Understanding the Patterns of Evolutionary Sequence Divergence. Annu. Rev. Biophys. 2017 doi: 10.1146/annurev-biophys-070816-033819. [Epub ahead of print]
  • 40. Zhang J, Yang JR. Determinants of the rate of protein sequence evolution. Nat. Rev. Genet. 2015;16:409–20. doi: 10.1038/nrg3950.
