Substitution rate variability among sites is a common feature of protein evolution. This variability is frequently modeled by a gamma (Γ) distribution (1), which de la Paz et al. (2) proposed to be an emergent property of epistasis. However, their conclusion is based on the analysis of highly divergent sequences, with as many as 24 substitutions per site. We examined whether this conclusion holds for more biologically realistic sequence divergences (0.1 to 5 substitutions per site) (3). We found instead that the creation of a class of invariant sites is an emergent property of epistasis, and that the level of sequence divergence dictates whether an equal- or a variable-rate model best fits the data.
In de la Paz et al. (2), neutral molecular evolution was simulated for protein families with a fitness function defined by a model inferred via direct coupling analysis (e.g., Fig. 1A). Using their generation-by-generation simulated sequences, we directly counted the number of sites that experienced zero, one, two, three, or more substitutions (e.g., Fig. 1B), which avoids the confounding effects of estimation errors. This site frequency spectrum (SFS) will follow a Poisson distribution when the data are compatible with a single (S) evolutionary rate across sites (4), and a negative binomial distribution if the evolutionary rates among sites are Γ distributed (1). We tested model fits to the SFS at different levels of sequence divergence, allowing for a category of invariant sites (I+) with the S and Γ models (5). For each dataset, the best-fit model among four candidates (S, I + S, Γ, and I + Γ) was selected using the corrected Akaike information criterion (AICc) because these models are not all nested (6, 7).
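The model comparison described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the analysis: the function names (`site_loglik`, `fit_model`, `select_best`), the key "G" standing for the Γ model, the mean/shape parameterization of the negative binomial, and the use of Nelder-Mead optimization are all choices made here for clarity. The sketch fits the four candidates to a vector of per-site substitution counts by maximum likelihood and ranks them by AICc = 2k − 2 ln L + 2k(k + 1)/(n − k − 1), where k is the number of free parameters and n is the number of sites.

```python
import numpy as np
from scipy import optimize, stats
from scipy.special import expit  # logistic function, maps reals to (0, 1)

# Free-parameter counts; "G" stands for the Γ (negative binomial) model.
N_PARAMS = {"S": 1, "I+S": 2, "G": 2, "I+G": 3}


def site_loglik(counts, model, theta):
    """Per-site log-probabilities of substitution counts under one model.

    theta is unconstrained: log(mean), then log(shape) for the Γ models,
    then logit(invariant fraction) for the I+ models.
    """
    mean = np.exp(theta[0])
    if model in ("S", "I+S"):
        base = stats.poisson.logpmf(counts, mean)
        used = 1
    else:
        shape = np.exp(theta[1])
        # scipy's nbinom takes (n, p); this choice of p gives the requested mean.
        base = stats.nbinom.logpmf(counts, shape, shape / (shape + mean))
        used = 2
    if model.startswith("I+"):
        pi = np.clip(expit(theta[used]), 1e-12, 1 - 1e-12)
        # Zero inflation: a zero count arises from the invariant class
        # (probability pi) or from the base distribution (probability 1 - pi).
        from_invariant = np.where(counts == 0, np.log(pi), -np.inf)
        base = np.logaddexp(from_invariant, np.log1p(-pi) + base)
    return base


def fit_model(counts, model):
    """Maximum-likelihood fit of one model; returns log likelihood and AICc."""
    counts = np.asarray(counts)
    k, n = N_PARAMS[model], counts.size
    start = np.zeros(k)                      # shape = 1, invariant fraction = 0.5
    start[0] = np.log(counts.mean() + 1e-9)  # start the mean at the sample mean
    res = optimize.minimize(
        lambda th: -site_loglik(counts, model, th).sum(),
        start,
        method="Nelder-Mead",
        options={"maxiter": 5000, "xatol": 1e-6, "fatol": 1e-6},
    )
    aicc = 2 * k + 2 * res.fun + 2 * k * (k + 1) / (n - k - 1)
    return {"loglik": -res.fun, "aicc": aicc, "theta": res.x}


def select_best(counts):
    """Fit all four candidates and return (name of lowest-AICc model, all fits)."""
    fits = {m: fit_model(counts, m) for m in N_PARAMS}
    return min(fits, key=lambda m: fits[m]["aicc"]), fits
```

Calling `select_best(counts)` on an integer array with one entry per site returns the name of the preferred model together with the fitted log likelihoods and AICc values, so AICc differences between models can be read directly from the second return value.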
Fig. 1.
Impact of epistasis on models of rates of neutral molecular evolution. (A) Coupling coefficients between 268 sites in the PF00001 family of protein domains. For each pair of sites, the Frobenius norm of coupling coefficients for all residue combinations is shown, with darker, hotter colors representing a stronger overall coupling between the pair. (B) Distribution of the number of substitutions per site in one dataset of PF00001 when the expectation was two substitutions per site. The I + S model (blue) has a higher log likelihood than the Γ model (green). In the I + S model, the “invariant” class contains ∼20% of the sites (black hatched). (C) Fraction of PF00001 datasets in which a given model fits best, as determined using AICc (6, 7). The mean difference in AICc between the Γ model and the best-fitting model is shown in parentheses below the x axis. Each bar corresponds to the analysis of datasets with a given evolutionary divergence, in units of the expected number of substitutions per site. Each evolutionary divergence analysis comprises 50 datasets generated from the first 10,000 generations of simulated data provided by de la Paz et al. (2), with the first dataset sampled at generation 2,000, after sequence Hamiltonians reached an equilibrium state (figure S1 in ref. 2). Subsequent datasets were sampled with the starting generation increasing by 50 each time (i.e., step size = 50 generations). The I + S and I + Γ models were tested through the corresponding zero-inflated Poisson and zero-inflated negative binomial models, respectively. Sites that were not mutated during the simulation were not considered invariant, as they were not subjected to evolutionary pressures.
The S model fits best for the vast majority of datasets at sequence divergences below 0.5 substitutions per site (Fig. 1C). That is, epistasis did not create significant evolutionary rate variability among sites at low sequence divergences. For divergences larger than 0.5 substitutions per site, models allowing for an invariant class of sites (I + S and I + Γ) provide the best fits (Fig. 1C). Similar patterns are observed for the other nine protein domain families analyzed by de la Paz et al. (2) (Fig. 2). The Γ model alone was the dominant model in only one case (Fig. 2).
Fig. 2.
Best-fitting models by sequence divergence for all protein families. Fraction of datasets in which a single-rate model (S), a variable-rate model (Γ), or a model with an invariant class (“I+”; I + S or I + Γ) best fits the observed SFS (e.g., Fig. 1B). The number of amino acid (AA) sites simulated and analyzed in each protein family is shown.
Therefore, at sufficient evolutionary distances, epistasis generates many invariant sites and substitution rate variability, likely because incompatible substitutions at coupled sites are under greater negative selection than substitutions at loosely coupled and uncoupled sites. At relatively low divergences, fewer coupled sites are mutated, likely resulting in less differential negative selection among sites. As a result, even a single-rate model may fit the data well. Overall, our results suggest that epistasis provides a mechanistic explanation for the abundance of invariant sites beyond what is explained by an equal (S)- or a variable (Γ)-rate model. Consequently, models incorporating a separate class of invariant sites best describe evolutionary rates of proteins.
Acknowledgments
This work was supported by NSF Grant 934848 and NIH Grant GM139540.
Footnotes
The authors declare no competing interest.
References
- 1. Uzzell T., Corbin K. W., Fitting discrete probability distributions to evolutionary events. Science 172, 1089–1096 (1971).
- 2. de la Paz J. A., Nartey C. M., Yuvaraj M., Morcos F., Epistatic contributions promote the unification of incompatible models of neutral molecular evolution. Proc. Natl. Acad. Sci. U.S.A. 117, 5873–5882 (2020).
- 3. Luz H., Vingron M., Family specific rates of protein evolution. Bioinformatics 22, 1166–1171 (2006).
- 4. Zuckerkandl E., Pauling L., “Evolutionary divergence and convergence in proteins” in Evolving Genes and Proteins, Bryson V., Vogel H. J., Eds. (Academic Press, 1965), pp. 97–166.
- 5. Gu X., Fu Y. X., Li W. H., Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol. Biol. Evol. 12, 546–557 (1995).
- 6. Cavanaugh J. E., Unifying the derivations for the Akaike and corrected Akaike information criteria. Stat. Probab. Lett. 33, 201–208 (1997).
- 7. Posada D., Buckley T. R., Model selection and model averaging in phylogenetics: Advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Syst. Biol. 53, 793–808 (2004).


