PLOS ONE. 2021 Mar 12;16(3):e0248337. doi: 10.1371/journal.pone.0248337

Extra base hits: Widespread empirical support for instantaneous multiple-nucleotide changes

Alexander G Lucaci 1,#, Sadie R Wisotsky 1,#, Stephen D Shank 1,1, Steven Weaver 1, Sergei L Kosakovsky Pond 1,*
Editor: Marc Robinson-Rechavi 2
PMCID: PMC7954308  PMID: 33711070

Abstract

Despite many attempts to introduce evolutionary models that permit substitutions to instantly alter more than one nucleotide in a codon, the prevailing wisdom remains that such changes are rare and generally negligible, or that they reflect non-biological artifacts such as alignment errors. Codon models continue to posit that only single-nucleotide changes have non-zero rates. Here, we develop and test a simple hierarchy of codon-substitution models with non-zero evolutionary rates for only one-nucleotide (1H), one- and two-nucleotide (2H), or any (3H) codon substitutions. Using over 42,000 empirical alignments, we find widespread statistical support for multiple hits: 61% of alignments prefer models with 2H allowed, and 23% prefer models with 3H allowed. Analyses of simulated data suggest that these results are not likely to be due to simple artifacts such as model misspecification or alignment errors. Further modeling reveals that synonymous codon island jumping among codons encoding serine, especially along short branches, contributes significantly to this 3H signal. While serine codons were prominently involved in multiple-hit substitutions, there were other common exchanges contributing to better model fit. It appears that a small subset of sites in most alignments have unusual evolutionary dynamics not well explained by existing model formalisms, and that commonly estimated quantities, such as dN/dS ratios, may be biased by model misspecification. Our findings highlight the need for continued evaluation of assumptions underlying workhorse evolutionary models and subsequent evolutionary inference techniques. We provide a software implementation for evolutionary biologists to assess the potential impact of extra base hits in their data in the HyPhy package and in the Datamonkey.org server.

Introduction

Most modern codon models in widespread use assume that any changes within a codon happen as a sequence of single instantaneous nucleotide changes, enforced by setting instantaneous rates between codons that differ in more than one nucleotide to zero. This choice was made independently for the mechanistic models of Muse and Gaut [1] and Goldman and Yang [2], and adopted by subsequent model developers and model users. For example, when Halpern and Bruno [3] introduced their mutation-selection models, they considered the general multi-hit (MH) case first, but then largely abandoned it, noting that the single hit reduction “..has very little effect on our results under the conditions we have investigated.” This assumption is both computationally convenient and biologically sound in the majority of cases, since randomly occurring mutations “hitting” the same codon is a negligibly rare event. While these events are indeed rare, evidence for substitutions occurring in tandem at adjacent nucleotide sites had been reported at about the same time the codon models were being introduced [4].

Averof et al [5] reported significant rates of changes between TCN and AGY codon islands in perfectly conserved serine residues, and argued against going through intermediary non-synonymous changes due to their likely deleterious effects. Rogozin et al [6] took the opposite view, namely that strong purifying selection on single nucleotide changes is a more plausible explanation for such island hops in general.

Neither of those studies had considered an explicit evolutionary model, however. Serine is the only amino-acid with synonymous codon islands in the universal genetic code, but several other codes have other amino acids with this property: leucine in the Chlorophycean and Scenedesmus obliquus mitochondrial codes (TAG and CTH), and alanine in the Pachysolen tannophilus nuclear code (CTG and GCH).

Recent studies estimated that 2% of nucleotide substitutions are part of larger multiple nucleotide changes that occur simultaneously [7, 8], due in part to an error-prone DNA polymerase zeta. Human germline tandem mutations may constitute up to 0.4% of all mutations [9], and individual cases of such mutations have significant phenotypic consequences, e.g., via their effects on protein folding [10].

A number of codon model extensions have incorporated MH, invariably finding improvement in fit and (if the model allowed testing) statistically significant evidence of non-zero rates involving multiple nucleotide changes. Kosiol et al [11] developed a general MH empirical codon substitution model estimated jointly from a large collection of training alignments, and noted that it was overwhelmingly preferred to standard SH models on a sample of biological data from the PANDIT database.

Several groups have independently created alternative codon model parameterizations to allow for MH, including Whelan and Goldman [12] (“… these events [MH] are far more prevalent than previously thought”), Zaheri et al [13], and Dunn et al [14] (the latter two studies show a dramatically better model fit to empirical alignments when allowing MH). Other studies have used evolutionary models with varying degrees of accommodation for multiple hits [15–18]. Jones et al [19] implemented a complex model to detect adaptive evolution with a discrete-state phenotype, allowing for double and triple mutations to be absorbed into the parameter estimates. Despite multiple introductions to the field, these models have been unable to gain traction in applied evolutionary analyses, and for some of these methods, software implementing them is no longer available.

Failure to include multiple hits in codon substitution models may mislead evolutionary hypothesis testing. Venkat et al [20] found that the addition of a double-hit rate parameter improved model fit and impacted branch-specific inferences of positive selection (MH along short branches can inflate false positives). Dunn et al [14] used principled simulation studies to show that fitting 1H models to data generated with low rates of multiple hits can increase false positive rates and dilute power for identifying individual sites subject to positive selection.

In this study, we develop simple extensions to the Muse-Gaut [1] codon model which add double and triple instantaneous changes (2H and 3H, respectively), and compare them to simpler models in large collections of empirical data. Our models are mechanistic and simpler than those proposed by Whelan and Goldman [12] and Dunn et al [14]. This relative simplicity allows our models to be implemented and fitted quickly, and offers straightforward interpretation, including the ability to identify individual sites that benefit from the addition of MH. The primary goals of our data analyses are to establish how often evidence for multiple hits can be detected in large-scale empirical databases (something that no other study looking at evolutionary models has done), to identify the codons that are frequently involved in such events, and to explore plausible biological explanations for why these rates are non-zero for a majority of alignments.

Materials and methods

Substitution models

The most general model considered here is the 3H+ substitution model; all others can be derived from it as special cases (see Table 1 for key parameter notation). The model is a straightforward extension of the Muse-Gaut style of time-reversible, continuous-time Markov process models [1]. In this study we compare five models:

Table 1. Key parameters of our models.

Estimation strategies for each parameter for the five different models are shown.

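| Parameter | Description | 1H | 2H | 3HSI | 3H | 3H+ |
|---|---|---|---|---|---|---|
| ωi | Site-specific dN/dS ratio | Random effect (3-bin GDD) | Random effect (3-bin GDD) | Random effect (3-bin GDD) | Random effect (3-bin GDD) | Random effect (3-bin GDD) |
| δ | Global 2H/1H rate ratio | 0 | Estimated | Estimated | Estimated | Estimated |
| ψs | Global 3H/1H rate ratio for synonymous codon islands | 0 | 0 | Estimated | = ψ | Estimated |
| ψ | Global 3H/1H rate ratio | 0 | 0 | 0 | Estimated | Estimated |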

Abbreviations: GDD = general discrete distribution; 1H, 2H, 3H = instantaneous changes involving one, two, or three nucleotides.

  • 1H

    is the standard Muse-Gaut style model which only permits single nucleotides to substitute instantaneously. Non-synonymous changes occur at rate ω (relative to the synonymous rate), and this rate varies from site to site according to a three-bin general discrete distribution (GDD).

  • 2H

    is the 1H model extended to allow two nucleotides in a codon to substitute instantaneously with rate δ (relative to 1H synonymous rate).

  • 3HSI

    is the 2H model extended to allow three nucleotides in a codon to substitute instantaneously if the change is synonymous (e.g., serine islands), with relative rate ψs.

  • 3H

    is the 2H model extended to also permit any three-nucleotide substitutions, with relative rate ψ.

  • 3H+

    is the 3HSI model extended to also permit any three-nucleotide substitutions, with relative rate ψ.

All codon substitutions in these models fall into one of six categories defined by (i) whether they are synonymous or non-synonymous, and (ii) how many nucleotides are being replaced (1, 2, or 3). The instantaneous rate expressions for substitutions between codons i and j (i ≠ j) for these six classes, and how many of all possible codon substitutions are in each class, are shown in Table 2. In addition to the key model parameters defined in Table 1, the model contains a number of other, standard parameters, which are not the main focus of inference and can be viewed, for the most part, as nuisance parameters. They include θij, nucleotide-level biases coming from the general time reversible model (5 parameters), and πj, codon-position specific nucleotide frequencies estimated from counts using the CF3x4 procedure [21]. ωk are non-synonymous / synonymous rate ratios which vary from site to site using a random effect (D-bin general discrete distribution, D = 3 by default, 2D − 1 parameters). The key parameters are global relative rates of multiple-hit substitutions: δ is the rate for 2H substitutions relative to the synonymous 1H rate (baseline), ψ is the relative rate for non-synonymous 3H substitutions, and ψs is the relative rate for synonymous 3H substitutions. All parameters except π, including branch lengths, are fitted by direct optimization of the phylogenetic likelihood in HyPhy [22]. Initial estimates for branch lengths and θ are obtained using the standard nucleotide general time reversible model. Following this initialization, models are fitted in order of increasing complexity (1H, then 2H, then 3HSI, then 3H+), using parameter estimates from each stage as initial points for the next stage.

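Table 2. Expressions for different types of substitutions in the model rate matrix.

| Type | Expression for $q_{ij}$ | Example | Count (Universal) | Count (mtDNA) |
|---|---|---|---|---|
| 1H synonymous | $\theta_{ij}\pi_{j}$ | ACA→ACT: $\theta_{CT}\pi_{T}^{3}$ | 134 | 128 |
| 1H non-synonymous | $\omega_k\theta_{ij}\pi_{j}$ | AAA→AGA: $\omega_k\theta_{AG}\pi_{G}^{2}$ | 392 | 380 |
| 2H synonymous | $\delta\prod_{n=1}^{2}\theta_{ij}^{n}\pi_{j}^{n}$ | CTC→TTA: $\delta\theta_{CT}\theta_{AC}\pi_{T}^{1}\pi_{A}^{3}$ | 28 | 16 |
| 2H non-synonymous | $\delta\omega_k\prod_{n=1}^{2}\theta_{ij}^{n}\pi_{j}^{n}$ | AAA→ACC: $\delta\omega_k\theta_{AC}\theta_{AC}\pi_{C}^{2}\pi_{C}^{3}$ | 1540 | 1500 |
| 3H synonymous | $\psi_s\prod_{n=1}^{3}\theta_{ij}^{n}\pi_{j}^{n}$ | AGC→TCA: $\psi_s\theta_{AT}\theta_{CG}\theta_{AC}\pi_{T}^{1}\pi_{C}^{2}\pi_{A}^{3}$ | 12 | 12 |
| 3H non-synonymous | $\psi\omega_k\prod_{n=1}^{3}\theta_{ij}^{n}\pi_{j}^{n}$ | GTG→TAC: $\psi\omega_k\theta_{GT}\theta_{AT}\theta_{CG}\pi_{T}^{1}\pi_{A}^{2}\pi_{C}^{3}$ | 1554 | 1504 |

Six cases for instantaneous rates $q_{ij}$ of substituting codon i with codon j (i ≠ j). Superscripts on θ and π index the codon position being changed. The “Count” columns show the number of rate matrix entries in each class (excluding the diagonal) for two commonly used genetic codes.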
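The six classes can be made concrete with a short sketch. The code below is illustrative only and is not the HyPhy implementation: the function names (`classify`, `class_counts`, `rate`) and the data layout for θ and π are assumptions chosen for readability. It classifies ordered pairs of sense codons under the universal genetic code and evaluates a rate entry following the six cases in Table 2.

```python
from itertools import product

BASES = "TCAG"
# Universal genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c, aa in CODE.items() if aa != "*"]


def classify(i, j):
    """Return (number of differing nucleotides, is_synonymous) for codons i != j."""
    hits = sum(a != b for a, b in zip(i, j))
    return hits, CODE[i] == CODE[j]


def class_counts():
    """Count ordered sense-codon pairs (i != j) falling into each of the six classes."""
    counts = {}
    for i in SENSE:
        for j in SENSE:
            if i != j:
                key = classify(i, j)
                counts[key] = counts.get(key, 0) + 1
    return counts


def rate(i, j, theta, pi, omega, delta, psi, psi_s):
    """Rate entry q_ij following the six cases of Table 2 (hypothetical data layout).

    theta: dict mapping frozenset({x, y}) to the symmetric nucleotide bias
    pi:    list of three dicts; pi[n][x] is the frequency of nucleotide x
           at codon position n (0-based)
    """
    hits, synonymous = classify(i, j)
    q = 1.0
    for n, (a, b) in enumerate(zip(i, j)):
        if a != b:  # only changed positions contribute theta * pi factors
            q *= theta[frozenset((a, b))] * pi[n][b]
    if hits == 2:
        q *= delta
    elif hits == 3:
        q *= psi_s if synonymous else psi
    if not synonymous:
        q *= omega
    return q
```

Calling `class_counts()` gives the number of ordered pairs per class, which can be compared against the "Universal" column of Table 2; setting `delta = psi = psi_s = 0` in `rate` recovers the 1H model.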

Site-level support for MH

In order to identify which individual sites show preference for MH models, we use evidence ratios (ER), defined as the ratio of site likelihoods under two models being compared, e.g., $\mathrm{ER}(\text{2H:1H}) = \frac{L(s_i \mid \text{2H})}{L(s_i \mid \text{1H})}$. We previously showed that ER are useful for identifying the sites driving support for one model over another [23], and they incur trivial additional overhead to compute once model fits have been performed.
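As a minimal sketch of this calculation, assuming per-site log-likelihoods are available as plain lists (the function names and input layout here are hypothetical, not the fmm output format), evidence ratios can be computed by exponentiating the difference in site log-likelihoods:

```python
import math


def evidence_ratios(site_lnl_alt, site_lnl_null):
    """Per-site ER = L(s_i | alt) / L(s_i | null), computed from site log-likelihoods."""
    if len(site_lnl_alt) != len(site_lnl_null):
        raise ValueError("both models must report one log-likelihood per site")
    return [math.exp(a - b) for a, b in zip(site_lnl_alt, site_lnl_null)]


def sites_above(ers, threshold=5.0):
    """1-based indices of sites whose evidence ratio exceeds the threshold."""
    return [k + 1 for k, er in enumerate(ers) if er > threshold]
```

Working with differences of log-likelihoods avoids underflow that could occur if the raw site likelihoods were divided directly.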

Hypothesis testing

Nested models are compared using likelihood ratio tests, with the $\chi^2_d$ asymptotic distribution used to assess significance. The degrees of freedom parameter (d) is as follows: d = 1 for the 2H:1H, 3HSI:2H, and 3H+:3HSI comparisons; d = 2 for the 3H+:2H comparison; d = 3 for the 3H+:1H comparison.
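The test can be sketched with the standard library alone, since the upper-tail probabilities of the chi-square distribution have closed forms for d = 1, 2, 3. The function names and the `DF` lookup table below are illustrative assumptions and are not part of HyPhy.

```python
import math

# Degrees of freedom for each (alternative, null) comparison, as listed in the text.
DF = {
    ("2H", "1H"): 1,
    ("3HSI", "2H"): 1,
    ("3H+", "3HSI"): 1,
    ("3H+", "2H"): 2,
    ("3H+", "1H"): 3,
}


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate, closed forms for df in {1, 2, 3}."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    raise ValueError("only df = 1, 2, 3 are needed for these comparisons")


def lrt_pvalue(lnl_alt, lnl_null, alt, null):
    """Return (LRT statistic, p-value) for a nested pair of fitted models."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))  # guard against tiny negative values
    return stat, chi2_sf(stat, DF[(alt, null)])
```

For example, `lrt_pvalue(-1000.0, -1005.0, "2H", "1H")` corresponds to a test statistic of 10 on one degree of freedom.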

How do our models relate to previously published multiple-hit models?

The BS+MNM model [20] was designed for testing subsets of branches for episodic diversifying selection. It is very similar to our 2H model, except that θij in their model follows the HKY85 parameterization, it is possible to allow κ (transition/transversion ratio) to be different between 1H and 2H changes, and target codon frequencies are used in qij [2]. The Empirical Codon Model or ECM [11] directly estimates numerical rates for all pairs of codon exchanges in the GY94 frequency framework from a large training dataset. The SDT model [12] uses a context-averaging approach to include the effect of substitutions that span codon boundaries, and is difficult to directly relate to our models; the 3H model might be the closest to the SDT model. Regrettably, there does not seem to exist a working implementation of the SDT model (pers. comm. from Simon Whelan), which makes direct comparison to our approaches impractical. The KCM model [13] only has a single rate for multiple hits (double or triple), and has position-specific nucleotide substitution rates (θ in our notation), so it would be most comparable to the 3H model with δ = ψ. The GPP model class [14] can be parametrized to recapitulate our models because it can capture (in a log-linear parametric form) arbitrary rate matrices with suitable parametric complexity. Several of the models in the GPP class include multiple hits, but they are not directly comparable to ours, mostly because they also incorporate ω rates that depend on physicochemical properties of amino acids, and because the exact parametric form of the models is hard to glean from the available description.

Empirical data

The Moretti et al, 2014 (Selectome) data collection consists of 13,714 gene alignments from the Euteleostomi clade of Bony Vertebrates from Version 6 of the database [24] and can be downloaded from data.hyphy.org/web/busteds/. The Shultz et al data collection [25] contains 11,262 orthologous protein coding genes from 39 different species of birds and is freely available at https://datadryad.org/stash/dataset/doi:10.5061/dryad.kt24554. The Enard et al data collection [26] includes 9,861 orthologous coding sequence alignments of 24 mammalian species and is available at https://datadryad.org/stash/dataset/doi:10.5061/dryad.fs756. Our mtDNA dataset consists of both invertebrate and vertebrate Metazoan orders with mitochondrial gene alignments. This dataset was originally published in Mannino et al [27], and can be found at https://github.com/srwis/variancebound.

Simulated data

We generated simulated alignments of two sequences in HyPhy using the SimulateMG94 package from https://github.com/veg/hyphy-analyses/. These alignments were simulated under the 1H model (no site-to-site rate variation and no multiple hits), with varying sequence and branch lengths, and with ω varied between scenarios but constant across sites. We created 1000 simulation scenarios to capture a range of important model parameters and drew 5 replicates per scenario. ω was drawn uniformly from U(0.01, 2.0), branch length was drawn from Exp(U(0.01, 1.0)), and the number of codons was drawn uniformly as an integer from 100 to 50000. Parameter values were sampled using the Latin Hypercube approach to improve parameter space coverage.
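The Latin Hypercube idea can be sketched as follows. This is not the script used in the study: the function names are hypothetical, and the branch-length dimension only reproduces the inner U(0.01, 1.0) draw, because the exact form of the outer Exp(·) transform is not spelled out above.

```python
import random


def latin_hypercube(n, dims, rng):
    """Return n points in [0, 1)^dims with exactly one point per stratum in each dimension."""
    columns = []
    for _ in range(dims):
        strata = list(range(n))
        rng.shuffle(strata)  # independent random pairing of strata across dimensions
        columns.append([(s + rng.random()) / n for s in strata])
    return list(zip(*columns))


def scenarios(n=1000, seed=0):
    """Map unit-cube samples onto the parameter ranges quoted in the text."""
    rng = random.Random(seed)
    out = []
    for u_omega, u_branch, u_len in latin_hypercube(n, 3, rng):
        out.append({
            "omega": 0.01 + u_omega * (2.0 - 0.01),
            # Assumption: inner uniform draw only; the Exp(.) step is left to the reader.
            "branch_u": 0.01 + u_branch * (1.0 - 0.01),
            "codons": 100 + int(u_len * (50000 - 100 + 1)),
        })
    return out
```

Compared with independent uniform draws, each parameter's range is split into n equal strata and every stratum is sampled exactly once, which is what improves coverage of the parameter space.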

Multiple sequence simulations were based on the fits to one of four benchmark datasets: Drosophila adh, Hepatitis D antigen, HIV vif, and the Vertebrate rhodopsin data. We took all model parameters estimated under the 3H+ model as the starting point, and generated 500 replicates per dataset, of which 35% were null (1H), 10% each were from 2H, 3HSI or restricted 3H+ (ψs = 0), and 35% were from 3H+. The δ, ψ and ψs parameters, when allowed to be non-zero by the model, were sampled from U(0, 1), U(0, 1), and U(0, 10), respectively.

Sequences with indel rate variation were generated using INDELible v1.03 [28]. We varied indel rates from 0.01 to 0.06 in increments of 0.005 (100 replicates per value), and modeled site-to-site rate variation with a 3-bin M3 model.

Implementation

All analyses were performed in HyPhy version 2.5.1 or later [22]. The fmm (FitMultihitModel) module used to fit the standard 1H model along with 2H, 3H and 3HSI versions is available from https://github.com/veg/hyphy-analyses/, is part of the standard library (invoked with hyphy fmm) in HyPhy version 2.5.7 or later, and is also available on our datamonkey.org server [29]. The result of an fmm analysis is a JSON file which can be visualized using a web application at http://vision.hyphy.org/multihit.

Results

Benchmark alignments

We introduce the models using a collection of thirteen representative alignments (Table 3) that we and others have recently used to benchmark selection analyses [30]. We also include a primate lysozyme alignment originally analyzed with early codon models by Yang [31]. We consider five models (see Table 1 and the methods section for details), which form a nested hierarchy (with the exception of 3HSI and 3H, which are not nested), each with one additional alignment-wide parameter.

Table 3. Analysis of 13 benchmark datasets for evidence of multi-nucleotide substitutions.

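| Gene | N | S | T | δ | ψs | ψ | LRT p-value 2H:1H | LRT p-value 3H+:1H | LRT p-value 3H+:2H | LRT p-value 3H+:3HSI | LRT p-value 3HSI:2H | # sites with ER > 5 (2H:1H) | # sites with ER > 5 (3H+:2H) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| β-globin | 17 | 144 | 2.5 | 0.7 (0.81) | >100 | 0 | **<0.001** | **<0.001** | **<0.001** | 1 | **<0.001** | 10 | 6 |
| Flavivirus NS5 | 18 | 342 | 6 | 0.49 (0.73) | 2.3 | 0.6 | **<0.001** | **<0.001** | 0.056 | 0.062 | 0.13 | 16 | 0 |
| Primate Lysozyme | 19 | 130 | 0.24 | 0 (0) | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| COXI | 21 | 510 | 5.3 | 0.4 (0.4) | 0 | 0 | **<0.001** | **0.0018** | 1 | 0.94 | 0.98 | 3 | 0 |
| Drosophila adh | 23 | 254 | 1.4 | 0.31 (0.4) | 0 | 0.42 | **<0.001** | **<0.001** | 0.19 | 0.067 | 0.99 | 4 | 0 |
| Encephalitis env | 23 | 500 | 0.84 | 0.076 (0.076) | 0 | 0 | 0.19 | 0.42 | 1 | 1 | 0.98 | 0 | 0 |
| Sperm lysin | 25 | 134 | 2.8 | 0.4 (0.46) | 2.3 | 0.3 | **<0.001** | **<0.001** | **0.04** | **0.015** | 0.49 | 21 | 1 |
| HIV-1 vif | 29 | 192 | 0.96 | 0.007 (0.044) | 0 | 0.17 | 0.058 | **0.0013** | **0.0077** | **0.0018** | 0.95 | 0 | 2 |
| Hepatitis D virus antigen | 33 | 196 | 1.9 | 0.34 (0.37) | 0 | 0.2 | **<0.001** | **<0.001** | 0.25 | 0.098 | 0.99 | 15 | 0 |
| Vertebrate Rhodopsin | 38 | 330 | 3.9 | 0.54 (0.72) | 9.2 | 0.9 | **<0.001** | **<0.001** | **<0.001** | **<0.001** | **0.0029** | 43 | 3 |
| Camelid VHH | 212 | 96 | 15 | 0.29 (0.32) | 0 | 0.13 | **<0.001** | **<0.001** | **0.011** | **0.0026** | 0.92 | 46 | 0 |
| Influenza A virus HA | 349 | 329 | 1.4 | 0.06 (0.06) | 0 | 0.0093 | **<0.001** | **<0.001** | 0.95 | 0.74 | 0.98 | 5 | 0 |
| HIV-1 RT | 476 | 335 | 6.6 | 0.086 (0.093) | 0 | 0.048 | **<0.001** | **<0.001** | 0.15 | 0.052 | 1 | 17 | 1 |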

N: number of sequences; S: number of codons; T: total tree length (expected subs/site) under the 1H model; δ: two-hit rate estimate under the 3H model (2H model in parentheses); ψs: three-hit synonymous island rate estimate under the 3H model; ψ: three-hit rate estimate under the 3H model. Likelihood ratio test p-values are shown for all pairs of nested models, e.g., 2H:1H denotes 2H alternative, 1H null. Values <0.05 are bolded. # sites with ER > 5 lists the number of sites which show strong preferences for the 2H or 3H model using evidence ratios of at least 5 (see text).

  1. Evidence for multiple hits is pervasive. In ten of thirteen datasets the analyses strongly reject the hypothesis that 2H have zero rates, with p < 0.001 (2H:1H comparison). For five of thirteen datasets, we can further reject the hypothesis that 3H have zero rates (3H+:2H comparison) at p ≤ 0.05.

  2. Varied patterns for rate preferences. Even in this small collection of datasets, the entire spectrum of options is present. For the Primate Lysozyme dataset there is no evidence for anything other than 1H changes, while for the Vertebrate Rhodopsin dataset each of the individual rates is significantly different from 0. The HIV-1 vif dataset is the only one that supports 3H rates without supporting 2H rates. Five datasets share a pattern: reject 1H in favor of 2H, and 1H in favor of 3H+, but none of the others, which can be interpreted as support for 2H rates, but none of the 3H rates.

  3. Varied extent of site-level support for MH. Ratios between site-level likelihoods under individual models, denoted here as ER (evidence ratios), can indicate which model provides better fit to the data at a particular site. The number of sites with strong (ER > 5) preference for the 2H vs the 1H model was positive for all datasets rejecting 1H in favor of 2H with LRT, and ranged from 3 to 46, while a smaller number of sites (0 − 6) preferred 3H+ to 2H. Interestingly, for Camelid VHH, where the LRT rejects 1H in favor of 3H+, no individual sites had ER > 5, implying that the support for this model came from a number of individual weak site contributions.

  4. Interaction between 1H, 2H and 3H rates. Assuming that the biological process of evolution does include MH events, not modeling them appropriately might have the effect of inflating other rate estimates.

    In line with other studies [14], the addition of 2H rates lowers the point estimates of ω rates for all datasets where 2H:1H comparison is significant at p ≤ 0.05 (S1 Table), sometimes dramatically (e.g., by a factor of 0.6× for the β−globin gene). This could be indicative of estimation bias due to model misspecification. Similarly, the δ rate under the 2H model is always higher than the rate estimate under the 3H+ model, implying that the 2H rate may be “absorbing” some of the 3H variation. We will later see the same pattern emerge in large-scale sequence screens.

To bolster our intuitive understanding of model preferences, we visualized inferred substitutions at four archetypal sites in the Vertebrate Rhodopsin alignment [32], for which every single rate in the 3H+ model was significantly non-zero (Fig 1). We used joint maximum likelihood ancestral state reconstruction [33] under the 3H+ model to estimate the number and kind of substitutions that occurred at each site (this number is a lower bound and is subject to estimation uncertainty; here we use it for illustration purposes).

Fig 1. Archetypal sites based on model preferences.

Four alignment sites from the Vertebrate Rhodopsin dataset [32] chosen to illustrate substitution patterns which give rise to support for specific rate models. Branches are colored by the amino-acid that is observed/estimated to exist at the end of the branch. Internal nodes are labeled with ancestral states inferred under the 3H+ model. Evidence ratios, which are the ratios of MLE site likelihoods under the respective models, for four pairwise model comparisons are listed below each site.

  • Single-hit site. Site 37 is what one might call a traditional single-hit substitution site, where the 1H model is preferred to all other models based on ER values; all apparent substitutions involve changes at a single nucleotide, hence the standard 1H is perfectly adequate. Of 330 codons, 149 had a preference for the 1H model compared to the 2H model.

  • Two-hit site. Site 144 has a dramatic preference for the 2H model over the 1H model (ER > 300); of 6 total substitutions, 4 involved a change at 2 nucleotides (and none at 3).

  • Serine island site. Site 281 has a preference for the 3HSI model over the 2H model (ER = 39), and has a complex substitution pattern: nine 1H, four 2H, and two 3H substitutions; both 3H substitutions at this site involve synonymous changes between serine codon islands (TCN and AGY). 148 other sites had a preference (ER > 1) for 3HSI over 2H.

  • Three-hit site. Site 236 prefers 3H to 3HSI (ER = 5.4) as the only apparent 3H substitution at that site does not involve serine.

Large-scale empirical analyses

We fitted the hierarchy of MH models to 42,498 empirical datasets, assembled from three large-scale studies of natural selection of nuclear genes [24–26], and a smaller collection of vertebrate and invertebrate mitochondrial genes [27], which represent a different evolutionary landscape (e.g., not affected by polymerase zeta).

Strong evidence for non-zero multiple-hit rates

We found widespread statistical support for models that include non-zero rates involving multiple nucleotides. The 1H model was overwhelmingly rejected in favor of the 2H model (Table 4), and the improvement in fit was quite dramatic on average, for all but the Enard et al [26] collection. A substantial fraction of alignments preferred models that allowed non-zero three-hit rates over the two-hit model, and a substantial fraction also preferred the 3H+ model, which does not limit 3H instantaneous changes to synonymous codons. Based on the results of the four likelihood ratio tests, each dataset could be assigned to a unique rate preference category (Fig 2). For example, 11,899 alignments preferred the 2H to the 1H model, but none of the other comparisons were significant, i.e., there was no evidence for non-zero 3H instantaneous rates. 2,675 alignments preferred 2H to 1H, and 3H+ to 2H, i.e., provided evidence for non-zero 3H instantaneous rates. 483 alignments preferred 2H to 1H and 3HSI to 2H, but not 3H+ to 3HSI, implying that all 3H changes were constrained to synonymous codon islands.

Table 4. Evidence for multiple hit rates in empirical datasets.

| Dataset | 2H:1H | 3H+:2H | 3H:1H | 3H+:3HSI | 3HSI:2H |
|---|---|---|---|---|---|
| Invertebrate mtDNA | 92% (119.2) | 7.4% (17.12) | 92% (122.2) | 8.9% (19.97) | 2.3% (9.089) |
| Vertebrate mtDNA | 54% (33.30) | 3.0% (16.60) | 50% (36.92) | 3.2% (15.11) | 0.69% (7.986) |
| Shultz et al (2009) | 62% (32.39) | 20% (17.76) | 63% (39.87) | 21% (13.63) | 7.4% (12.84) |
| Moretti et al (2014) Unmasked | 76% (55.99) | 37% (21.82) | 77% (67.67) | 20% (13.56) | 29% (16.73) |
| Moretti et al (2014) Masked | 76% (67.35) | 32% (27.3) | 74% (83.23) | 17% (18.24) | 23% (20.92) |
| Enard et al (2006) | 28% (15.69) | 5.4% (14.18) | 28% (20.39) | 5.3% (10.49) | 3.4% (11.07) |
| Overall | 61.02% (53.99) | 23.07% (19.13) | 60.79% (61.71) | 15.66% (15.17) | 15.01% (13.11) |

For each collection of alignments, the table shows the fraction with significant (p < 0.01, based on a 5-way conservative Bonferroni correction for FWER of 5%) LRT test results, and the average value of the likelihood ratio test statistic (for significant tests) in parentheses. Masked vs unmasked refers to the two versions of data in Moretti et al (2014): alignments in the masked version have some low quality sites removed.
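The significance threshold and the resulting pattern of significant tests can be illustrated with a short sketch. The function names are hypothetical; the example p-values are those reported for HIV-1 vif in Table 3, reused here purely for illustration at the stricter p < 0.01 cutoff.

```python
TESTS = ("2H:1H", "3H+:1H", "3H+:2H", "3H+:3HSI", "3HSI:2H")


def bonferroni_threshold(fwer=0.05, n_tests=len(TESTS)):
    """Per-test cutoff that keeps the family-wise error rate at or below `fwer`."""
    return fwer / n_tests


def significant_tests(pvalues, fwer=0.05):
    """Return the set of comparisons whose p-value falls below the Bonferroni cutoff."""
    cutoff = bonferroni_threshold(fwer, len(pvalues))
    return {name for name, p in pvalues.items() if p < cutoff}


# p-values for the HIV-1 vif benchmark alignment, copied from Table 3.
vif = {"2H:1H": 0.058, "3H+:1H": 0.0013, "3H+:2H": 0.0077,
       "3H+:3HSI": 0.0018, "3HSI:2H": 0.95}
print(sorted(significant_tests(vif)))
```

The set of significant comparisons returned for an alignment is what places it in one of the intersection categories used to summarize the large-scale screen.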

Fig 2. Intersections of likelihood ratio test significance.

Overlaps of empirical alignments with p ≤ 0.01 according to each of four LRTs performed for the combined empirical datasets. Groups of alignments for which a particular combination of tests was significant are shown in the main chart. The auxiliary chart in the lower left shows the number of alignments belonging to a particular comparison category (row “sums”), with the significant tests indicated with filled dots. For example, there are 1537 alignments where all 4 tests are significant, and 136 alignments where the only significant test is 3HSI:2H.

Factors associated with MH detection

Fig 3 shows how often 2H, 3H and 3HSI rates were detected (p < 0.01) as a function of simple alignment statistics. Larger (more sequences) and longer (more codons) alignments generally elicited higher detection rates for all types of multiple hits. Increasing overall divergence levels between sequences, measured by the total tree length, also corresponded to increasing detection rates, up to a saturation point. The mean strength of selection, measured by the gene-average ω, had little effect on detection rates, except for a dip in the tail. In a simple logistic regression using 2H:1H p < 0.01 as the outcome variable, sequence length and number of sequences were positively associated with the detection rate (p < 0.0001), while tree length was confounded with the number of sequences and was not independently predictive, and ω was not significantly predictive.

Fig 3. Multiple hit detection rate.

The fraction of alignments where the corresponding test was significant at p ≤ 0.01 as a function of one of four alignment properties. Orange circles depict the binning steps and the number of alignments falling into each bin. For tree lengths and ω values we used estimates under the 1H model.

Strong MH signal comes from a small fraction of sites

For alignments where there was significant evidence for non-zero 2H and/or 3H rates (p < 0.01), a small fraction of sites strongly (ER > 5) supported the corresponding MH model. For the 2H:1H comparison, a median of 0.67% of sites in an alignment (interquartile range, IQR [0.21% − 1.7%]) had high ER in support of the respective model; for the 3H:2H comparison, the median was 0.52% (IQR [0.26% − 0.94%]) (S2 Fig).

Patterns of substitution associated with MH rates

Substitutions between serine islands (AGY and TCN) appear to be the most frequent inferred 3H change in biological alignments (see Fig 4). Six of the most common substitutions at sites with high ER in support of the 3H+ model involve island jumping. However, other amino-acid pairs are also involved in hundreds of apparent substitutions, e.g., ATG(M)↔GCA(A). Of the 7664 datasets that reject the 2H model in favor of the general 3H+ model, 2901 (37.9%) fail to reject 3HSI in favor of 3H+, implying that they only require non-zero rates for synonymous island jumps. However, many of the same changes frequently appear at sites that do not strongly prefer the 3H+ to the 2H model, but strongly prefer the 2H to the 1H model (i.e., 2H sites). A key determinant of whether or not an AGY:TCN or other 3H change benefits from non-zero ψ rates is the length of the branch where the change is inferred to occur. Branches with 3H changes that supported the 3H+ model were significantly shorter than those where the 2H model was sufficient: median 0.09 substitutions/site vs median 0.26 substitutions/site. Consequently, the need to explain 3H changes happening over short branches (shorter evolutionary time, slower overall rates) provides evidence in support of 3H+ models.

Fig 4. Three-hit substitutions commonly occurring in empirical data.

A subset of common three-hit substitutions across all empirical datasets. Three-hit substitutions with 3H+ support are defined as those occurring at sites with ER(3H+:2H)>5. Three-hit substitutions with 2H but not 3H+ support are defined as those occurring at sites with ER(3H+:2H)<1 and ER(2H: 1H)>5. Branch lengths along which the two types of substitutions are inferred to occur are shown in the histogram.

Among 3H non-synonymous substitutions (see Fig 5) codons encoding for serine are still prominently represented, but not as dominant, with numerous substitutions involving methionine and other amino-acids.

Fig 5. Three-hit non-synonymous substitutions and two-hit substitutions occurring in empirical data.

A subset of common substitutions across all empirical datasets. Three-hit substitutions with non-synonymous support are defined as those occurring at sites with ER(3H+:3HSI)>5. Two-hit substitutions over short branches are defined as those occurring at sites with ER(3H+:2H)<1 and ER(2H:1H)>10 where the branch length is ≤0.05 subs/site. Two-hit substitutions over long branches are defined as those occurring at sites with ER(3H+:2H)<1 and ER(2H:1H)>10 where the branch length is ≥0.25 subs/site.

Serine codons are similarly frequently involved in 2H substitutions, along both short and long branches (e.g., between codons such as AGC↔TCC and AGT↔TCT), but other pairs are exchanged at least 90 times, including ACA(T)↔ATG(M) and CAG(Q)↔TGG(W) (short branches) and ATT(I)↔TTA(L) (long branches).

Interaction between rate estimates

As with the benchmark datasets, the inclusion of multiple-hit rates in models has an effect on other substitution rates. The gene-wide point estimate of ω is systematically lowered by the inclusion of non-zero δ rates, even though there are rare instances when the ω estimates are increased (S1 Fig). A Theil-Sen robust linear regression estimate yields ω(2H) ∼ 0.965 × ω(1H), but for 1150 (5.7%) of the datasets where the 2H:1H comparison was significant, ω(2H) < 0.75 × ω(1H). Consequently, the estimation bias in important evolutionary rates due to model misspecification could be significant for some of the datasets. The inclusion of 3H components in the model lowers the 2H rate as well: δ(3H+) ∼ 0.77 × δ(2H).
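For readers unfamiliar with the estimator, the following is a minimal sketch of a Theil-Sen fit using only the standard library. It is a generic textbook version (median of pairwise slopes, then median residual as intercept) and is not the code used to produce the estimates above; whether the published fit included an intercept is not stated, so the intercept returned here is an assumption of this sketch.

```python
from itertools import combinations
from statistics import median


def theil_sen(x, y):
    """Robust line fit: slope is the median of all pairwise slopes."""
    slopes = [
        (y[b] - y[a]) / (x[b] - x[a])
        for a, b in combinations(range(len(x)), 2)
        if x[b] != x[a]  # skip pairs with identical x to avoid division by zero
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x values")
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept
```

Because the slope is a median rather than a least-squares solution, a handful of datasets with strongly reduced ω(2H) would have little influence on the overall trend estimate.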

Impact on branch length estimates

Branch lengths estimated with 2H and 3H models on the Selectome dataset were, on average, 0.93× the length of the standard (1H) estimate, while they were effectively identical between 2H and 3H estimates (S5 Fig). On data simulated without MH, all three models (1H, 2H, 3H) yielded branch length estimates that were nearly identical, and a slight underestimate of the true values due to a bias in estimating the length of very short branches. However, on data simulated with MH, 1H models overestimate branch lengths compared to the 2H/3H models. A plausible explanation is that 1H models “expand” branch lengths slightly to compensate for the multi-hit events, while 2H and 3H models do not introduce bias in branch length estimates when there is no underlying multi-hit component because they can account for this situation by setting some parameters to 0.

Simulations

False positive rates

We evaluated operating characteristics of the likelihood ratio tests (LRT) for MH model testing on parametrically simulated data. In the simplest case of single-branch (two-sequence) null data generated under the 1H model, Type I error rates for the 2H:1H and 3H:2H tests were on average below nominal. However, for very divergent sequences (e.g., >3 expected substitutions per site), the test became somewhat anti-conservative, which is not surprising for such severely saturated data (S3 Fig). Individual branches that are this long are highly abnormal in real biological datasets. When expanded to multiple sequence alignments generated using parameter estimates from four biological datasets, simulations confirmed that all the tests employed appear to be somewhat conservative; this is by design, because asymptotic distributions of LRT statistics when null hypotheses are on the boundaries of the parameter space are less conservative than the χ² distributions with 1 or 2 degrees of freedom that we use here [34].
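The conservativeness argument can be made concrete with a short calculation. The sketch below computes an LRT p-value from two log-likelihoods using the χ² distribution with 1 or 2 degrees of freedom, and compares it with the p-value under the 50:50 mixture of a point mass at zero and χ² with 1 degree of freedom, which is the standard asymptotic result for a single parameter tested on its boundary. The function names and the example log-likelihood values are hypothetical; this is not the HyPhy implementation.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square variable for df = 1 or 2.

    Closed forms: df=1 -> erfc(sqrt(x/2)); df=2 -> exp(-x/2).
    """
    if stat <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("only df = 1 or 2 are supported in this sketch")


def lrt_pvalue(lnl_null, lnl_alt, df):
    """Likelihood ratio statistic 2*(lnL_alt - lnL_null) and its chi-square p-value."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf(stat, df)


def boundary_mixture_pvalue(stat):
    """p-value under the 0.5*chi2_0 + 0.5*chi2_1 mixture (one boundary parameter)."""
    if stat <= 0:
        return 1.0
    return 0.5 * chi2_sf(stat, 1)


# Hypothetical log-likelihoods for a nested pair of models.
stat, p = lrt_pvalue(-1000.0, -998.0, df=1)
p_mixture = boundary_mixture_pvalue(stat)
```

For any positive statistic the mixture p-value is exactly half the χ² (1 degree of freedom) p-value, so using the plain χ² reference distribution rejects less often than the nominal level.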

Power

The tests are generally well-powered, especially if the effect sizes (magnitudes of MH rates) are sufficiently large (Table 5). The power to detect two-hit substitutions (2H:1H) is especially high (>90%) across all simulations. The test which attempts to identify non-zero triple-hit synonymous island rates (3HSI:2H) is the least powerful, because its signal is derived from a tiny fraction of all substitutions (substitutions between synonymous islands), i.e. the effective sample size is smaller than for the other tests.

Table 5. Power to detect MH rates.
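| Test | N (all) | Power (all) | N (large effect) | Power (large effect) | Large effect definition |
|---|---|---|---|---|---|
| 2H:1H | 1956 | 94% | 967 | 99% | δ > 0.5 |
| 3H+:2H | 1940 | 64% | 1056 | 83% | ψ > 0.5 |
| 3HSI:2H | 447 | 33% | 114 | 51% | ψs > 5.0 |
| 3H+:3HSI | 1940 | 66% | 1056 | 86% | ψ > 0.5 |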

The fractions of simulated datasets that had p < 0.05 for the corresponding test. N = number of simulations in each category; the explicit definition of a large effect size is shown in parentheses.

False positives due to alignment errors

Whelan and Goldman [12] suggested that non-zero estimates of triple-hit rates could be at least partially attributed to alignment errors. It is impossible, with a few rare exceptions, to declare that any particular alignment of biological sequences is correct. Hence, in order to estimate what, if any, effect potential multiple sequence alignment errors might have on our inference, we simulated null data (1H model) with varying indel rates with INDELible [28], inferred multiple sequence alignments with MAFFT [35] in a codon-aware fashion, inferred trees using neighbor-joining, and performed our hierarchical model fit. This procedure induces multiple levels of model misspecification and error: INDELible uses a different model (GY94 M3) to simulate sequences, there is alignment error, and there is phylogeny inference error. Sufficiently high indel rates coupled with other inference errors can indeed bias our tests to become anti-conservative, although these levels are higher than what we see (based on per-sequence “gap”/character ratios) for our biological alignments (S4 Fig). Empirical alignments have gap content that is consistent with alignments simulated with indel rates of 0.01 to 0.015, for which test performance is nominal. The Selectome dataset [24] can also be retrieved with unreliably aligned sites masked. We compared the masked and unmasked alignments and found similar results: 76% of alignments in both cases favor 2H over 1H; 77% of unmasked alignments and 74% of masked alignments choose 3H over 1H (Table 4). While there is certainly some false signal due to misalignment, it is unlikely to be the dominant factor here. Nonetheless, care must be taken not to over-interpret MH findings when the alignments are uncertain.
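The gap-content comparison between simulated and empirical alignments relies on a simple per-sequence summary. A minimal sketch of such a summary is given below. It assumes the statistic is the fraction of alignment columns occupied by gap characters in each sequence, which is one reading of the per-sequence “gap”/character ratio; the function names and the toy alignment are hypothetical.

```python
def gap_fraction_per_sequence(alignment, gap_chars="-"):
    """Fraction of positions that are gaps, for each aligned sequence.

    alignment : dict mapping sequence name -> aligned sequence string.
    Empty sequences are skipped to avoid division by zero.
    """
    return {
        name: sum(ch in gap_chars for ch in seq) / len(seq)
        for name, seq in alignment.items()
        if seq
    }


def mean_gap_fraction(alignment, gap_chars="-"):
    """Average of the per-sequence gap fractions across an alignment."""
    fractions = gap_fraction_per_sequence(alignment, gap_chars)
    if not fractions:
        raise ValueError("alignment contains no non-empty sequences")
    return sum(fractions.values()) / len(fractions)


# Toy codon alignment: one sequence carries a single-codon deletion.
toy = {"seq_a": "ATG---CCC", "seq_b": "ATGAAACCC"}
per_sequence = gap_fraction_per_sequence(toy)
average = mean_gap_fraction(toy)
```

Comparing the distribution of this average across empirical alignments with the same quantity for alignments simulated at known indel rates gives a rough calibration of how much alignment error is plausible in real data.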

Discussion

Nearly three in five empirical alignments considered here provide strong statistical support that at least some of the substitutions are not well modeled by standard codon models. More than one in five prefer to have direct three-hit substitutions accounted for explicitly in the models. Substitutions involving serine codons, which are unique among the amino-acids in that they comprise two islands which are two or three nucleotide changes from each other, are prominent in driving statistical signal for these preferences, especially if they occur along short branches. Many other amino-acid pairs are also involved in such exchanges, indicating that not all of the statistical signal is due to serine codons, although in a typical alignment only a small fraction of sites (about 1%) prefer multiple hit models strongly.
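The statement that the two serine codon islands are two or three nucleotide changes apart can be checked by direct enumeration. The sketch below counts nucleotide differences between every TCN codon and every AGY codon under the standard genetic code; the variable and function names are chosen for illustration only.

```python
# Serine codons in the standard genetic code, grouped by island.
SERINE_ISLAND_TCN = ["TCA", "TCC", "TCG", "TCT"]
SERINE_ISLAND_AGY = ["AGC", "AGT"]


def nucleotide_differences(codon_a, codon_b):
    """Number of positions at which two codons differ (Hamming distance)."""
    return sum(a != b for a, b in zip(codon_a, codon_b))


def island_distances():
    """Distances between every cross-island pair of serine codons."""
    return {
        (tcn, agy): nucleotide_differences(tcn, agy)
        for tcn in SERINE_ISLAND_TCN
        for agy in SERINE_ISLAND_AGY
    }


distances = island_distances()
two_hit_pairs = sorted(pair for pair, d in distances.items() if d == 2)
three_hit_pairs = sorted(pair for pair, d in distances.items() if d == 3)
```

No cross-island pair differs by a single nucleotide: the first two codon positions always differ, and only TCC/AGC and TCT/AGT share the third position, so any synonymous switch between the islands requires a two-hit or three-hit change, or passage through a non-serine intermediate.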

Many previous studies have provided evidence that evolutionary models with multiple hits provide better fit to the data, but the scale of this phenomenon in the comparative evolutionary context has not been fully appreciated, although the interest in model development in this area is being rekindled. Our results also show that the inclusion of multiple-hit model parameters changes ω estimates and, with them, potentially alters inferences of positive selection, as was demonstrated for branch-site tests [20] and for data simulated with multiple hits but analyzed with standard models [14]. Additionally, traditional models may slightly overestimate branch lengths for data where multiple-hit models provide better fits.

How much of this apparent support for multiple-hits comes from biological reality, and how much from statistical artifacts, or other unmodeled evolutionary processes—the so-called phenomenological load [36]? Our simulation studies provide compelling evidence that the tests we use here are statistically well-behaved and possess good power, i.e. our positive findings are unlikely to be primarily a result of statistical misclassification. Other confounders, especially alignment error, have the potential to mislead the tests, but only at levels that appear higher than what is likely present in most biological alignments. In addition, there are some datasets (e.g., HIV reverse transcriptase), where alignment is not in question due to low biological insertion/deletion rates or structural information, and these data still support non-zero multiple-hit rates as well.

There is an abundance of data on, and examples of, doublet substitutions in the literature, and mechanistic explanations for them exist, e.g., polymerase zeta [7]. Several papers argue that the number of apparent triple-hits occurring in sequences is greater than what we would expect solely from random mutation [37–39]; however, the mechanism (if it exists) by which they might occur is speculative. As one option, Sakofsky et al. [40] have suggested that DNA repair mechanisms could help explain multi-nucleotide mutations.

Our analyses indicate that much, but not all, of the support for non-zero triple-hit rates derives from serine codon island jumping, particularly in cases when these jumps must occur over a short branch in the tree. Comparative species data might lack the requisite resolution to discriminate between instant multiple base changes and a rapid succession of single nucleotide changes spurred on by selection; the literature is split on which mechanism is primal [5, 6]. Such a common phenomenon is worth further investigation, in our opinion.

Our evolutionary models are broadly comparable to several others that have been published in this domain, some of which have more parametric complexity [14], or consider effects of substitutions spanning codon boundaries [12]. Our novel contributions are direct tests for the importance of synonymous island jumping, and a simple evidence ratio approach to identify and categorize specific sites that benefit from non-zero multiple-hit rates. These models are easy to fit computationally, with roughly the same cost as would be required for an ω-based positive selection analysis, and we provide an accessible implementation for researchers to use them. Further modeling extensions, e.g. the inclusion of synonymous rate variation or branch-site effects, can be easily incorporated.

Supporting information

S1 Table. Estimated ω rate distributions under different models for the benchmark datasets.

E[ω]: the mean ω value for the 1H model. E[ω]:2H / E[ω]:1H: the ratio of mean ω estimates from the 2H and 1H models. δ:3H+ / δ:2H: the ratio of δ estimates from the 3H+ and 2H models. The datasets are sorted by increasing values of the E[ω]:2H / E[ω]:1H column. Genes where there was significant evidence (LRT p < 0.05) of non-zero 2H rates are bolded, and those where there is evidence of non-zero 3H rates are underlined.

(PDF)

S1 Fig. The effect of model choice on rate estimates.

Point estimates of global rate parameters under different models for each of the empirical datasets.

(TIF)

S2 Fig. The fraction of sites with strong MH model preference.

Histograms are over alignments where there was significant (p < 0.01) support for the corresponding model: 20,338 for 2H:1H (gray) and 7664 for 3H:2H (red).

(TIF)

S3 Fig. False positive rates for LRTs on simulated data.

Results are shown for two sequence analyses (left) and multiple sequence analyses (right). For the two sequence simulations, we stratified the simulations by the length of the branch, T, (the range is labeled in the figure) measured in expected substitutions per site. The dotted line shows the nominal expectation (rejection rate = nominal p-value).

(TIF)

S4 Fig. Indel rate versus TH rate.

Alignments with indels were simulated using INDELible, with the Drosophila adh tree and alignment length and the GY94 M3 model with site-to-site ω variation. LRT p-values and rejection rates (FPR, at p ≤ 0.05) are shown for different tests in the top row. The bottom row shows estimated δ and ψ rates as a function of simulated indel rates, as well as the number of sites inferred to have high evidence ratios (ER) for 2H or 3H models. The plot on the bottom right shows the average fraction of a sequence in an alignment that is comprised of gaps, for simulated data and empirical collections.

(TIF)

S5 Fig. Branch-length estimate behavior under different models.

Simulated Data (1H): null simulations (1H model). Simulated Data (MH): power simulations. Selectome: empirical data. Red lines are drawn with least squares linear regression, whose estimated slopes and intercepts, as well as proportions of variance explained (R²), are added to each plot.

(TIF)

Data Availability

The study dataset is available on Zenodo, at https://doi.org/10.5281/zenodo.4570785.

Funding Statement

This work was supported by National Institutes of Health grants R01 GM093939 and R01 AI134384 to SKP.

References

  • 1. Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Molecular Biology and Evolution. 1994;11(5):715–724.
  • 2. Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Molecular Biology and Evolution. 1994;11(5):725–736. doi:10.1186/s13059-014-0542-8
  • 3. Halpern AL, Bruno WJ. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 1998;15(7):910–7. doi:10.1093/oxfordjournals.molbev.a025995
  • 4. Wolfe KH, Sharp PM. Mammalian gene evolution: nucleotide sequence divergence between mouse and rat. Journal of Molecular Evolution. 1993.
  • 5. Averof M, Rokas A, Wolfe KH, Sharp PM. Evidence for a high frequency of simultaneous double-nucleotide substitutions. Science. 2000;287(5456):1283–1286. doi:10.1126/science.287.5456.1283
  • 6. Rogozin IB, Belinky F, Pavlenko V, Shabalina SA, Kristensen DM, Koonin EV. Evolutionary switches between two serine codon sets are driven by selection. Proceedings of the National Academy of Sciences of the United States of America. 2016;113(46):13109–13113. doi:10.1073/pnas.1615832113
  • 7. Harris K, Nielsen R. Error-prone polymerase activity causes multinucleotide mutations in humans. Genome Research. 2014;24(9):1445–1454. doi:10.1101/gr.170696.113
  • 8. Kaplanis J, Akawi N, Gallone G, McRae JF, Prigmore E, Wright CF, et al. Exome-wide assessment of the functional impact and pathogenicity of multinucleotide mutations. Genome Research. 2019;29(7):1047–1056. doi:10.1101/gr.239756.118
  • 9. Chen JM, Cooper DN, Férec C. A new and more accurate estimate of the rate of concurrent tandem-base substitution mutations in the human germline: 0.4% of the single-nucleotide substitution mutation rate. Hum Mutat. 2014;35(3):392–4. doi:10.1002/humu.22501
  • 10. Okada M, Misumi Y, Ueda M, Yamashita T, Masuda T, Tasaki M, et al. A novel transthyretin variant V28S (p.V48S) with a double-nucleotide substitution in the same codon. Amyloid. 2017;24(4):231–232. doi:10.1080/13506129.2017.1381082
  • 11. Kosiol C, Holmes I, Goldman N. An empirical codon model for protein sequence evolution. Molecular Biology and Evolution. 2007;24(7):1464–1479. doi:10.1093/molbev/msm064
  • 12. Whelan S, Goldman N. Estimating the frequency of events that cause multiple-nucleotide changes. Genetics. 2004;167(4):2027–2043. doi:10.1534/genetics.103.023226
  • 13. Zaheri M, Dib L, Salamin N. A generalized mechanistic codon model. Molecular Biology and Evolution. 2014. doi:10.1093/molbev/msu196
  • 14. Dunn KA, Kenney T, Gu H, Bielawski JP. Improved inference of site-specific positive selection under a generalized parametric codon model when there are multinucleotide mutations and multiple nonsynonymous rates. BMC Evolutionary Biology. 2019;19(1):22. doi:10.1186/s12862-018-1326-7
  • 15. Doron-Faigenboim A, Pupko T. A combined empirical and mechanistic codon model. Mol Biol Evol. 2007;24(2):388–97. doi:10.1093/molbev/msl175
  • 16. Miyazawa S. Selective constraints on amino acids estimated by a mechanistic codon substitution model with multiple nucleotide changes. PLoS One. 2011;6(3):e17244. doi:10.1371/journal.pone.0017244
  • 17. Zoller S, Schneider A. A new semiempirical codon substitution model based on principal component analysis of mammalian sequences. Mol Biol Evol. 2012;29(1):271–7. doi:10.1093/molbev/msr198
  • 18. De Maio N, Holmes I, Schlötterer C, Kosiol C. Estimating empirical codon hidden Markov models. Mol Biol Evol. 2013;30(3):725–36. doi:10.1093/molbev/mss266
  • 19. Jones CT, Youssef N, Susko E, Bielawski JP. A Phenotype–Genotype Codon Model for Detecting Adaptive Evolution. Systematic Biology. 2019;69(4):722–738. doi:10.1093/sysbio/syz075
  • 20. Venkat A, Hahn MW, Thornton JW. Multinucleotide mutations cause false inferences of lineage-specific positive selection. Nature Ecology and Evolution. 2018;2(8):1280–1288. doi:10.1038/s41559-018-0584-5
  • 21. Kosakovsky Pond S, Delport W, Muse SV, Scheffler K. Correcting the Bias of Empirical Frequency Parameter Estimators in Codon Models. PLoS ONE. 2010;5(7):e11230. doi:10.1371/journal.pone.0011230
  • 22. Kosakovsky Pond SL, Poon AFY, Velazquez R, Weaver S, Hepler NL, Murrell B, et al. HyPhy 2.5-A Customizable Platform for Evolutionary Hypothesis Testing Using Phylogenies. Mol Biol Evol. 2020;37(1):295–299. doi:10.1093/molbev/msz197
  • 23. Murrell B, Weaver S, Smith MD, Wertheim JO, Murrell S, Aylward A, et al. Gene-Wide Identification of Episodic Selection. Molecular Biology and Evolution. 2015;32(5):1365–1371. doi:10.1093/molbev/msv035
  • 24. Moretti S, Laurenczy B, Gharib WH, Castella B, Kuzniar A, Schabauer H, et al. Selectome update: quality control and computational improvements to a database of positive selection. Nucleic Acids Research. 2014;42(Database issue):917–21. doi:10.1093/nar/gkt1065
  • 25. Shultz AJ, Sackton TB. Immune genes are hotspots of shared positive selection across birds and mammals. eLife. 2019;8. doi:10.7554/eLife.41815
  • 26. Enard D, Cai L, Gwennap C, Petrov DA. Viruses are a dominant driver of protein adaptation in mammals. eLife. 2016;5. doi:10.7554/eLife.12469
  • 27. Mannino F, Wisotsky S, Kosakovsky Pond SL, Muse SV. Equiprobable discrete models of site-specific substitution rates underestimate the extent of rate variability. PLoS One. 2020;15(3):e0229493. doi:10.1371/journal.pone.0229493
  • 28. Fletcher W, Yang Z. INDELible: A flexible simulator of biological sequence evolution. Molecular Biology and Evolution. 2009;26(8):1879–1888. doi:10.1093/molbev/msp098
  • 29. Weaver S, Shank SD, Spielman SJ, Li M, Muse SV, Kosakovsky Pond SL. Datamonkey 2.0: A Modern Web Application for Characterizing Selective and Other Evolutionary Processes. Mol Biol Evol. 2018;35(3):773–777. doi:10.1093/molbev/msx335
  • 30. Wisotsky SR, Kosakovsky Pond SL, Shank SD, Muse SV. Synonymous site-to-site substitution rate variation dramatically inflates false positive rates of selection analyses: ignore at your own peril. Molecular Biology and Evolution. 2020. doi:10.1093/molbev/msaa037
  • 31. Yang Z. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Molecular Biology and Evolution. 1998;15(5):568–573. doi:10.1093/oxfordjournals.molbev.a025957
  • 32. Yokoyama S, Tada T, Zhang H, Britt L. Elucidation of phenotypic adaptations: Molecular analyses of dim-light vision proteins in vertebrates. Proceedings of the National Academy of Sciences. 2008;105(36):13480–13485. doi:10.1073/pnas.0802426105
  • 33. Pupko T, Pe'er I, Shamir R, Graur D. A Fast Algorithm for Joint Reconstruction of Ancestral Amino Acid Sequences. Molecular Biology and Evolution. 2000;17(6):890–896. doi:10.1093/oxfordjournals.molbev.a026369
  • 34. Self SG, Liang KY. Asymptotic Properties of Maximum Likelihood Estimators and Likelihood Ratio Tests Under Nonstandard Conditions. J Am Stat Assoc. 1987;82(398):605–610. doi:10.1080/01621459.1987.10478472
  • 35. Katoh K, Misawa K, Kuma Ki, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002;30(14):3059–66. doi:10.1093/nar/gkf436
  • 36. Jones CT, Youssef N, Susko E, Bielawski JP. Phenomenological Load on Model Parameters Can Lead to False Biological Conclusions. Mol Biol Evol. 2018;35(6):1473–1488. doi:10.1093/molbev/msy049
  • 37. Bazykin GA, Kondrashov FA, Ogurtsov AY, Sunyaev S, Kondrashov AS. Positive selection at sites of multiple amino acid replacements since rat-mouse divergence. Nature. 2004;429(6991):558–562. doi:10.1038/nature02601
  • 38. Schrider DR, Hourmozdi JN, Hahn MW. Pervasive multinucleotide mutational events in eukaryotes. Current Biology. 2011. doi:10.1016/j.cub.2011.05.013
  • 39. Smith NG, Hurst LD. The causes of synonymous rate variation in the rodent genome. Can substitution rates be used to estimate the sex bias in mutation rate? Genetics. 1999;152(2):661–73.
  • 40. Sakofsky CJ, Roberts SA, Malc E, Mieczkowski PA, Resnick MA, Gordenin DA, et al. Break-induced replication is a source of mutation clusters underlying kataegis. Cell Rep. 2014;7(5):1640–1648. doi:10.1016/j.celrep.2014.04.053

Decision Letter 0

Marc Robinson-Rechavi

4 Dec 2020

PONE-D-20-31568

Extra base hits: widespread empirical support for instantaneous multiple-nucleotide changes.

PLOS ONE

Dear Dr. Lucaci,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

As you will see, both reviewers found the writing insufficiently clear and precise. Please take this comment seriously, as it is one of the key criteria of PLOS One, and it is impossible to evaluate well the science if the writing is unclear. In your revision, the key point is thus to improve the writing, but please address all comments by the reviewers.

Please submit your revised manuscript by Jan 14 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Marc Robinson-Rechavi

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Lucaci et al present an empirical analysis to assess how widespread single vs multi-nucleotide hits are. On the whole, I find this research conceptually interesting, timely, and reasonable for publication. However, I found this paper quite difficult to review. Far too much of the paper is not written in complete sentences (the ideal number of incomplete sentences is 0), references appear to use the wrong LaTeX command (e.g. [5] should not be used as a sentence subject..), acronyms are not clearly defined when introduced (and from my reading seemed to shift in their definition? or introduced and then called something else later?), and the presentation of the model itself is very confusing. The proposed model itself is NOT that complex, so it is troubling that the presentation is not straightforward to follow.

Further, I have some difficulty clearly interpreting some of the figures. For example, the monotone splines in Figure 2 are very misleading, especially given that it seems there are wildly different sample sizes across the X-axis. I do not see a clear scientific interpretation of a monotone spline in this context. Fig S3 is ambiguously labeled, and the double axes on Fig 4 (with what seem to be semi-outlandish ranges for the left-hand axis?) are virtually impossible to interpret reliably.

Based on my reading, the current manuscript version suffers from an overall lack of editing. It was not submitted in a "polished" form. I will be glad to review another version after the text has been significantly cleaned, properly organized, and figures have been designed to yield clear interpretations. Again, I believe the science here is solid, but it is not currently presented clearly at all.

Reviewer #2: Lucaci et al present codon substitution models that allow multiple nucleotide substitution in a codon. They show that such models provide better fit in LRTs. The authors show the results in a sensible flow, but I feel that the picture is quite incomplete.

According to the study, it seems that the majority of datasets and positions across an alignment do have preference for more complex models. It does make sense that more parameter-free models fit the data better, but I was not convinced that they produce better trees. Previous studies show that simpler models lead to more accurate branch-length estimates. I wonder whether these results are biased by the hypothesis testing method. I would suggest to compare the resulting trees of the simulation studies to the starting trees and look at the distances, e.g., branch-length distance. It could be that 3H models are better fitted but generate overestimated branch lengths whereas 1H models are more conservative and produce more accurate trees.

Other than that, I think that the clarity of the text, especially in the Introduction and Methods, could be improved. I had a hard time understanding the details before I reached the Results section. For example, models notations in the first paragraph of the Methods section are only defined in the Results.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Decision Letter 1

Marc Robinson-Rechavi

10 Feb 2021

PONE-D-20-31568R1

Extra base hits: widespread empirical support for instantaneous multiple-nucleotide changes.

PLOS ONE

Dear Dr. Lucaci,

Thank you for submitting your manuscript to PLOS ONE. I am sorry for the delay. Reviewer 1 has clarified that they no longer wished to review this manuscript, so I took time to evaluate it myself rather than lose even more time finding another reviewer. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

I agree with reviewer 2 that the work is ready for publication in PLOS One, but that the writing should be carefully checked. In addition to the errors noted by the reviewer, I have noticed the following:

"signal due to misaligned" misses a word, probably "misaligned sites".

"for one 384 branch-site tests, [20]" if there is one test it shouldn't have an 's'.

Please submit your revised manuscript by Mar 27 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Marc Robinson-Rechavi

Academic Editor

PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: The authors have addressed my concerns. The flow of the manuscript was much improved and it is now much easier to follow and understand.

There are several English mistakes in some parts of the text; I suggest that the authors thoroughly review the writing. Examples: "have have" in line 20; “traction” should be “attraction” in line 45; "these models *have* been unable" in line 44.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

Decision Letter 2

Marc Robinson-Rechavi

25 Feb 2021

Extra base hits: widespread empirical support for instantaneous multiple-nucleotide changes.

PONE-D-20-31568R2

Dear Dr. Lucaci,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Marc Robinson-Rechavi

Academic Editor

PLOS ONE

Acceptance letter

Marc Robinson-Rechavi

3 Mar 2021

PONE-D-20-31568R2

Extra base hits: widespread empirical support for instantaneous multiple-nucleotide changes.

Dear Dr. Lucaci:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Marc Robinson-Rechavi

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Estimated ω rate distributions under different models for the benchmark datasets.

    E[ω]: the mean ω value for the 1H model. E[ω]:2H / E[ω]:1H: the ratio of mean ω estimates from the 2H and 1H models. δ:3H+ / δ:2H: the ratio of δ estimates from the 3H+ and 2H models. The datasets are sorted by increasing values of the E[ω]:2H / E[ω]:1H column. Genes with significant evidence (LRT p < 0.05) of non-zero 2H rates are bolded, and those with evidence of non-zero 3H rates are underlined.

    (PDF)

    S1 Fig. The effect of model choice on rate estimates.

    Point estimates of global rate parameters under different models for each of the empirical datasets.

    (TIF)

    S2 Fig. The fraction of sites with strong MH model preference.

    Histograms are over alignments where there was significant (p < 0.01) support for the corresponding model: 20,338 for 2H:1H (gray) and 7,664 for 3H:2H (red).

    (TIF)

    S3 Fig. False positive rates for LRTs on simulated data.

    Results are shown for two-sequence analyses (left) and multiple-sequence analyses (right). For the two-sequence simulations, we stratified the simulations by branch length, T (the range is labeled in the figure), measured in expected substitutions per site. The dotted line shows the nominal expectation (rejection rate = nominal p-value).

    (TIF)

    S4 Fig. Indel rate versus TH rate.

    Alignments with indels were simulated with INDELible, using the Drosophila adh tree and alignment length, under the GY94 M3 model with site-to-site ω variation. LRT p-values and rejection rates (FPR, at p ≤ 0.05) are shown for different tests in the top row. The bottom row shows estimated δ and ψ rates as a function of simulated indel rates, as well as the number of sites inferred to have high evidence ratios (ER) for 2H or 3H modes. The plot on the bottom right shows the average fraction of a sequence in an alignment that is composed of gaps, for simulated data and empirical collections.

    (TIF)

    S5 Fig. Branch-length estimate behavior under different models.

    Simulated Data (1H): null simulations (1H model). Simulated Data (MH): power simulations. Selectome: empirical data. Red lines are least-squares linear regression fits; the estimated slopes and intercepts, as well as the proportion of variance explained (R²), are added to each plot.

    (TIF)

    Attachment

    Submitted filename: Response to Reviewers.pdf

    Attachment

    Submitted filename: Response to Reviewers.pdf

    Data Availability Statement

    The study dataset is available on Zenodo, at https://doi.org/10.5281/zenodo.4570785.


    Articles from PLoS ONE are provided here courtesy of PLOS
