Abstract
Evolutionary pathways describe trajectories of biological evolution in the space of different variants of organisms (genotypes). The probability of existence and the number of evolutionary pathways that lead from a given genotype to a better-adapted genotype are important measures of accessibility of local fitness optima and the reproducibility of evolution. Both quantities have been studied in simple mathematical models where genotypes are represented as binary sequences of two types of basic units, and the network of permitted mutations between the genotypes is a hypercube graph. However, it is unclear how these results translate to the biologically relevant case in which genotypes are represented by sequences of more than two units, for example four nucleotides (DNA) or 20 amino acids (proteins), and the mutational graph is not the hypercube. Here we investigate accessibility of the best-adapted genotype in the general case of K > 2 units. Using computer generated and experimental fitness landscapes we show that accessibility of the global fitness maximum increases with K and can be much higher than for binary sequences. The increase in accessibility comes from the increase in the number of indirect trajectories exploited by evolution for higher K. As one of the consequences, the fraction of genotypes that are accessible increases by three orders of magnitude when the number of units K increases from 2 to 16 for landscapes of size N ∼ 106 genotypes. This suggests that evolution can follow many different trajectories on such landscapes and the reconstruction of evolutionary pathways from experimental data might be an extremely difficult task.
Author Summary
Biological evolution is driven by heritable, genetic alterations that affect the fitness of organisms. However, the pool of “fitter” variants (genotypes) is often restricted and it is not at all obvious how evolution finds its way from low-fitness to high-fitness genotypes in a complex, multidimensional “fitness landscapes” with many peaks (fit organisms) and valleys (unfit ones). To address this question we investigate how likely it is for biological evolution to find a way “uphill” from a lower-fitness organism to the best adapted organism. We discover that the accessibility of the fittest organism depends on the number of types of basic “units” used to encode genotypes. These units can be, for example, the four DNA nucleotides A,T,C,G, or the ∼20 amino acids used for synthesizing proteins, and the choice of the most appropriate unit is dictated by how the genotypes and the fitnesses are related—a relationship that researchers have begun to unveil only recently. We find that increasing the number of units strongly increases the probability that there will be at least one uphill path to the best-adapted genotype, and the number of evolutionary pathways leading to it. Our findings suggest that biological evolution can follow many more pathways than previously thought.
Introduction
Biological evolution can be visualised as a dynamical process in which reproduction and mutations cause organisms to move in the “fitness landscape” [1]—a complex, multidimensional landscape full of ridges and local peaks in which space represents genetic variants (genotypes) of organisms, and “height” corresponds to organism’s fitness—a measure of the reproductive success.
The structure of the fitness landscape (FL) affects the speed and predictability of biological evolution and is thus very important for practical reasons such as forecasting the evolution of resistance to antibiotics [2–7]. However, fitness landscapes are astronomically large even for the simplest organisms: the number of possible genotypes with the same genome size as the smallest known bacterial genome to-date (Nasuia deltocephalinicola [8], L = 112kbp) is N = 4112000 ≈ 1067430 genotypes. For this reason only small fragments of fitness landscapes have been determined experimentally [3, 9–13] and progress has been made mostly by studying adaptation in computer-generated landscapes (see e.g. [14–16]) or in simple theoretical models (see e.g. House-of-Cards [17], NK [18], Mount Fuji [19], holey landscapes [20]). In these models genotypes are typically represented as binary sequences, where 0 and 1 correspond to two different alleles of a gene or a particular point mutation being absent/present. This choice has been dictated by three factors: (i) availability of empirical data in which genotypes usually differ by specific point mutations from a certain “wild-type” genotype (cf. Refs. [10, 11]), (ii) most studies focused on the shortest evolutionary pathways for which (as we will see later) the number of alleles is not important, (iii) under the assumption that offspring differ from their parent by only one character, the resulting graph of permitted mutations is the hypercube graph and this greatly facilitates mathematical analysis [17, 21].
However, the space of real genotypes and their mutational network differ significantly from the space of binary sequences and the hypercube graph. A DNA sequence is made of four different nucleotides (A,T,G,C), hence at the most fundamental level genotypes must be represented as sequences of K = 4 different characters. Another example with K = 4 is RNA folding in which relationship between the coding sequence and the corresponding secondary-structure defines a landscape of energetically favorable RNA conformations [22–24]. On a more coarse-grained level, assuming that fitness is determined primarily by protein structure, a genotype can be represented as a sequence of characters from a set of K > 20 symbols which correspond to 20 standard amino acids [25–27] and their post-translational modifications. Finally, if we consider genotypes as sequences of genes, K is equal to the number of different variants (alleles) of genes and is practically infinite. All these examples are valid descriptions of the genotype space but they lead to different structures of the corresponding fitness landscape: different genotype-to-fitness mappings and different structures of the graphs of possible transitions (mutations) between the genotypes.
Here we investigate how the global structure of the fitness landscape is affected by the number K of different “units”, or characters, of which sequences of length L representing genotypes are made, and the structure of the corresponding mutational graph (Fig 1). We focus on evolutionary accessibility of the global fitness maximum. An accessible pathway is a trajectory in the genotype space along which fitness increases monotonically. Accessible pathways are important because in the low mutation-strong selection limit [28–30] such pathways are good approximations to evolutionary trajectories obtained in computer models of biological evolution, regardless of the details of the model. Accessible pathways are thus expected to represent real evolutionary pathways, at least in some experiments [3].
While we intuitively expect that increasing the number of coding units and hence the connectivity of the mutational graph should increase accessibility if the total number of genotypes N = KL is kept fixed, it is not obvious how increasing L affects accessibility for different K. In the case of well-studied binary landscapes (K = 2) the probability that a maximally random (House-of-Cards) fitness landscape has at least one accessible pathway approaches zero as L → ∞ [30–32]. While many evolutionary models have been studied in the case K > 2 [33–36], these works have not addressed accessibility of fitness landscapes. It is also unclear what happens for K > 2 in the presence of correlations between the fitnesses that exist in many real fitness landscapes [10].
To address these questions, we analyse computer-generated and experimental fitness landscapes with various K and L. First, we show that if maximally random landscapes with the same number of genotypes N = KL but different K are compared, larger K leads to higher accessibility. In contrast to the binary case K = 2, in the large-L limit the landscape with K > 2 contains on average at least one pathway. We demonstrate that as K increases, accessible pathways become longer and contain an increasing number of indirect mutations which do not decrease the distance to the fittest genotype. We also show that accessibility depends strongly on the fitness of the initial genotype, f0, and that a transition occurs at some critical f0 above which there are no accessible pathways. Finally, we confirm our predictions for two experimentally determined fitness landscapes.
Modelling framework
We consider the space of all possible sequences of length L made of symbols taken from a K-letter alphabet. Each sequence i = 0, …, KL − 1 represents a genotype in the discrete genotype space, with i = 0 corresponding to sequence {0, 0, 0, …, 0} and i = KL − 1 to {K − 1, K − 1, K − 1, …, K − 1}. The genotypes are connected by a network of mutations that change one genotype into another one. We assume that any position in the sequence can mutate with equal probability and that a mutation at a given position changes the symbol at this position to another randomly selected symbol. Therefore, two genotypes are connected if they differ at exactly one position.
To create a fitness landscape, we assign random fitnesses {fi} to all genotypes. Each fitness fi is a random number drawn independently from the probability distribution uniform on [0…1). Next, we re-index all genotypes such that genotype {K − 1, K − 1, K − 1, …, K − 1} has the maximum fitness (Methods). This procedure creates an ensemble of maximally-random or “rugged” fitness landscapes with no correlations between the fitnesses of adjacent genotypes. This is arguably the simplest non-trivial model of the fitness landscape with many local maxima and minima and thus it presents an ideal test ground to investigate the role of connectivity on the accessibility of FLs. Although real landscapes are known to be moderately correlated [10, 37–39], we shall show later that our conclusions are robust in the presence of such correlations.
We define a pathway in the genotype space as a trajectory that connects two given genotypes following the links of the mutational graph. We call a pathway accessible if fitness increases monotonically from the start to the end point along the pathway. We consider only such evolutionary trajectories that begin at genotype {0, 0, 0, …, 0} and end at the best-fit, antipodal genotype {K − 1, K − 1, K − 1, …, K − 1}. Following Ref. [10], we call a FL accessible if there is at least one accessible path between the target and the initial genotype (Fig 1). We also define accessibility A as the probability that a FL randomly selected from our statistical ensemble is accessible.
Results
Computational model
Accessibility increases with K
We investigated how accessibility varied with the number of genotypes N = KL for different K. We first considered only pathways without “backward mutations”, that is pathways along which the Hamming distance from the initial genotype increased (or remained the same for K > 2) with each step. Fig 2a shows that accessibility A slowly decreases with increasing N, for all K. When K = 2, we recover the mathematical result of Hegarty et al. [31] that A tends to zero for N → ∞. However, when K > 2, accessibility approaches a finite, non-zero value in the large-N limit. The estimated asymptotic accessibility A∞ = A(N → ∞) increases with K and equals 47%, 69% and 81% for K = 4, 8, and 16 (S1 Fig) (standard errors <1%). Therefore, for sufficiently long sequences accessibility is larger for a larger number of basic unit types.
Fig 2b shows that the same trend holds when all pathways and not just the ones without backward mutations are considered. However, in the latter case A∞ remains finite also for K = 2. A comparison of panels (a), (b) in Fig 2 shows that, except for the case K = 2, accessibility is almost identical regardless of whether pathways with backward mutations are included or not. We shall see later that A∞ does not change much for K > 2 because the fraction of backward mutations is small compared to other (distance-neutral) mutations, see Fig 3. When K = 2 we recover the analytical prediction A = 0.1186… from Berestycki et al. [32]. We also note that if only shortest pathways are included for K > 2, accessibility tends to zero as N → ∞ because in this case only mutations that change the symbol “0” to K − 1 are permitted and thus the space of permitted genotypes is restricted to that of binary sequences (K = 2). Since pathways without backward mutations form only a small subset of all pathways for K > 2 (S2 Fig) in the rest of the paper we shall consider all pathways when discussing accessibility.
Larger K increases the length of accessible pathways and the fraction of indirect mutations in such pathways
We next investigated the attributes of individual accessible pathways: their length and the fraction of backward mutations. Increasing K while keeping L fixed makes accessible pathways longer (Fig 3a). Interestingly, the average length scales as ∼0.5LK and only weakly depends on N (S4 Fig). Notice that by fixing L we keep the same mutational (Hamming) distance between the initial and target genotype, thus any increase in pathway length for K > 2 must be caused by mutations that either do not change the Hamming distance from the initial genotype (distance-neutral mutations) or decrease it (backward mutations), which we jointly refer to as indirect mutations. This is indeed what happens; S2 Fig shows that the probability that an accessible path has backward or distance-neutral mutations increases with K if we compare fitness landscapes with the same L. Moreover, the higher K the more dominating are distance-neutral mutations (Fig 3a). For example, for K = 16 and L = 5 accessible pathways have on average 5% of backward, 15% of forward and 80% of distance-neutral mutations. This is in contrast to the binary case K = 2 and L = 5 in which forward mutations account for 98% of all mutations, and the remaining 2% are backward mutations. Keeping N or K fixed while changing K or L (Fig 3b and 3c) does not qualitatively change this picture. Hence, higher accessibility of FLs with multiple coding units comes with a price: accessible pathways become longer, and more mutational steps are required to reach the best-fit genotype than in the binary case (K = 2).
Initial fitness affects the number of pathways and accessibility
So far we have been assigning a random fitness to the initial genotype. We may however expect that the fitness f0 of the initial genotype strongly affects the number of accessible pathways from that genotype to the antipodal, best-fit genotype. If f0 was close to the maximal fitness we would intuitively expect few (or no) pathways, whereas for small f0 such pathways would be more likely. This phenomenon was first noted in Ref. [40] under notion of “accessibility percolation” and further studied for binary sequences (K = 2) [31, 32, 41]. Fig 4a shows that this is also the case for K > 2: the number of accessible pathways decreases with increasing f0. Moreover, accessibility sharply decreases from ≈ 1 to zero as f0 crosses a critical fcrit, different for each K (Fig 4b). This transition between accessible and inaccessible FLs becomes sharper with increasing L, and for L → ∞ the numerical value of fcrit approaches A∞. Thus, for L large enough, while accessible pathways exist for K = 2 only if the initial fitness is very low (f0 < 0.1), the number of such pathways is non-zero even for relatively large f0 when K > 2.
Accessible pathways can cover most of the FL
Fig 4a shows that the number of accessible pathways increases by several orders of magnitude when K increases from 2 to 16. This is associated with a rapid increase in the number of genotypes belonging to accessible pathways, i.e., the coverage of the FL (Fig 5). In particular, for K = 2 the average coverage of the FL equals 0.02%, whereas for K = 16 around 25% of genotypes belong to accessible pathways. Coverage can exceed 50% when the initial fitness f0 is small (S5 Fig). The increase in the number of accessible pathways and accessible genotypes has important consequences for the a posteriori predictability of biological evolution: given the initial and final genotypes, we cannot reliably predict the pathway biological evolution might have followed if K > 2.
Experimental fitness landscapes
We next verified the predictions of our computer model using two experimentally determined fitness landscapes. We first analysed the landscape of DNA-protein affinities [39], with binding affinity to the fluorescent protein allophycocyanin as a proxy for fitness. By selecting subsets of genotypes (Methods) we generated two ensembles of fitness landscapes with L = 5, K = 4 and L = 10, K = 2 (N = 1024 genotypes in each case).
We obtained accessibility A = 0.669(5) for K = 2 and A = 0.864(4) for K = 4, thus confirming our prediction that A increases with K (Fig 6a–6c). The values of A are higher than for the maximally random landscape (Fig 2) because the fitness values are negatively correlated with the distance to the target genotype, i.e., the closer to the target genotype, the higher is the fitness (Fig 6b). If we randomize the fitness landscapes by randomly swapping the fitnesses, the expected A decreases and it matches the value obtained for the maximally random landscape. Correlations between the fitness and the distance of a genotype to the target genotype can therefore significantly increase accessibility. We note that this is not caused by the difference between the distributions of f0 for randomized and original FLs because these distributions strongly overlap (Fig 6c). This observation indicates that accessibility can be enhanced by fitness-to-distance correlations in FLs, in accordance with [30].
We further investigated how the fitness of the initial genotype affects accessibility, i.e., whether there is some critical value of initial fitness separating regions of FLs with high and low accessibility. To account for the non-uniform distribution of the experimental fitnesses we transformed the fitnesses such that the new fitness distribution was uniform, while preserving the order of the fitnesses and local correlations (epistasis), i.e., if f0 < f1 < f2 < … < fN − 1 then the transformed fitnesses . The estimated critical values of are 0.69(1) and 0.90(1) for K = 2 and K = 4 respectively (Fig 7a). These values are close to the estimated values of A from Fig 6a; the critical fcrit increases with K similarly as in the maximally random model.
The presence of fitness-to-distance correlations affects also other properties of FLs: there are more accessible pathways leading to the best-fit genotype (Fig 7b), the pathways are longer and the fraction of accessible genotypes is higher than in the maximally random model (S1 Table). Moreover, the fractions of accessible genotypes for K = 2 and K = 4 are large and behave similarly when plotted against (Fig 7c), but when fitness-to-distance correlations are removed the two curves separate. Correlations thus enhance accessibility for both K = 2 and K = 4, making a posteriori predictability of evolutionary trajectories even more difficult than in the case of the maximally-random FL.
In order to confirm that the observed behaviour is not unique to the specific experimental data set of Ref. [39], we analysed a second landscape of RNA-protein binding affinity [42]. We assumed fitness to be proportional to the rate constant for the reaction of the RNA-binding subunit C5 of RNase P with all variants of the 6-nucleotide recognition site (Methods). Proceeding similarly as before, we generated sub-FLs with L = 3, K = 4 and L = 6, K = 2 (N = 64 genotypes). The results for this second landscape (Fig 6d–6f, Fig 7d–7f, S1 Table) turned out to be consistent with the results reported above.
Note that although our approach in which we identify fitness with binding affinity of a protein to an RNA or a DNA sequence is commonly used in the literature, the fitness defined in this way is not “true fitness”, i.e., it does not have to directly correspond to reproductive success of an organism in which the described interactions take place.
Discussion
Although the concept of the fitness landscape was proposed almost a century ago [1], many open problems remain. Setting aside the appropriateness of the FL to model evolution in the presence of interactions between different organisms, one of the important questions is the accessibility of local fitness peaks: are such peaks sufficiently well connected, or are they separated by “fitness valleys” which are so deep that evolution cannot reach most of them? Another question concerns the predictability of evolution: how many evolutionary trajectories lead from one genotype to another, better adapted genotype?
Our results show that the answer to these questions depends on how genotypes are connected in the genotype space, i.e., which transitions between the genotypes are permitted and which are not. Since genetic information is stored in a linear molecule (DNA), genotypes are most naturally represented as linear sequences of some “units” of information, or building blocks, and this already imposes restrictions on the structure of the mutational graph. Moreover, the nature of the smallest unit of information and how mutations convert one unit into another further affects how genotypes are connected and how they are mapped to fitnesses. The choice of the appropriate “unit” depends on the subset of the fitness landscape considered in a particular research problem. While the presence or absence (K = 2) of a certain point mutation leads to a simple mathematical model, this approach is limited to small fragments of fitness landscapes and it requires the knowledge of which mutations are important and which can be ignored [11]. How many unit types are necessary to model the most general situation? Since the DNA code uses four nucleotides, K = 4 building blocks should in theory suffice to represent all possible genotypes. On the other hand, the redundancy of the genetic code might suggest that the most appropriate unit should rather be an amino acid, in which case K ≈ 20. However, this could be disputed because there is evidence that synonymous codons can be translated at different rates [43–45] which may affect fitness.
We have seen that increasing the number of basic units or “alleles” K generally increases accessibility. This is not just a trivial consequence of increased connectivity of the mutational graph (number of permitted mutations for each genotype). Connectivity also increases with increasing sequence length L, but this actually slightly decreases accessibility (cf. Fig 2). Increasing K changes not only how many neighbours a genotype has in the genotype space, but it affects the global topology of the network of genotypes. Since accessibility of the global fitness maximum is a global property of the fitness landscape, it depends on K. This is in contrast to local properties such as the number of local fitness maxima, which depend only on the number of nearest neighbours and is thus the same for FLs with the same value of the product (K − 1)L [17].
The presence of multiple alleles has also one more important effect—it enables pathways to explore a much larger part of the fitness landscape through mutations that either do not decrease the Hamming distance to the target genotype or even increase it (backward mutations). This greatly enhances accessibility by exploring genotypes through indirect mutations, effectively circumventing regions of sign epistasis [46–48]. This observation has been recently supported by experimental study of the four-site IgG-binding domain of protein G (L = 4, K = 20) [49] which shows that direct paths are often blocked by reciprocal sign epistasis, and only indirect pathways can access high-fitness regions of the landscape.
By re-sampling two experimental fitness landscapes [39, 42] we have found that the presence of correlations between the fitnesses facilitates evolution and, as a result, accessibility and the fraction of accessible genotypes for K = 2 increases to a level comparable with that of the maximally-random FL with K > 2. For larger K the increase of accessibility is not so pronounced (accessibility is already high in the maximally random FL), but the number of accessible pathways increases by many orders of magnitudes: there are ∼1010 pathways in the presence of correlations as compared to ∼103 pathways in the randomized FL with 1024 genotypes (S1 Table). It would be interesting to see if similar effects could be observed in other large FLs [50, 51]. However, these data sets, despite a large number of genotypes included, cover only a small fraction of the genotype space due to random mutagenesis employed to generate mutant genotypes. The problem of missing genotypes with undefined fitness is why in this work we decided to use only the (almost complete) FLs of Refs. [39, 42].
Our analysis assumes that all evolutionary pathways are equally likely. This does not have to be true in many biologically relevant scenarios [3, 52, 53]. Even in the strong selection/weak mutation (SSWM) limit to which the notion of accessible pathways is most relevant, pathways of different lengths have different probabilities of realisation. In particular, for K = 2 shortest pathways (if they exist) may have higher probability of realisation than more abundant but longer indirect pathways [54]. Our work shows however that for K > 2 indirect pathways completely dominate the set of accessible pathways (cf. S3 Fig), which suggests that indirect accessible pathways with moderate lengths may be relevant in the SSWM limit. It would be interesting to see how these predictions change once the SSWM assumption is relaxed, effectively decreasing the relevance of pathways with increasing fitness [55].
Another factor that we neglected was that genotype-to-phenotype mapping is not unique due to phenotypic plasticity [56–58] in which many phenotypes may correspond to the same genotype. Moreover, fitness is not an absolute characteristic of an organism in the same way as the genotype is because it depends on the organism’s environment: physical conditions such as temperature, available nutrients, and the presence/absence of other organisms. This environmental plasticity has been shown to speed up evolution in experiments [59–61] and in silico [62–64]. Fitness can also be different at different locations even in a relatively small habitat, and chemical gradients can significantly affect the rate of biological evolution [65, 66]. The reality is thus much more complicated than our idealised model, but the picture that emerges from our work is that the mutational structure of the genotype space is an important factor affecting biological evolution, and that the use of binary sequences to represent real genotypes has limited applicability.
Methods
Computer model
For each generated fitness landscape (FL) we find the number of accessible pathways that start from the antipodal genotype {0, 0, …0} and end at the fittest (target) genotype {K − 1, K − 1, …, K − 1} using a depth-first search (DFS) algorithm with backtracking. The fitness landscape is represented as a graph with nodes corresponding to genotypes and edges to mutations between the pairs of genotypes. Each node is assigned a counter variable which specifies how many accessible pathways connect this node to the fittest genotype, and an auxiliary variable v which assumes one of the three possible node states: v = 0, v = 1 or v = 2, representing nodes that have not been visited by DFS yet, the visited ones, and the ones with their whole mutational neighbourhood explored. All counters and variables v are initially set to zero. The DFS algorithm starts from the antipodal genotype and follows only those edges along which fitness increases. When the final node (genotype with maximal fitness) has been reached the algorithm traces its steps back to the initial node while increasing the counters of all nodes it visits on its way. Particularly, if all edges from a given node have been explored, the node is labelled as explored (v = 2) and its counter variable is no longer updated. Next, the counter of this node is added to the counter of the previous node in the DFS and the search continues from the later node. If at some point the DFS algorithm attempts to visit an already explored node, only the explored node counter is added to the counter of the current node. The backtracking procedure thus exhaustively accumulates the number of all paths going from a given node to the target genotype. When the search is completed, the value of the counter at the initial node gives the total number of accessible paths starting at this node.
We tested this algorithm by comparing the number of pathways found for small (N < 210) FLs with the number obtained by a naive, recursive breath-first search algorithm. Both algorithms produced identical results but the first algorithm was substantially faster for larger graphs than the naive algorithm. For example, the described algorithm enabled us to enumerate all pathways (∼1034) on a computer-generated FL with K = 16, L = 6 which would clearly be impossible for the naive algorithm.
To obtain the accessibility of a given FL we used a simple DFS algorithm without backtracking. This reduced memory requirements as compared to the full algorithm described above, enabling FLs with higher K and/or L to be analysed. To obtain the number of forward, backward and distance-neutral mutations in pathways of a given length, we modified the algorithm so that it stored a two-dimensional histogram—a generalized counter variable—for each node. During backtracking the histograms were accumulated and stored as the final histogram in the antipodal genotype (in S6 Fig example of such histogram is plotted for the experimental FL with K = 4 and L = 10).
Experimental fitness landscape
The first experimental FL we used was the protein-binding landscape of DNA oligomers of length L = 10 [39]. This landscape has been previously shown to be rugged, with many minima and maxima, and is thus not too dissimilar from our computer-modelled landscapes. However, the landscape shows strong correlations (Fig. 2 in Ref. [39]) and thus it is not evident how our results from the computer model of maximally-random, uncorrelated FLs will apply to this landscape.
The natural number of coding units, or “alleles”, for this landscape is K = 4 since each site of the sequence can be in one of four possible states (4 nucleotides A,T,G,C). The total number of genotypes is 410 ≈ 1M. In order to obtain quasi-independent fitness landscapes necessary for statistical analysis from just one, large data set and FLs for other values of K (in particular for K = 2), we followed the procedure described below.
To obtain a sample landscape with K = 4, L = 5, we randomly selected 5 different positions from the 10-nucleotide sequence, and a random sequence where ci was one of A,T,G, or C. We then selected all 10-nucleotide sequences with nucleotides at positions , and any nucleotides at the remaining L = 5 positions. This generated a sub-landscape of the full fitness landscape with N = 45 = 1024 genotypes with the same “genetic background” at positions . By repeating the above procedure we created 10k (of the total ≈ 258k different) FLs with different genetic backgrounds (different ).
To generate a sample landscape with K = 2, L = 10, we selected a random sequence from all 10-nucleotide sequences, and another sequence such that it differed from at all positions. We associated sequences with two antipodal binary sequences 00…0 and 11…1 of length L = 10, and constructed the fitness landscape from all N = 2L = 1024 sequences such that the i-th nucleotide was either xi or yi. We repeated this procedure to sample 10k landscapes of the total ≈ 60M different possible FLs.
After generating the sets of fitness landscapes with K = 4 and K = 2 with the same number of genotypes N = 1024, the corresponding fitness values were normalized to be between 0 and 1. Accessibility was determined for mutational pathways starting at the antipodal genotype and ending at the genotype with maximum fitness. For K = 4 we used the same convention of selecting the antipodal genotype as in the computer-generated FLs (see below).
For our second FL we used the rate constants for the reaction between the RNA-binding subunit C5 of RNase P recognition and its RNA substrate, with all variants of its recognition site of length L = 6 [42]. Using the same approach as described in the Methods of Ref. [42] we calculated the rates relative to the rate constant of the wild-type RNA sequence. In our analysis we were less stringent than Ref. [42] on removing sequence variants that exhibited biased amplification, i.e., we kept sequences that had reads above quality threshold for at least three time points. We took as fitness the relative rate constant averaged over different time points; we also checked that other methods, e.g. selecting with the largest difference in the corresponding raw read count, did not significantly affect our results. The analysis yielded fitness values for 3940 out of the total 46 = 4096 possible sequences. The remaining 4% sequences for which fitness could not be determined were assigned the fitness of the least fit genotype.
Adapting the procedure described for the first experimental landscape we obtained sub-landscapes of the full experimental landscape with K = 4, L = 3 and K = 2, L = 6, with N = 64 genotypes each. Since these sub-FLs were sampled from the landscape smaller than the first one and the total number of possible sub-FLs was much smaller than before (e.g., 1280 for K = 4), we generated only 1k sub-FLs for each K. We also normalized the fitness to be between 0 and 1, and performed the same analysis as for the first experimental data set.
Re-indexing fitness landscapes
In both the computer model and the analysis of the real FLs we re-index all genotypes such that the best-fit (target) genotype is assigned the sequence {K − 1, K − 1, …, K − 1}. The antipodal genotype is then {0, 0, …0}. When fitness values are drawn at random no actual re-indexing is necessary but the series of N = KL fitnesses is generated and the highest value is assigned to the target genotype while the remaining fitness values are assigned to the remaining genotypes. For sub-landscapes of the real fitness landscapes we define a map that assigns the target genotype sequence to the fittest genotype while preserving all correlations and the structure of the FL; in particular, each genotype has the same nearest neighbours in the re-indexed FL. For K = 2 we proceed as follows: if the fittest genotype sequence has 0s at k positions and 1s at the remaining positions, the map changes 0 → 1 and 1 → 0 at the positions and for the remaining positions there are no changes, see S7 Fig. Applying this transformation to all genotypes results in a FL with the desired indexing (the fittest genotype sequence contains only 1s and the antipodal genotype sequence only 0s). This mapping is unique for K = 2, but in the case of K > 2 the antipodal genotype can be selected in many ways. We thus modify the algorithm as follows for K > 2. Assume that the fittest genotype sequence has characters at positions such that 0 ≤ c1, c2, …, ck < K − 1. We then define the following reassignment: c1 → K − 1, …, ck → K − 1, which maps the fittest genotype to the sequence {K − 1, K − 1, …, K − 1}. There are now (K − 1)L possible choices for the antipodal genotype and we remove this ambiguity by randomly selecting one of them and extending the mapping (character reassignment) in such a way that the selected genotype is mapped onto the sequence {0, 0, …0}. The character reassignment rules described above are then applied to all remaining genotypes. We note that these rules apply only to specific positions in the sequence; for example the mapping 1 → K − 1 at position 3 will be applied only if a sequence has character 1 at position 3, but not if there is a 1 at another position, unless there is a rule 1 → K − 1 specifically for that position. We also note that the algorithm does not change characters which are neither assigned to (K − 1) (target) nor to 0 (antipodal).
Equivalently to re-indexing sub-landscapes of the real fitness landscape one could adapt the DFS algorithm and related structures storing accessible pathways to have arbitrary target and initial genotypes, i.e., not encoded by sequences {K − 1, K − 1, …, K − 1} and {0, 0, …0}. The outcome of both approaches would be identical, as the fitness landscapes before and after re-indexing are isomorphic (S7 Fig).
Supporting Information
Acknowledgments
We thank Marjon de Vos and Oliver Martin for critically reading the manuscript.
Data Availability
All data and computer programs are available in Supporting Information (S1 Data and Code).
Funding Statement
MZ acknowledges the Polish National Science Centre grant no. DEC-2012/07/N/NZ2/00107. BW was supported by the Scottish Government/Royal Society of Edinburgh Personal Research Fellowship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Wright S. The roles of mutation, inbreeding, crossbreeding, and selection in evolution. Proceedings of the Sixth International Congress of Genetics. 1932;1:356–366. [Google Scholar]
- 2. Orencia MC, Yoon JS, Ness JE, Stemmer WPC, Stevens RC. Predicting the emergence of antibiotic resistance by directed evolution and structural analysis. Nature Structural & Molecular Biology. 2001;8(3):238–242. 10.1038/84981 [DOI] [PubMed] [Google Scholar]
- 3. Weinreich DM, Delaney NF, DePristo MA, Hartl DL. Darwinian evolution can follow only very few mutational paths to fitter proteins. Science. 2006;312(5770):111–114. 10.1126/science.1123539 [DOI] [PubMed] [Google Scholar]
- 4. Martinez JL, Baquero F, Andersson DI. Predicting antibiotic resistance. Nature Reviews Microbiology. 2007;5(12):958–965. 10.1038/nrmicro1796 [DOI] [PubMed] [Google Scholar]
- 5. Novais Â, Comas I, Baquero F, Cantón R, Coque TM, Moya A, et al. Evolutionary trajectories of β-lactamase CTX-M-1 cluster enzymes: Predicting antibiotic resistance. PLoS Pathogens. 2010;6(1):e1000735 10.1371/journal.ppat.1000735 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Toprak E, Veres A, Michel JB, Chait R, Hartl DL, Kishony R. Evolutionary paths to antibiotic resistance under dynamically sustained drug selection. Nature Genetics. 2012;44(1):101–105. 10.1038/ng.1034 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Rodrigues JV, Bershtein S, Li A, Lozovsky ER, Hartl DL, Shakhnovich EI. Biophysical principles predict fitness landscapes of drug resistance. Proceedings of the National Academy of Sciences. 2016;113(11):E1470–E1478. 10.1073/pnas.1601441113 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Bennett GM, Moran NA. Small, smaller, smallest: The origins and evolution of ancient dual symbioses in a phloem-feeding insect. Genome Biology and Evolution. 2013;5(9):1675–1688. 10.1093/gbe/evt118 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Carneiro M, Hartl DL. Adaptive landscapes and protein evolution. Proceedings of the National Academy of Sciences. 2010;107(suppl 1):1747–1751. 10.1073/pnas.0906192106 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Szendro IG, Schenk MF, Franke J, Krug J, de Visser JAGM. Quantitative analyses of empirical fitness landscapes. Journal of Statistical Mechanics: Theory and Experiment. 2013;2013(01):P01005 10.1088/1742-5468/2013/01/P01005 [DOI] [Google Scholar]
- 11. de Visser JAGM, Krug J. Empirical fitness landscapes and the predictability of evolution. Nature Reviews Genetics. 2014;15(7):480–490. 10.1038/nrg3744 [DOI] [PubMed] [Google Scholar]
- 12. Palmer AC, Toprak E, Baym M, Kim S, Veres A, Bershtein S, et al. Delayed commitment to evolutionary fate in antibiotic resistance fitness landscapes. Nature Communications. 2015;6 10.1038/ncomms8385 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Marcusson LL, Frimodt-Møller N, Hughes D. Interplay in the selection of fluoroquinolone resistance and bacterial fitness. PLoS Pathogens. 2009;5(8):e1000541 10.1371/journal.ppat.1000541 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Fontana W, Schuster P. A computer model of evolutionary optimization. Biophysical Chemistry. 1987;26(2):123–147. 10.1016/0301-4622(87)80017-0 [DOI] [PubMed] [Google Scholar]
- 15. Court SJ, Waclaw B, Allen RJ. Lower glycolysis carries a higher flux than any biochemically possible alternative. Nature Communications. 2015;6:8427 10.1038/ncomms9427 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Ullrich A, Rohrschneider M, Scheuermann G, Stadler P, Flamm C. In silico evolution of early metabolism. Artificial life. 2011;17(2):87–108. 10.1162/artl_a_00021 [DOI] [PubMed] [Google Scholar]
- 17. Kauffman S, Levin S. Towards a general theory of adaptive walks on rugged landscapes. Journal of Theoretical Biology. 1987;128(1):11–45. 10.1016/S0022-5193(87)80029-2 [DOI] [PubMed] [Google Scholar]
- 18. Kauffman SA, Weinberger ED. The NK model of rugged fitness landscapes and its application to maturation of the immune response. Journal of Theoretical Biology. 1989;141(2):211–245. 10.1016/S0022-5193(89)80019-0 [DOI] [PubMed] [Google Scholar]
- 19. Aita T, Uchiyama H, Inaoka T, Nakajima M, Kokubo T, Husimi Y. Analysis of a local fitness landscape with a model of the rough Mt. Fuji-type landscape: Application to prolyl endopeptidase and thermolysin. Biopolymers. 2000;54(1):64–79. 10.1002/(SICI)1097-0282(200007)54:1%3C64::AID-BIP70%3E3.0.CO;2-R [DOI] [PubMed] [Google Scholar]
- 20. Gavrilets S. A dynamical theory of speciation on holey adaptive landscapes. The American Naturalist. 1999;154(1):1–22. 10.1086/303217 [DOI] [PubMed] [Google Scholar]
- 21. Gavrilets S, Gravner J. Percolation on the fitness hypercube and the evolution of reproductive isolation. Journal of Theoretical Biology. 1997;184(1):51–64. 10.1006/jtbi.1996.0242 [DOI] [PubMed] [Google Scholar]
- 22. Fontana W, Stadler PF, Bornberg-Bauer EG, Griesmacher T, Hofacker IL, Tacker M, et al. RNA folding and combinatory landscapes. Physical Review E. 1993;47:2083–2099. 10.1103/PhysRevE.47.2083 [DOI] [PubMed] [Google Scholar]
- 23. Ancel LW, Fontana W. Plasticity, evolvability, and modularity in RNA. Journal of Experimental Zoology. 2000;288(3):242–283. 10.1002/1097-010X(20001015)288:3%3C242::AID-JEZ5%3E3.0.CO;2-O [DOI] [PubMed] [Google Scholar]
- 24. Cowperthwaite MC, Meyers LA. How mutational networks shape evolution: lessons from RNA models. Annual review of ecology, evolution, and systematics. 2007; p. 203–230. 10.1146/annurev.ecolsys.38.091206.095507 [DOI] [Google Scholar]
- 25. Maynard Smith J. Natural selection and the concept of a protein space. Nature. 1970;225(5232):563–564. 10.1038/225563a0 [DOI] [PubMed] [Google Scholar]
- 26. Babajide A, Hofacker IL, Sippl MJ, Stadler PF. Neutral networks in protein space: A computational study based on knowledge-based potentials of mean force. Folding and Design. 1997;2(5):261–269. 10.1016/S1359-0278(97)00037-0 [DOI] [PubMed] [Google Scholar]
- 27. Povolotskaya IS, Kondrashov FA. Sequence space and the ongoing expansion of the protein universe. Nature. 2010;465(7300):922–926. 10.1038/nature09105 [DOI] [PubMed] [Google Scholar]
- 28. Gillespie JH. Some properties of finite populations experiencing strong selection and weak mutation. The American Naturalist. 1983;121(5):691–708. 10.1086/284095 [DOI] [Google Scholar]
- 29. Gillespie JH. Molecular evolution over the mutational landscape. Evolution. 1984;38(5):1116–1129. 10.2307/2408444 [DOI] [PubMed] [Google Scholar]
- 30. Franke J, Klözer A, de Visser JAGM, Krug J. Evolutionary accessibility of mutational pathways. PLoS computational biology. 2011;7(8):e1002134 10.1371/journal.pcbi.1002134 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Hegarty P, Martinsson A. On the existence of accessible paths in various models of fitness landscapes. The Annals of Applied Probability. 2014;24(4):1375–1395. 10.1214/13-AAP949 [DOI] [Google Scholar]
- 32. Berestycki J, Brunet É, Shi Z. Accessibility percolation with backsteps. ArXiv e-prints. 2014. [Google Scholar]
- 33. Huynen MA, Stadler PF, Fontana W. Smoothness within ruggedness: The role of neutrality in adaptation. Proceedings of the National Academy of Sciences. 1996;93(1):397 10.1073/pnas.93.1.397 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Greenbury SF, Schaper S, Ahnert SE, Louis AA. Genetic correlations greatly increase mutational robustness and can both reduce and enhance evolvability. PLOS Comput Biol. 2016;12(3):e1004773 10.1371/journal.pcbi.1004773 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. Fontana W. Continuity in evolution: on the nature of transitions. Science. 1998;280(5368):1451–1455. 10.1126/science.280.5368.1451 [DOI] [PubMed] [Google Scholar]
- 36. Graner Fmc, Glazier JA. Simulation of biological cell sorting using a two-dimensional extended Potts model. Physical Review Letters. 1992;69:2013–2016. 10.1103/PhysRevLett.69.2013 [DOI] [PubMed] [Google Scholar]
- 37. Otwinowski J, Nemenman I. Genotype to phenotype mapping and the fitness landscape of the E. coli lac promoter. PLOS ONE. 2013;8(5):e61570 10.1371/journal.pone.0061570 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Kouyos RD, Leventhal GE, Hinkley T, Haddad M, Whitcomb JM, Petropoulos CJ, et al. Exploring the complexity of the HIV-1 fitness landscape. PLOS Genetics. 2012;8(3):e1002551 10.1371/journal.pgen.1002551 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. Rowe W, Platt M, Wedge DC, Day PJ, Kell DB, Knowles J. Analysis of a complete DNA-protein affinity landscape. Journal of The Royal Society Interface. 2010;7(44):397–408. 10.1098/rsif.2009.0193 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. Nowak S, Krug J. Accessibility percolation on n-trees. EPL (Europhysics Letters). 2013;101(6):66004 10.1209/0295-5075/101/66004 [DOI] [Google Scholar]
- 41. Berestycki J, Brunet E, Shi Z. The number of accessible paths in the hypercube. Bernoulli. 2016;22(2):653–680. 10.3150/14-BEJ641 [DOI] [Google Scholar]
- 42. Guenther UP, Yandek LE, Niland CN, Campbell FE, Anderson D, Anderson VE, et al. Hidden specificity in an apparently nonspecific RNA-binding protein. Nature. 2013;502(7471):385–388. 10.1038/nature12543 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Saunders R, Deane CM. Synonymous codon usage influences the local protein structure observed. Nucleic Acids Research. 2010;38(19):6719–6728. 10.1093/nar/gkq495 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Novoa EM, de Pouplana LR. Speeding with control: codon usage, tRNAs, and ribosomes. Trends in Genetics. 2012;28(11):574–581. 10.1016/j.tig.2012.07.006 [DOI] [PubMed] [Google Scholar]
- 45. O’Brien EP, Vendruscolo M, Dobson CM. Kinetic modelling indicates that fast-translating codons can coordinate cotranslational protein folding by avoiding misfolded intermediates. Nature Communications. 2014; 5 10.1038/ncomms3988 [DOI] [PubMed] [Google Scholar]
- 46. Weinreich DM, Watson RA, Chao L. Perspective: sign epistasis and genetic constraint on evolutionary trajectories. Evolution. 2005;59(6):1165–1174. 10.1111/j.0014-3820.2005.tb01768.x [DOI] [PubMed] [Google Scholar]
- 47. Poelwijk FJ, Tănase-Nicola S, Kiviet DJ, Tans SJ. Reciprocal sign epistasis is a necessary condition for multi-peaked fitness landscapes. Journal of Theoretical Biology. 2011;272(1):141–144. 10.1016/j.jtbi.2010.12.015 [DOI] [PubMed] [Google Scholar]
- 48. Dawid A, Kiviet DJ, Kogenaru M, de Vos M, Tans SJ. Multiple peaks and reciprocal sign epistasis in an empirically determined genotype-phenotype landscape. Chaos. 2010;20(2). 10.1063/1.3453602 [DOI] [PubMed] [Google Scholar]
- 49. Wu NC, Dai L, Olson CA, Lloyd-Smith JO, Sun R. Adaptation in protein fitness landscapes is facilitated by indirect paths. eLife. 2016;5:e16965 10.7554/eLife.16965 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50. Puchta O, Cseke B, Czaja H, Tollervey D, Sanguinetti G, Kudla G. Network of epistatic interactions within a yeast snoRNA. Science. 2016. 10.1126/science.aaf0965 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51. Sarkisyan KS, Bolotin DA, Meer MV, Usmanova DR, Mishin AS, Sharonov GV, et al. Local fitness landscape of the green fluorescent protein. Nature. 2016;533(7603):397–401. 10.1038/nature17995 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52. Orr HA. The genetic theory of adaptation: a brief history. Nature Reviews Genetics. 2005;6(2):119–127. 10.1038/nrg1523 [DOI] [PubMed] [Google Scholar]
- 53. Lozovsky ER, Chookajorn T, Brown KM, Imwong M, Shaw PJ, Kamchonwongpaisan S, et al. Stepwise acquisition of pyrimethamine resistance in the malaria parasite. Proceedings of the National Academy of Sciences. 2009;106(29):12025–12030. 10.1073/pnas.0905922106 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Roy SW. Probing evolutionary repeatability: Neutral and double changes and the predictability of evolutionary adaptation. PLoS ONE. 2009;4(2):1–6. 10.1371/journal.pone.0004500 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55. Szendro IG, Franke J, de Visser JAGM, Krug J. Predictability of evolution depends nonmonotonically on population size. Proceedings of the National Academy of Sciences. 2013;110(2):571–576. 10.1073/pnas.1213613110 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56. Price TD, Qvarnström A, Irwin DE. The role of phenotypic plasticity in driving genetic evolution. Proceedings of the Royal Society of London B: Biological Sciences. 2003;270(1523):1433–1440. 10.1098/rspb.2003.2372 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57. Feinberg AP. Phenotypic plasticity and the epigenetics of human disease. Nature. 2007;447(7143):433–440. 10.1038/nature05919 [DOI] [PubMed] [Google Scholar]
- 58. Koonin EV, Wolf YI. Constraints and plasticity in genome and molecular-phenome evolution. Nature Reviews Genetics. 2010;11(7):487–498. 10.1038/nrg2810 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59. de Vos MGJ, Dawid A, Sunderlikova V, Tans SJ. Breaking evolutionary constraint with a tradeoff ratchet. Proceedings of the National Academy of Sciences. 2015;112(48):14906–14911. 10.1073/pnas.1510282112 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60. de Vos MGJ, Poelwijk FJ, Battich N, Ndika JDT, Tans SJ. Environmental dependence of genetic constraint. PLoS Genetics. 2013;9(6):1–8. 10.1371/journal.pgen.1003580 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61. Stiffler MA, Hekstra DR, Ranganathan R. Evolvability as a function of purifying selection in TEM-1 β-lactamase. Cell. 2015;160(5):882–892. 10.1016/j.cell.2015.01.035 [DOI] [PubMed] [Google Scholar]
- 62. Kashtan N, Noor E, Alon U. Varying environments can speed up evolution. Proceedings of the National Academy of Sciences. 2007;104(34):13711–13716. 10.1073/pnas.0611630104 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63. Mustonen V, Lässig M. Molecular evolution under fitness fluctuations. Physical Review Letters. 2008;100:108101 10.1103/PhysRevLett.100.108101 [DOI] [PubMed] [Google Scholar]
- 64. Bitbol AF, Schwab DJ. Quantifying the role of population subdivision in evolution on rugged fitness landscapes. PLoS Computational Biology. 2014;10(8):1–15. 10.1371/journal.pcbi.1003778 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65. Zhang Q, Lambert G, Liao D, Kim H, Robin K, Tung Ck, et al. Acceleration of emergence of bacterial antibiotic resistance in connected microenvironments. Science. 2011;333(6050):1764–1767. 10.1126/science.1208747 [DOI] [PubMed] [Google Scholar]
- 66. Greulich P, Waclaw B, Allen RJ. Mutational pathway determines whether drug gradients accelerate evolution of drug-resistant cells. Physical Review Letters. 2012;109(8). 10.1103/PhysRevLett.109.088101 [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All data and computer programs are available in Supporting Information (S1 Data and Code).