Abstract
Mutations in non-coding regulatory DNA sequences can alter gene expression, organismal phenotype, and fitness1–3. Constructing complete fitness landscapes, mapping DNA sequences to fitness, is a long-standing goal in biology, but has remained elusive because it is challenging to generalize reliably to vast sequence spaces4–6. Here, we construct sequence-to-expression models that capture fitness landscapes and use them to decipher principles of regulatory evolution. Using millions of randomly-sampled promoter DNA sequences and their measured expression levels in the yeast Saccharomyces cerevisiae, we learn deep neural network models that generalize with excellent prediction performance, and enable sequence design for expression engineering. Using our models, we study expression divergence under genetic drift and strong-selection weak-mutation regimes to find that regulatory evolution is rapid and subject to diminishing returns epistasis, that conflicting expression objectives in different environments constrain expression adaptation, and that stabilizing selection on gene expression leads to the moderation of regulatory complexity. We present an approach for using our models to detect signatures of selection on expression from natural variation in regulatory sequences and use it to discover an instance of convergent regulatory evolution. We assess mutational robustness, finding that regulatory mutation effect sizes follow a power law, characterize regulatory evolvability, visualize promoter fitness landscapes, discover evolvability archetypes and highlight the mutational robustness of natural regulatory sequence populations. Our work provides a general framework for addressing fundamental questions in regulatory evolution.
INTRODUCTION
Changes in cis-regulatory elements (CREs) play a major role in the evolution of gene expression1. Mutations in CREs can affect their interactions with transcription factors (TFs), change the timing, location, and level of gene expression, and impact organismal phenotype and fitness2,3. While TFs evolve slowly because they each regulate many target genes, CREs evolve much faster and are thought to drive substantial phenotypic variation7. Thus, understanding how cis-regulatory sequence variation affects gene expression, phenotype and organismal fitness is fundamental to our understanding of regulatory evolution2.
A fitness function maps genotypes (which vary through mutations) to their corresponding organismal fitness values (where selection operates)8. A complete fitness landscape9 is defined by a fitness function that maps each sequence in a sequence space to its associated fitness, coupled with an approach for visualizing the sequence space. Partial fitness landscapes have been characterized empirically4,5,10, often defining fitness as the maximum growth rate of single-cell organisms4,11. Many recent empirical fitness landscape studies of proteins12, adeno-associated viruses13, catalytic RNAs14, promoters15, and TF binding sites16 have favored molecular activities as fitness proxies because they are less susceptible to experimental biases and measurement noise17. In particular, the molecular activity of a promoter sequence as reflected in the expression of the regulated gene has been used to build a ‘promoter fitness landscape’18. However, despite advances in high-throughput measurements, empirical fitness landscape studies often sample sequences in the local neighborhood of natural ones and thus remain limited to a tiny subset of the complete sequence space whose size grows exponentially with sequence length (4L for DNA or RNA, where L is the length of sequence)4–6.
Understanding the relationship between promoter sequence, expression phenotype, and fitness would allow us to answer fundamental questions6 in evolution and gene regulation, and provide an invaluable bioengineering tool6,19. A model that accurately approximates the relationship between sequence and expression can serve as an “oracle” in evolutionary studies to conduct and interpret in-silico experiments20–23, predict which regulatory mutations affect expression and fitness (when coupled with expression-to-fitness curves11), design or evolve new sequences with desired characteristics, determine how quickly selection achieves an expression optimum, identify signatures of selective pressures on extant regulatory sequences, visualize fitness landscapes and characterize mutational robustness and evolvability2,4–6,24,25.
Here, we address these long-standing problems by developing a framework for studying regulatory evolution and fitness landscapes (Fig. 1a) based on Saccharomyces cerevisiae promoter sequence-to-expression models.
RESULTS
Models predict expression from sequence
We begin by building models that predict gene expression given an 80 bp promoter DNA sequence. To train these models, we measure the expression driven by promoter sequences using an approach we previously described26, where 80 bp of DNA are embedded within a promoter construct and the associated expression is assayed in the S. cerevisiae (Methods). We clone promoter sequences into an episomal low copy number YFP expression vector, transform them into yeast, culture the yeast in the desired media, sort the yeast into 18 expression bins, and sequence the promoters present from the yeast in each bin to estimate expression (Methods and Supplementary Information). To avoid biases5 towards extant sequences, we measured the expression of 80 bp random DNA sequences, where each base is randomly sampled from the four bases. For training data, we measured each of >30 million sequences in complex media (YPD, Methods) and >20 million sequences in defined media (SD-Ura, synthetic defined lacking uracil). Using the resulting pairs of sequences and measured YFP expression levels, we trained convolutional neural network models (“convolutional models”) that predict expression from sequence in each medium (Methods).
To show that the learned convolutional models generalize to new sequences, we predicted the expression for several sets of test sequences not seen during model training, and compared them to their experimentally measured levels (Methods). For these test sequences, we quantified expression in independent experiments using the same experimental approach and in the same media. Our convolutional models had excellent prediction performance on native yeast promoter test sequences (Pearson’s r = 0.960, P < 5*10−324, n=61,150; Fig. 1b), and on multiple other test sets in both complex and defined media (Extended Data Fig. 1).
These results represent a ~45% decrease in error compared to the performance of biochemical models we previouly26 trained on the same data (complex media; native yeast promoter test sequences; Supplementary Notes and Methods). Other published genomic model architectures adapted to and trained using our data also had excellent performance (Supplementary Fig. 4a), highlighting the predictive power of deep neural network models trained using our large-scale data. Finally, the expression measurements were highly correlated for the same sequences between the two media (Pearson’s r = 0.978, Extended Data Fig. 2a) and models trained on defined medium predicted expression in complex medium well (Pearson’s r = 0.966, Extended Data Fig. 2b). However, for some sequences we expect differences between growth conditions (below).
Models enable expression engineering
We leveraged the high predictive performance of our convolutional models for a synthetic biology application of gene expression engineering, by using model predictions as a ‘fitness function’ for genetic algorithms (GA) to design sequences with extreme expression values. We initialized the GA with a population of 100,000 randomly-generated samples from the sequence space, and simulated 10 generations to maximize (or minimize) the expression output from the convolutional model (Methods). We then synthesized the 500 sequences with the top predicted maximum (or minimum) expression levels and tested them experimentally. The GA-designed sequences drove, on average, more extreme expression than >99% of native sequences (99.6% for high expressing; 99.3% for low), with ~20% of designed sequences yielding more extreme expression than any native sequence tested (23.5% for high; 18.4% for low) (Fig. 1c). Thus, our sequence-to-expression model can be used for gene expression engineering.
Expression diverges under genetic drift
We next assessed the evolutionary malleability of expression under different evolutionary scenarios: random genetic drift, stabilizing selection, and directional selection for extreme expression levels (Fig. 2). In each case, we first simulated the scenario, using our convolutional model to predict the expression for each sequence, and then tested the model’s evolved sequences experimentally, where possible (Methods).
We first simulated random genetic drift of regulatory sequences, with no selection on expression levels. We randomly introduced a single mutation in each random starting sequence, repeated this process for multiple consecutive generations, and used our convolutional model to predict the difference in expression between the mutated sequences in each trajectory relative to the corresponding starting sequence (Fig. 2a-c). Expression levels diverged as the number of mutations increased, with 32 mutations in the 80 bp region resulting in nearly as different expression from the original sequence as two unrelated sequences (Fig. 2b). We validated our results experimentally by synthesizing sequences with zero to three random mutations and measuring their expression in our assay (Methods). The experimental measurements closely matched our predictions in both complex (Fig. 2c) and defined (Extended Data Fig. 1e) media, both in expression change (Pearson’s r: 0.869 and 0.847, respectively; Extended Data Fig. 1h,i) and level (Pearson’s r: 0.973 and 0.963 respectively; Extended Data Fig. 1l,m).
Stabilizing selection tempers complexity
Although gene regulatory networks often appear to be highly interconnected26,27, the sources of this regulatory complexity and how it changes with the turnover of regulatory mechanisms28 remain unclear. We used our model to study the evolution of regulatory complexity in the context of stabilizing selection, which favors the maintenance of existing expression levels. We first quantified regulatory complexity, defined as 1 minus the Gini coefficient (a measure of inequality of continuous values within a population) of TF regulatory interaction strengths. For this, we used an interpretable biochemical model we previously developed26 (Methods) because it has parameters that explicitly correspond to TFs, and we can directly query their contributions to model predictions. Next, starting with native sequences whose regulatory complexity is either extremely high (many TFs with similar contributions to expression) or low (few TFs contribute disproportionately to expression) and spanning a range of expression levels, we introduced single mutations into each starting native sequence for each of 32 consecutive generations, identified the sequences that conserved the original expression level using the convolutional model, and selected one of them at random for the next generation. We then assessed the regulatory complexity of the evolved sequences.
As random mutations accumulated, the regulatory complexity of sequences starting at both complexity extremes shifted towards moderate complexities (Fig. 2d, rightmost blue and orange), closer to the averages for both random and native sequences (Fig. 2d, greys). This suggests that stabilizing selection on expression leads to a moderation of regulatory complexity, resulting from gradual drift in the roles of the different regulators, such as an increase in complexity due to a decrease in the relative contribution of one predominant TF (e.g. Abf1p for AIF1), or a decrease in complexity through smaller changes in a much larger number of sites (e.g. YDR476C; Supplementary Fig. 8). The overall distribution of regulatory complexity of native yeast promoters is similar to that of random sequences (Fig. 2d, grey boxes), suggesting that there is little selection on the regulatory complexity of native sequences in a single environment.
Strong selection rapidly finds extrema
To study the impact of directional selection on expression, we simulated the strong-selection weak-mutation (SSWM) regime29 (Fig. 2e, Methods), where each mutation is either beneficial or deleterious (strong selection, with mutations surviving drift and fixing in an asexual population), and mutation rates are low enough to only consider single base substitutions during adaptive walks (weak mutation). Starting with a set of native promoter sequences, at each iteration (generation), for a given starting sequence of length L, we considered all of its 3L single-base mutational neighbors, used our convolutional model to predict their expression, and took the sequence with the largest increase (or separately, decrease) in expression at each iteration (generation) as the starting sequence for the next generation (Fig. 2e, Methods).
Sequences that started with diverse initial expression levels rapidly evolved to high (or separately, low) expression, with the vast majority evolving close to saturating extreme levels within 3–4 mutations in both the complex (Fig. 2f) and defined (Extended Data Fig. 1f) media. Sequences took diverse paths to evolve either high or low expression (Supplementary Fig. 7). We validated these trajectories experimentally for select series of sequences (Fig. 2g, Extended Data Fig. 1g), measuring the expression driven by synthesized sequences from several generations along simulated mutational trajectories for complex media (10,322 sequences from 877 trajectories) and defined media (6,304 sequences from 637 trajectories). We observed extreme expression within 3–4 mutational steps, with high agreement between measured and predicted expression change (Extended Data Fig. 1j,k; Pearson’s r: 0.977 and 0.948, respectively) and expression levels (Extended Data Fig. 1n,o; Pearson’s r: 0.980 and 0.963) along the trajectories in both complex and defined media. Thus, cis-regulatory sequence evolution is rapid and subject to diminishing returns epistasis30.
Opposing objectives constrain adaptation
In contrast to the rapid evolution towards expression extremes, we found that evolution to satisfy two opposing expression requirements (one in each growth media) was more constrained. A concrete example is the expression of the URA3 gene: organismal fitness increases with increased URA3 expression in defined media lacking uracil, because Ura3p is required for uracil biosynthesis, but fitness decreases with increased URA3 expression in complex media containing 5-FOA due to Ura3p-mediated conversion of 5-FOA to toxic 5-fluorouracil (Extended Data Fig. 2c). To study this regime31, we started with a set of native promoter sequences (and separately, a set of random sequences) and used the convolutional model to simulate SSWM trajectories (Methods) that maximize the difference in expression between the two media (defined and complex). While the difference in expression increased with each generation (Extended Data Fig. 2d,e), the vast majority of sequences achieved neither the maximal nor the minimal expression in either condition after 10 generations (Fig. 2h, Extended Data Fig. 2f), for both native and random starting sequences. The evolved sequences became enriched for motifs for TFs involved in nutrient sensing and metabolism, compared to the starting sequences (Extended Data Fig. 2g), suggesting that the model is taking advantage of subtle differential activity of certain regulators between the two conditions to evolve condition specificity. Thus, while evolving a sequence to achieve a single expression optimum requires very few mutations, encoding multiple opposing objectives in the same sequence is more difficult, limiting expression adaptation.
Transformers enable inference at scale
We next turned to the evolution and evolvability of regulatory sequences in extant strains and species. This required us to predict expression for billions of sequences and, although our convolutional model had excellent predictive power, our implementation was limited in its scalability and incompatible with the Tensor Processing Units (TPUs), available to us for larger-scale computational tasks (Methods). To enable large-scale expression prediction, we developed “transformer” models that used transformer encoders32 with other building blocks attempting to implicitly capture known aspects of regulation33 (Methods, Supplementary Fig. 12). The transformer models had ~20x fewer parameters than the convolutional models (Methods, Supplementary Information), predicted expression as well as the convolutional models (Extended Data Fig. 3), and better captured the propensity for expression to plateau under SSWM (Supplementary Fig. 19). The convolutional and transformer models had highly correlated predictions in both media (Supplementary Fig. 4e-h, Pearson’s r=0.967–0.985), and yielded equivalent conclusions from the analyses of genetic drift, directional selection and conflicting objectives (Extended Data Fig. 3, Supplementary Fig. 17-18).
The Expression Conservation Coefficient
We applied our sequence-to-expression transformer model to detect evidence of selective pressures on natural regulatory sequences, inspired by the way in which the ratio of non-synonymous (“non-neutral”) to synonymous (“neutral”) substitutions (dN/dS) in protein coding sequences is used estimate the strength and mode of natural selection34. By analogy2,35, for regulatory sequences2, we used the transformer model to quantitatively assess the impact of naturally occurring regulatory genetic variation on expression, compared to that expected with random mutations, and summarized this with an Expression Conservation Coefficient (ECC) (Methods). To compute the ECC, we compared, for each gene’s promoter, the standard deviation of the expression distribution predicted by the transformer model for a set of naturally varying orthologous promoters (σB) to the standard deviation of the expression distribution predicted for a matched set of random variation introduced to that promoter (σC; related to the mutational variance36; Fig. 3a). We define the ECC for a gene as log(σc/σB), such that a positive ECC indicates stabilizing selection on expression (lower variance in native sequences than expected by chance), a negative ECC indicates diversifying (disruptive) selection or local adaptation (greater variance in native sequences), and values near 0 suggest neutral drift.
We calculated the ECC for 5,569 S. cerevisiae genes using the natural variation observed across over 4.73 million orthologous promoter sequences from the 1,011 S. cerevisiae isolates37 in the −160 to −80 regions (with respect to the Transcription Start Site (TSS)), a critical location for TF binding38 and determinant of promoter activity26 (Fig. 3a,b, Supplementary Table 1), using our transformer model to predict the expression for each sequence. To assess the robustness of the ECC values, we recomputed the ECC using multiple published sequence-to-expression model architectures that we adapted and trained using our data and found that models with similarly high predictive power resulted in similar ECC values (Supplementary Fig. 4b-d, 5g).
Over 70% of promoters had positive ECCs, suggesting stabilizing selection (and conserved expression) (binomial test P < 10−215) (Fig. 3b), consistent with previous reports based on direct measurements of gene expression39. Genes with high ECCs were enriched in highly-conserved core cellular processes (e.g., RNA and protein metabolism) (Fig. 3b, Supplementary Table 2), and those with low ECCs were most enriched in processes related to carboxylic acid and alcohol metabolism (Fig. 3b, Supplementary Table 2), potentially reflecting adaptation of fermentation genes to the diverse environments of these isolates37.
The ECC discovers convergent evolution
A striking example of predicted positive selection is the promoter of CDC36 (NOT2; ECC= −2.138, Fig. 3b), which has common natural alleles with either low or high (predicted) expression across the isolates (Fig. 3c). Analysis of CDC36 promoter sequences (Methods) suggests that low-expression evolved at least twice independently, resulting in two distinct variants with reduced expression (Fig. 3c, allele 1 and 2). Interrogation with the biochemical model26 to identify factors impacting these expression differences (Extended Data Fig. 4a) suggested that both low-expression alleles are explained by disruption of the same binding site for Upc2p, an ergosterol sensing TF (Fig. 3c). To validate this, we restored the putative Upc2p binding site in a strain (WE), where it is otherwise disrupted, and measured expression levels by qPCR and growth upon changing carbon source (Methods). Restoration of the Upc2p binding site increased actual expression, confirming the model’s prediction (Pearson’s r=0.96, p=0.039, n=4; Fig. 3d). We hypothesized that these variants could alter the rate of transcriptional reprogramming when changing environments via Cdc36p-regulated mRNA turnover40. Indeed, restoration of the Upc2 binding site reduced the strains’ lag time to growth when switching carbon sources (Fig. 3d, right; Methods), and they grew to a higher culture density (Supplementary Fig. 10). Thus, convergent evolution of the CDC36 promoter, discovered using the ECC, independently produced two alleles that result in similar perturbations to TF binding, expression, and growth.
ECC vs. cross-species RNAseq and fitness
ECC values were consistent with expression conservation as measured for yeast orthologs across clades at short (Saccharomyces), medium (Ascomycota), or long (mammals) evolutionary scales (Extended Data Fig. 4b). In Saccharomyces, 1:1 orthologs with conserved expression levels across species (as measured by RNA-seq41) had significantly higher ECC (computed from the 1,011 yeast isolates) than genes whose expression was not conserved (two-sided Wilcoxon rank-sum P = 3.1*10−4, Extended Data Fig. 4b, bottom left, Methods). Next, we performed RNA-seq across 11 Ascomycota yeast species (Methods), finding that 1:1 orthologs with conserved expression across Ascomycota had significantly higher ECC values (Extended Data Fig. 4b, bottom center, P = 1.16*10−6). Finally, the 1:1 orthologs of genes with high ECC values in the 1,011 S. cerevisiae isolates also had more conserved expression within mammals42 (Extended Data Fig. 4b, bottom right, P = 1.07*10−4, Methods). Thus, while 1:1 yeast-mammal orthologs are likely critical to an organism’s fitness, only a subset of these may be under stabilizing selection on expression, and this subset tends to be under such selection in both yeasts and mammals. Thus, the ECC quantifies stabilizing selection on expression in yeast and may predict stabilizing selection on orthologs’ expression in other species.
Genes with higher ECCs also had a stronger effect on fitness in S. cerevisiae upon changing their expression level. We interrogated the total variation of previously measured expression-to-fitness curves11 to calculate a ‘fitness responsivity’ score that captures the dependence of fitness on expression (Extended Data Fig. 5, Methods). Fitness responsivity was significantly positively correlated with the ECC (Supplementary Fig. 2e, P = 0.003, Spearman ρ = 0.326). Fitness responsivity was not associated with regulatory sequence divergence per se across the promoter sequence (as estimated by mean Hamming distance among orthologous promoters, Methods, Supplementary Fig. 2d, P = 0.46, Spearman ρ = 0.083). Thus, while stabilizing selection on gene expression (as captured by the ECC) can shape the types of mutations that accumulate in the population, it may have little effect on the overall rate at which mutations accumulate in promoter regions within populations, which has been previously used to test for evidence of selection.
Stabilizing selection shapes robustness
While a gene’s ECC (computed from the natural genetic variation in regulatory DNA) represents the imprint of its evolutionary history, its mutational robustness (assessed directly from the gene’s promoter sequence) should describe how future mutations would affect its expression43. Across all native yeast promoters, the magnitude of expression changes predicted by the transformer model due to single base-pair mutations follows a power law with an exponent of 2.252 (standard error of fit σ = +/− 0.002, P = 2.4*10−263), such that a small number of mutations have an outsized effect on expression (~10% of mutations account for ~50% of the changes in expression, Extended Data Fig. 4d). In individual genes, the distribution can vary substantially (below).
For a given promoter sequence, we defined the mutational robustness of a sequence length L, as the percent of its 3L single nucleotide mutational neighbors predicted by the transformer model to result in a negligible change in expression (Extended Data Fig. 4c, Methods), following previous definitions of mutational robustness25,43. The mutational robustness of a gene’s promoter sequence was positively correlated with the gene’s fitness responsivity (Supplementary Fig. 2f, Spearman ρ = 0.476, P = 8.18*10−6), suggesting that fitness-responsive genes have evolved more mutationally robust regulatory sequences. Mutational robustness, which, unlike the ECC, is computed for single sequences without a set of variants across a population, was also correlated to the ECC (Supplementary Fig. 2g, Spearman ρ = 0.515, P = 9.99*10−7). Similarly, the promoter sequences of yeast genes with conserved expression across Saccharomyces strains41, Ascomycota species, or mammals42 had higher mutational robustness (P = 8.4*10−3, 6.5*10−5, and 0.00377, respectively, two-sided Wilcoxon rank-sum test).
Thus, genes whose expression levels are under stabilizing selection have regulatory sequences that tend to be more robust to the impact of mutations, which may reflect their history and constrain their future.
Fitness landscapes in evolvability space
Mutational robustness enables the exploration of novel genotypes that could subsequently facilitate adaptation and thus promote evolvability, the ability of a system to generate heritable phenotypic variation25. To characterize regulatory evolvability, we extended our description of mutational robustness by representing each sequence using a sorted vector of expression changes (predicted by the transformer model) that are accessible through single nucleotide mutations (Fig. 4a, left, Methods). This ‘evolvability vector’ captures the capacity for changes in genotype to alter expression phenotype, in line with previous definitions of evolvability25.
We next asked whether regulatory evolvability vectors fell into distinct classes by identifying evolvability ‘archetypes’. Archetypes44 represent the extremes of canonical patterns, such that the evolvability vector of each individual sequence can be represented by its similarity to each of several archetypes representing these extremes. Applying this paradigm, we used our transformer model to compute evolvability vectors for a new random sample of a million sequences and then learned a two-dimensional representation of these evolvability vectors (referred to as the ‘evolvability space’) using an autoencoder45 (Fig. 4a, right, Methods). This archetypal evolvability space, that is bounded by a simplex whose vertices represent evolvability archetypes (Fig. 4a, right, Methods) and where the evolvability vector of each sequence is a single point, allows us to effectively visualize arbitrarily large sequence spaces in two dimensions.
Three archetypes captured most of the variation in evolvability vectors (Extended Data Fig. 6a,b; Methods), corresponding to local expression minimum (Aminima), local expression maximum (Amaxima), and malleable expression (Amalleable) (Fig. 4b). Aminima and Amaxima correspond to sequences where most 3L mutational neighbors do not change expression, and the ones that do, increase it (for Aminima) or decrease it (for Amaxima). Conversely, for Amalleable sequences, most 3L mutational neighbors change expression and are equally likely to decrease or increase it (Fig. 4b). In addition to these three archetypes, mutationally robust sequences were present as a central cleft in the evolvability space (Fig. 4b,c; “robust”). The evolvability space also distinguishes native regulatory sequences by their associated expression level (Fig. 4d), with intermediate expression more likely to be near the malleable archetype (Amalleable) and depleted near the robustness cleft (Fig. 4d, Supplementary Information).
The location of sequences in evolvability space reflects the selective pressures operating on the sequence. Sequences under strong stabilizing selection on gene expression tend to be located far away from the malleable archetype: there is a strong negative correlation between malleable archetype proximity and mutational robustness (Extended Data Fig. 6c,e; Spearman’s ρ = −0.746, P = 1.97*10−15), the ECC (Extended Data Fig. 6d,f,g; ρ = −0.596, P = 5.4*10−9), fitness responsivity (Extended Data Fig. 6h; ρ = −0.413, P = 1.4*10−4), and expression conservation across species as measured by RNA-seq (Saccharomyces: P = 0.000251, Ascomycota: P = 0.00002, Mammals: P = 0.00114; two-sided Wilcoxon rank-sum test).
To visualize promoter fitness landscapes in two dimensions we combined our sequence-to-expression transformer model with previously measured expression-to-fitness curves11, and integrated them with the two-dimensional archetypal evolvability space (Fig. 4e, Extended Data Fig. 7, Methods). Unlike prior visualizations of fitness landscapes, which group sequences by their sequence similarity, here, sequences are arranged by the similarity in their evolvability. This approach effectively visualizes arbitrarily large sequence spaces in two-dimensions, as well as groups sequences by their evolutionary properties. This addresses the challenges otherwise posed by sequence similarity-based landscapes since highly similar regulatory sequences can have different functional properties (e.g., due to a loss of a TF binding site), while very different sequences can be functionally similar (e.g., due to shared TF binding sites). When organismal fitness is available for a particular gene and overlaid on the landscape (Fig. 4e, Extended Data Fig. 7), the resulting patterns depend on both the condition-specific sequence-to-expression function (e.g., governing color (fitness) through predicted expression, and embedded position, through evolvability) and the gene- and condition-specific expression-to-fitness functions.
Finally, we studied how natural yeast sequences explored evolutionary space, by placing the evolvability vectors of each of set of orthologous promoters of the 1,011 sequenced S. cerevisiae isolates37 in the archetypal evolvability space. When a gene’s promoter from one strain is near the malleable archetype, its orthologs in the other strains tended to broadly distribute in the evolvability space (Extended Data Fig. 6i), but avoid the robustness cleft (e.g., the DBP7 promoter from strain S288C; Fig. 4f). Conversely, when a promoter is near the robustness cleft (e.g., the UTH1 promoter from S288C), so are its orthologs (Fig. 4g, Extended Data Fig. 6i). Using in silico mutagenesis to interpret our model, we found that the DBP7 promoter is particularly malleable partly as a result of an intermediate affinity Rap1p binding site, where the most impactful mutations increased or decreased the Rap1p affinity for this site, impacting expression (Extended Data Fig. 8a). By contrast, the UTH1 promoter requires many sequential mutations, each of which has minimal impact individually, to reduce expression appreciably (Extended Data Fig. 8b). This could reflect the ways in which stabilizing selection constrains evolvability: promoters that are not under strong stabilizing selection explore expression space more freely and can quickly adapt to a new expression optimum, since the population likely already contains multiple alleles that achieve diverse expression levels (e.g. Fig. 4f). Interestingly, many of the native sequences in S. cerevisiae are near the robustness cleft (Fig. 4h).
Thus, the evolvability vector, which can be computed using our model directly for any sequence (without any population genetics data), encodes information about a sequence’s evolutionary history and potential futures.
DISCUSSION
Here, we presented a framework that addresses fundamental questions in the evolution and evolvability of regulatory sequences2,25. Our models, developed using a combination of large scale random sequence libraries, sensitive reporter assays and deep learning (Methods), are useful as “oracles” for model-guided biological sequence design19, and answering important questions in the study of fitness landscapes4–6, evolutionary malleability of expression and its variation across strains and species2, mutational robustness43, and evolvability25. The framework presented here will help advance synthetic biology, cell and gene therapy, and metabolic engineering in addition to the study of evolution.
Previous studies suggested that evolution favors more complex regulatory solutions46, but we showed that if stabilizing selection acts only on expression, regulatory complexity extremes gradually move towards the moderate complexity levels observed in native and random sequences (Fig. 2d). This supports a model where most extant regulatory sequences evolved by sampling constraint-satisfying solutions in proportion to their frequency in the sequence space, without specific consideration of the solution’s complexity.
In our study, evolving condition-specificity in a promoter sequence was much slower than simply modifying the expression level. Some yeast genes achieve condition-specificity by including multiple binding sites for condition-responsive TFs. For instance, the GAL1–10 Upstream Activating Sequence contains multiple binding sites for the galactose-responsive Gal4, which are conserved across millions of years, suggesting an ancient origin47. Because the size of the regulatory region restricts the number of TF binding site locations, including more TFs and more regulatory sequences per gene (e.g. enhancers) may be required for more complex regulatory programs observed in higher eukaryotes48.
The dN/dS ratio has been used extensively to characterize the evolutionary rates of protein coding genes34, and we developed an analogous2,35 coefficient, the ECC, for detecting evidence of selection on expression from natural variation across multiple orthologous regulatory sequences in strains of one species. The ECC complements and extends existing measures of expression conservation, since it integrates across the regulatory sequence and is not limited to specific TFs or binding motifs, does not require additional experiments to test the functions of mutations for each regulatory region, and does not rely on detecting non-uniformity in mutation distributions.
Complementing the ECC, mutational robustness as calculated with our model is predictive of selective pressures on individual sequences (Supplementary Fig. 2f-g). While we find that strong constraint on the function of regulatory sequences can shape them to be robust to future mutations, it is unlikely that robustness itself is the selected trait, since increased robustness to future mutations is likely to be of little marginal benefit43. Instead, this may reflect a secondary benefit of having evolved decreased expression noise49,50, or another as-yet-unknown mechanism. It may also reflect the fact that the sequences of some ancestral promoters may be similar to the mutational neighbors of extant sequences, and, if selective constraints on expression have remained stable, these ancestral and extant sequences likely have similar expression levels.
Based on our model-derived evolvability vectors, sequences spanned an evolvability spectrum from robust to malleable (Fig. 4c-d,f-h), and for native regulatory sequences, the magnitudes of accessible mutation effects follows a power law. Evolvability vectors also help visualize fitness landscapes4 (Fig. 4e, Extended Data Fig. 7) and future work can further improve our understanding of their topography4,5.
Our sequence-to-expression models are currently limited by regulatory region and species. For example, sequence mutations that affect other regulatory mechanisms (e.g., genomic context, mRNA processing and degradation, regulation by RNA-binding proteins, translational efficiency) can compensate for those that affect transcription. While our models emulated the biological process of our experimental system, as demonstrated by their excellent predictive power, future interpretability studies will shed further light on molecular mechanisms. Finally, for multicellular organisms, selection acts simultaneously on expression levels in many different cell types and environments. As models of gene regulation are created for other species, environments, and regulatory regions, our framework will help provide further insights into regulatory evolution.
METHODS
Experimental measurement of sequence-expression pairs using a Sort-seq strategy
We experimentally measured expression using a Sort-seq2,3,51–59 strategy called the Gigantic Parallel Reporter Assay (GPRA) we previously described26 (Supplementary Fig. 1). Briefly, for each set of expression measurements mentioned, random or designed single stranded oligonucleotides were ordered from IDT (random; Supplementary Table 3) or Twist Biosciences (designed; sequences on GEO; accession GSE163045), cloned into the promoter of a Yellow Fluorescent Protein (YFP) gene within a CEN plasmid (Addgene: 127546) as previously described26 and transformed into yeast (strain Y8205 for the training dataset of random sequences, and strain S288C::ura3 for all the rest of the sequences measured). The library is maintained in yeast as an episomal low copy number plasmid. It was previously reported that the expression measurements are highly correlated with expression levels as measured using integrated reporters (R2=0.97)54. Yeast were grown in continuous log phase, diluting as necessary to maintain an OD between 0.05 and 0.6 for 8–10 generations up until the time of harvest. Cells were harvested, washed once in ice cold PBS, and kept on ice in PBS until sorting. Cells were sorted into 18 uniformly-sized expression bins covering the majority of the expression distribution. Post sort, cells were re-grown in SD-Ura until saturation, plasmids isolated, and sequencing libraries created sequenced with a 150 cycle NextSeq kit. For libraries with random 80 bp sequences, sequences were consolidated as previously described26. Reads from other (defined, non-random) libraries were aligned to the pre-defined sequences using Bowtie260, including only reads that perfectly matched a designed sequence. For each sequence, the expression level was the average of the expression bins in which it was observed, weighted by the number of times it was observed in each bin. These expression measurements were carried out separately in defined media lacking uracil (SD-Ura (Sunrise Science, #1703–500)) and complex media (YPD: yeast extract, peptone, dextrose).
Architecture of the convolutional model
A deep neural network model20,21,23,61–69 with convolutional layers was constructed and used for designing sequences with high and low expression (Fig. 1c), and running evolutionary simulations under stabilizing selection, genetic drift, and SSWM (Fig. 2) for each condition. These designed sequences, whose expression was experimentally quantified (e.g. Fig. 1c and 2d,g), were designed using models with the following architecture:
Input.
The input is the DNA sequence (s) represented in one-hot encoding. Input Shape: (1, 110, 4)
Convolution Block
- For the forward and reverse strand, separately,
- Strand-specific convolution layer 1. Kernel Shape: (1, 30, 4, 256)
- Strand-specific convolution layer 2. Kernel Shape: (30, 1, 256, 256)
Concatenation of features from the forward and reverse strand
Convolution layer 3. Kernel Shape: (30, 1, 512, 256)
Convolution layer 4. Kernel Shape: (30, 1, 256, 256)
A bias term and a ReLU activation was added to each convolution layer in this block.
Fully Connected Layers
Fully connected layer 1. Kernel Shape: (110*256, 256).
Fully connected layer 2. Kernel Shape: (256, 256)
A bias term and a ReLU activation were added to each layer in this block.
Output.
Linear combination of the 256 features extracted as a result of all the previous operations on the sequence (s) to generate the predicted expression (e).
Every fully connected layer was L2 regularized with a 0.0001 weight and had a dropout probability of 0.2.
Training of the convolutional model
For training, 20,616,659 random sequences for the defined medium and 30,722,376 random sequences for the complex medium (each to train a separate model) were used, along with their experimentally measured expression as described above. A mini-batch size of 1,024 was used for training and a mean squared error loss was optimized using the Adam optimizer70 with an initial learning rate of 0.0005. The model was trained for 5 epochs. Model architecture was written in TensorFlow71 1.14 using Python 3.6.7. The convolutional model used TensorFlow graphs and sessions in its implementation and was thus incompatible with the Tensor Processing Units (TPUs)72. These convolutional models (for both media) were used for all the predictions in Fig. 1 and 2 and Extended Data Fig. 1, 2.
The models were tested by predicting expression on sequences that the model had never seen before (Supplementary Fig. 21) that were measured in separate experiments, where the library was lower complexity (fewer sequences) than the experiments that generated the training data, such that the expression associated with each sequence was measured with high accuracy (~100 yeast cells per sequence on average). The test libraries included random, native (i.e. present in the yeast genome), and designed sequences.
Training and evaluation were carried out on 4 Tesla M60 GPUs. All code for training and using the convolutional model is available here: https://github.com/1edv/evolution/tree/master/manuscript_code/model/gpu_only_model.
Architecture of the transformer model
A transformer model23,32,73 was developed to run inference faster than the convolutional model, as needed for the evolutionary analyses in Fig. 3 and 4. The transformer model had ~20x fewer parameters (~1.3 million, compared to the ~24 million parameters of the convolutional model) and was able to leverage Tensor Processing Units (TPUs) for computation. Transformer models are used in all the analyses in Fig. 3 and 4 and Extended Data Fig. 3-4, and 6-8. Benchmarking analyses and ablation analyses for the transformer model are available in the Supplementary Information.
The deep transformer model has the following architecture (Supplementary Fig. 12):
Input.
The input is the DNA sequence (s) represented in one-hot encoding. Input Shape: (110, 4)
Convolution Block.
The convolution block is constructed in the following order (Supplementary Fig. 12b):
Revere Complement Aware 1D Convolution. The forward and reverse strand are operated on separately with a convolutional kernel to generate strand specific sequence-environment interaction features. Kernel Shape: (30, 4, 256).
Batch Normalization
Rectified Linear Unit (ReLU)
Concatenation of Features from the forward and reverse strand
2D Convolution: Convolve over the combined features from both the strands to capture interactions between strands. Kernel Shape: (2, 30, 4, 256)
Batch Normalization
ReLU
1D Convolution. Kernel Shape: (30, 64, 64)
Batch Normalization
ReLU
Transformer Encoder Blocks.
Two transformer encoder blocks32,74,75 are constructed in the following order (Supplementary Fig. 12c):
Multi-Head Attention: 8 heads, capturing relations between features from different positions of (s) to compute a representation for the features extracted from the convolution block from (s).
Residual Connection
Layer Normalization
Feed Forward Layer with 8 units
Residual connection
Layer Normalization
Bidirectional LSTM layer.
A bidirectional LSTM layer to capture the long-range interactions between different regions of the sequence with 8 units and 0.05 dropout probability.
Fully Connected Layers (Supplementary Fig. 12d).
Two Fully connected layers with 64 Hidden Units, each consisting of ReLU and Dropout (0.05 dropout probability).
Output.
Linear Combination of 64 features extracted as a result of all the previous operations on the sequence (s) to generate the predicted expression (e).
Training of the transformer model
20,616,659 random sequences (defined medium) and 30,722,376 random sequences (complex medium), along with their experimentally measured expression, were used to train separate models for each media. Model architecture was written in TensorFlow71 1.14 using Python 3.6.7 with multiple open source libraries (citations, where relevant, are included in code for them). A mini-batch size of 1,024 was used for training and a mean squared error loss was optimized using a RMSProp optimizer76 with a learning rate of 0.001. The stopping criterion monitored was the ‘r-squared’ value and the model was allowed to train for 10 epochs without improvement before stopping training. Training was carried out on a Google Cloud Tensor Processing Unit (TPU)72 v3–8. Evaluation was carried out on 4 Tesla M60 GPUs. The model architecture visualization was generated using Netron 4.5.1. All processed data and models are publicly available on Zenodo at https://zenodo.org/record/4436477 and all code is available on GitHub at https://github.com/1edv/evolution/tree/master/manuscript_code/model/tpu_model. Transformer models are used in all the analyses in Fig. 3 and 4 and Extended Data Fig. 3-4, and 6-8.
The models were tested by predicting expression on test sequences that the model had never seen before (Supplementary Fig. 21) that were measured in separate experiments, which included random, native (i.e. present in the yeast genome), and designed sequences. To obtain expression measurements for each tested sequence that are more accurate than those from the high-complexity training data experiment, library complexity was limited such that each test promoter sequence is observed in ~100 yeast cells (Methods, Supplementary Information).
Gene expression engineering using a genetic algorithm for sequence design
To design77–80 new sequences with desired expression, a genetic algorithm (GA) was implemented with the distributed evolutionary algorithms in python (DEAP) package81. The mutation probability and the two-point crossover probability were set to 0.1 and the selection tournament size was 3. The initial population size was 100,000 and the GA was run for 10 generations. The convolutional model was used as the basis for the objective function for GA, which was maximized for high expression and minimized for low expression (maximizing negative predicted expression). The top 500 sequences were synthesized (by Twist Biosciences) and expression was measured experimentally using our reporter assay, as described above.
Characterizing random genetic drift
Simulation of random genetic drift (Fig. 2a) was initialized with a set of 5,720 random sequences, in generation 0. For each sequence in this starting set, a new single sequence was randomly picked from its 3L mutational neighborhood (the set of all sequences at a Hamming distance of 1 from a sequence of length L) and the difference in expression between the new sequence and the starting sequence was calculated using the convolutional model (Fig. 2b). This was done for each starting sequence to get generation 1. Each subsequent generation n was produced by picking a single sequence randomly from the 3L mutational neighborhood of each sequence in the preceding generation n-1. The simulation was carried out for 40 generations. Simulations were also subsequently repeated with the transformer model (Extended Data Fig. 3f), yielding concordant results.
For experimental validation, 1,000 random starting sequences were synthesized, introducing between one to three random mutations to these sequences. The expression levels of starting and mutated sequences were measured in both complex and defined media experimentally using our reporter assay. For 990 of these 1,000 starting sequences, experimental measurements were available for all three mutational distances. Additionally, 20 (median) separate single mutations were introduced to each of 196 native sequences, the sequences were synthesized, and their associated expression was measured similarly for both of these media; these were also included in the boxes for one mutational step in Fig. 2c and Extended Data Fig. 1e.
Characterizing the regulatory complexity of a sequence
To estimate the regulatory complexity82,83 of a sequence, the Gini coefficient of the regulatory interaction strengths for each TF was calculated. A new biochemical model was first trained with our defined media data to complement the existing one trained on complex media, using our published model architecture of TF binding and position-aware activity26 and the training procedure previously described26 (Supplementary Notes). The regulatory interaction strength was then individually calculated for each regulator by setting the concentration parameter for that TF (individually) to 0 in the learned model, and the biochemical model was used to quantify the resulting change in expression, as previously described26. The resulting vector of interaction strengths was used to calculate a Gini coefficient for each sequence, separately for the complex and defined media models. The Gini coefficient is a measure of inequality of continuous values within a population, most commonly applied to wealth or income, and ranges from 0 (all members of the population have equal wealth) to 1 (the wealth of a population is held by a single individual). Regulatory complexity for a sequence is then 1-Gini, such that 1 indicates that all TFs contribute equally to the regulation of the gene and 0 indicates that a single TF is solely responsible for its regulation. As starting points for our trajectories, 200 native promoter sequences (from −160 to −80, relative to the TSS) were chosen with relatively high regulatory complexity and another 200 were chosen with relatively low regulatory complexity, spanning the range of predicted expression levels, as starting points for our trajectories.
Trajectories for stabilizing selection on gene expression were designed using the convolutional model (Fig. 2d). Here, all sequences were required to maintain a predicted expression level within 0.5 of the original expression levels at all steps along the trajectory. There was no explicit constraint on regulatory complexity in this simulation of stabilizing selection. In order to ensure that expression was unchanged, expression levels were measured experimentally for sequences along a trajectory at growing mutational steps from the initial sequence (2, 4, 8, 16, 32 mutations). Any trajectories for which an expression measurement was missing for any experimentally tested sequence were excluded from all analyses, retaining 172 trajectories with initial low regulatory complexity and 192 trajectories with initial high regulatory complexity. Testing whether observed trends in regulatory complexity were affected by the degree to which expression was either predicted (by the convolutional model for 1–32 mutations) or observed (by the experiment at 2, 4, 8, 16, or 32 mutations) to be conserved, showed that the trends were robust to the degree of expression conservation (Supplementary Fig. 11).
Characterizing directional trajectories under SSWM
Simulations of trajectories under a Strong Selection-Weak Mutation (SSWM)84–86 regime were initialized with a set of native yeast promoter sequences (defined here as the subset from −160 to −80 relative to the TSS for all the genes in the yeast reference genome for which we had a good TSS estimate (Supplementary Table 3 in 26) as the starting generation 0. For each sequence in generation n, the sequence from its 3L mutational neighborhood that had the maximal (or separately, minimal) predicted expression using the convolutional model was picked to get generation n+1. The simulation was carried out for 10 rounds separately in the complex (Fig. 2f) and defined (Extended Data Fig. 1f) media. The simulations were subsequently repeated using the transformer model (Extended Data Fig. 3i-j).
For experimental validation, a subset of sequences from several generations were synthesized along mutational trajectories simulated by the convolutional model for complex media (10,322 sequences from 877 trajectories, 805 of which had every sequence along the trajectory successfully measured) and one for defined media (6,304 sequences from 637 trajectories, 591 of which had every sequence along the trajectory successfully measured) and their expression was measured in the corresponding media experimentally using our reporter assay (Fig. 2g, Extended Data Fig. 1g).
Measuring the URA3 expression-to-fitness relationship
Two complementary environments were studied with opposite selective pressures on the expression of URA3 (encoding an enzyme responsible for uracil synthesis): defined media, where organismal fitness increases with gene expression (up to saturation) and complex media + 5-FOA, where fitness decreases with Ura3p expression.
Convolutional models trained on defined and complex media were used to choose a set of 11 sequences that span a broad range of predicted expression levels in the two media when cloned into a YFP expression vector26. The relationship between expression of URA3 and organismal fitness in yeast was estimated from experimental measurements with these 11 sequences, by cloning promoter sequence in front of YFP to measure expression level and in front of URA3 to measure fitness. Unless otherwise noted, yeast were grown at 30°C, in an orbital shaker incubator at 225 RPM. Each vector was transformed into yeast (S288C::ura3), and three independent transformants were selected per vector to serve as biological replicates. For measuring expression, yeast were grown overnight in either YPD+NAT (yeast extract, peptone, dextrose, with 75µg/ml nourseothricin) or SD-Ura (synthetic defined media, lacking uracil; Sunrise Science 1703–500), and then re-inoculated in the morning and allowed to grow for 6 hours prior to measuring expression by flow cytometry for each replicate as the log ratio of YFP to the constant background RFP, including only cells obtaining the top 50% of RFP expression. Fitness was obtained by measuring the growth rate of each yeast strain in either SD-Ura or YPD+NAT+5-FOA (0.25 mg/ml 5-FOA). Yeast were grown continuously in triplicate in log phase, with linear shaking at 30°C in a Synergy H1 plate reader (Biotek), by diluting each well to maintain OD<0.7, with OD measured at 15 minute intervals. Growth rate was defined for each replicate as the median of the instantaneous smoothed growth rates over 5 measurements in log phase, considering only time points where 0.05<OD<0.5. Each promoter’s expression and growth rate were summarized as the mean of the three replicates.
Characterizing trajectories under conflicting expression objectives in different environments
Simulations of sequence evolution in two complementary environments with opposite selective pressures (defined media and complex media) were initialized with a set of native yeast promoter sequences (present at −160 to −80 relative to the TSS) as the starting generation 0, with the objective function defined as the difference in predicted expression between defined and complex media (Fig. 2h, Extended Data Fig. 2d-g) using convolutional models trained in the respective media. The difference in expression between the two conditions was maximized at each iteration, which assumes that the cells are exposed to both environments before the mutations can reach fixation, an example of evolution in rapidly fluctuating environments31. For simplicity, it is assumed that fitness is directly proportional to higher expression in one condition and to lower expression in the other, such that mutations will be considered favorable even if they decrease fitness in one condition so long as they increase it in the other condition by a greater amount.
One simulation aimed to maximize the expression difference (defined minus complex), and the other to minimize it (maximizing complex minus defined). For each sequence in generation n, the sequence from its 3L mutational neighborhood that had the maximum (or separately, minimum) value for the objective function based on the convolutional model prediction is picked for generation n+1, to a total of 10 generations. The simulations were subsequently repeated using the transformer model yielding similar results (Supplementary Fig. 17b-f).
Motifs that were enriched in the sequences of generation 10 compared to the starting sequences were identified de novo using DREME87, and each of the top 5 consensus motifs were used as queries to search the YeTFaSCo database88, reporting the closest match, or one of multiple similar matches.
Finding orthologous promoters in the 1,011 S. cerevisiae genomes dataset
To identify orthologs of S288C promoters in the whole genome sequences of the 1,011 yeast strains37, BLAT89 was used to identify regions of ≥80% identity with each −160 to −80 region (relative to the TSS) annotated in the reference S288C genome sequence (R64)90. Any strains with more than one such match, where the match contained insertions or deletions, or had incomplete matches, were excluded on a gene-by-gene basis. Genes with more than 1.2 matches with ≥80% identity per genome, on average, were excluded altogether.
Computing the expression conservation coefficient (ECC)
To calculate the ECC (a regulatory analog2,35,91,92 of dN/dS34,93,94), for each yeast gene promoter, the transformer model was used to predict an expression value for each orthologous promoter in the 1,011 yeast genomes (above), defining an expression distribution with a standard deviation σB. Next, a set of sequences with random mutations was generated from each gene’s consensus promoter sequence (defined as the most abundant base at each position across the strains), such that the number of sequences at each Hamming distance from the consensus promoter sequence was the same for the natural and simulated sets. Here, mutations introduced to create random variation sampled each base with equal probability; using observed mutation rates yielded similar results (Supplementary Information). The same transformer model to predict the expression of the simulated sequences, and calculate its standard deviation σC. The nominal ECC is log(σC/σB). Because the variance on simulated sequences is better estimated than in natural orthologs (whose sequences may be more constrained), a constant correction factor is subtracted, calculated by creating a second simulated set of randomly mutated sequences whose diversity is limited to the same extent as in the natural set, by creating only one random mutation for every unique sequence in the set of native orthologs. Finally, the expression for this second set of sequences is predicted by the transformer model, and its standard deviation (σC’) is used to calculate a null ECC for each gene (log(σC/σC’)); the median of these null ECCs over all the genes is used as the constant correction factor . (An extensive description of the correction factor is provided in the “ECC calculation details and considerations” section of the Supplementary Information.)
The corrected ECC for gene g is then:
The computed ECC values for all yeast genes, available in Supplementary Table 1, were used to identify cases or presumed stabilizing selection (selection favoring a fixed non-extreme value of a trait), diversifying (disruptive) selection (selection favoring more than one extreme values of a trait; as opposed to a single fixed intermediate value), and directional (positive) selection (selection favoring a single extreme value of a trait over all other possible values of the trait). Re-computing the ECC values for all yeast genes using the S288C reference sequences instead of the consensus sequence for the promoters of each gene yielded very similar results.
In addition to each ECC value, a Z-score and p-values for the confidence that the observed ECC values differ from neutrality were also calculated. For each gene’s true ECC, a set of matched random ECC values were calculated, where the denominator is a set of sequences matched for Hamming distance distribution and the total number of unique sequences. The null ECC mean and standard deviation were calculated from 1,111 such simulations, and used to calculate a Z-score for how extreme the actual ECC would be under this null distribution. This Z-score acts as a signed p-value (negative representing divergent expression and positive representing conservation), from which p-values (using the ‘scipy.stats.norm.sf’ function on the absolute value of the Z-score in Scipy95 and multiplying the function’s output by 2 to get a two-sided p-value) (Supplementary Table 1).
Inferring expression conservation across Saccharomyces species using RNA-seq data and comparing with ECC values
Published RPKM values for orthologs of S. cerevisiae genes in closely related Saccharomyces species41 were obtained from the Gene Expression Omnibus (GEO) (accession GSE83120). Only genes for which expression was quantified in all species were used in subsequent analysis. RPKM values were log2 scaled after adding a pseudo count of 2, and the variance in expression of each gene across the species was calculated. Genes were ranked by their gene expression variance, and the 2% of genes with the lowest variance were considered as having conserved gene expression levels (‘expression conserved’), while the 2% with the highest variance were considered ‘expression not-conserved’. The significance of the differences was robust to the choice of these thresholds (Supplementary Information). To compare to ECC values, the p-value of a two-sided Wilcoxon rank-sum test was estimated comparing the ECC values for genes in the ‘expression conserved’ and ‘expression not-conserved’ categories (implemented using the scipy.stats.ranksums SciPy95 function). To control for the dependence between expression mean and variance, the analysis was repeated using the coefficient of variation (P = 1.05*10−4) and the coefficient of dispersion (P = 2.42*10−4) instead of variance, yielding similar results.
Experimental protocol for RNA-seq measurements from 11 Ascomycota species
RNA-seq was performed on samples from the following 11 Ascomycota yeast species: Saccharomyces cerevisiae, Saccharomyces bayanus, Naumovozyma (Saccharomyces) castellii, Candida glabrata, Kluyveromyces lactis, Kluyveromyces waltii, Candida albicans, Yarrowia lipolytica, Schizosaccharomyces japonicus, Schizosaccharomyces octosporus, and Schizosaccharomyces pombe. Each of the 11 species was grown in BMW medium, chosen to minimize cross-species growth differences, as previously described96. N. castellii was grown at 25℃ while the other species were grown at 30℃. RNeasy Midi or Mini Kits (Qiagen, Valencia, CA) were used to isolate total RNA from log-phase cells by mechanical lysis using the manufacturer instructions as previously described96. dUTP strand-specific RNA-seq libraries were constructed as previously described97 with the following modifications. (1) The polyA+-selected RNA was fragmented in a 40µl reaction containing 1x Fragmentation Buffer (Affymetrix) by heating at 80℃ for 4 minutes followed by cleanup via ethanol precipitation for all libraries (except Y. lipolytica, S. pombe, S. japonicus, and S. octosporus; for these species, the conditions described previously were used97), followed by cleanup via 1.8x RNAClean XP beads (Beckman Coulter Genomics). (2) For C. glabrata, K. lactis, S. bayanus, S. pombe, S. japonicus, and S. octosporus libraries, the adapter ligation was performed overnight at 16℃. For the rest, this was done at 16℃ for 2 hours as described previously97. (3) Normalization was carried out based on the cDNA input and pooling of selected Illumina barcoded-adaptor-ligated cDNA products followed by gel size selection occurred as follows: range of 275 to 575 bp for pooled C. albicans, K. waltii, and N. castellii libraries, and 375 to 575 bp for C. glabrata, K. lactis, and S. bayanus libraries. For the other libraries, no pooling was performed before gel size-selection – range of 310 to 510 bp for Y. lipolytica and 350 to 550 bp for S. pombe, S. japonicus, and S. octosporus. (4) The final PCR product was purified by 1.8x AMPure XP beads (Beckman Coulter Genomics) followed by a second gel size-selection for the range of 300 to 575 bp for C. albicans, K. waltii, and S. castellii libraries, but no second gel size-selection was performed for the other libraries. The pooled final library was sequenced on one to four lanes of HiSeq2000 (Illumina) with 68 base (Y. lipolytica had 76 base) paired-end reads and 8 base index reads.
Transcript assembly, mapping and expression calculation for the 11 Ascomycota species RNA-seq
For each of the 11 Ascomycota yeast species above, reads were assembled using Trinity98(version ‘trinityrnaseq_r2012–05-18’) and the assembled transcripts were mapped onto the assemblies to the respective genomes using GMAP99. The Jaccard coefficient was used to join adjacent assemblies given enough connecting reads (using the Trinity default of 0.35 for the Jaccard cutoff). Finally, upon mapping all assembled transcripts, the Jaccard coefficient was used to clip assemblies which did not have enough support over a certain region. For each of the species, assembled transcripts were mapped to the genome sequence100 using BLAT89. Estimated expression values were calculated for each transcript using RSEM101 (defined in RSEM as the estimate of the number of fragments that are derived from a given isoform or gene, or the expectation of the number of alignable and unfiltered fragments that are derived from an isoform or gene given the maximum likelihood abundances). Only reads mapping to the sense mRNA strand were considered. Orthology between genes in different species was used as previously described100.
Inferring expression conservation across Ascomycota species using our RNA-seq data and comparing with ECC values
Estimated expression values from the 11 Ascomycota species RNA-seq data were used after removing all genes with NA values in expression for more than three species. Estimated expression values were log2 scaled after adding a pseudo count of 1, and the variance in expression for each gene across the species was calculated. Genes were ordered by their variance in expression across the reported fungal species. Here, the 10% of genes with the lowest expression variance were considered to have ‘conserved’ expression, and the 10% with highest expression variance were considered to have expression ‘not conserved’. To compare to ECC values, the p-value of a two-sided Wilcoxon rank-sum test was estimated comparing the ECC values for genes in the ‘conserved’ and ‘not conserved’ categories (implemented using the scipy.stats.ranksums SciPy95 function). Similar results were obtained when repeating the analysis using the coefficient of variation (P = 4.22*10−5) and the coefficient of dispersion (P = 8.05*10−5) instead of variance.
Inferring expression conservation across mammalian species using RNA-seq data and comparing with ECC values
Ensembl Biomart102 was used to find one to one orthologs of S. cerevisiae genes in humans (of ‘Human homology type’ ‘ortholog_one2one’; all ‘ortholog_one2many’ and ‘many2many’ orthologs were excluded). For these human orthologs of yeast genes, the previously reported ‘evolutionary variance’ values across mammalian species from the original publication42 (based on an Ornstein Uhlenbeck (OU model)42) were directly used. Here, the 25% of genes with the lowest ‘evolutionary variance’ were considered to have conserved expression and the top 25% were considered to be not conserved (the same thresholds used in the original study42). This was done separately for each profiled tissue (brain, heart, kidney, liver, lung and skeletal muscle). Subsequently, a human ortholog for a yeast gene was considered to have conserved (or non-conserved) expression if it was found to have conserved (or non-conserved) expression in at least one of the profiled tissues. Genes with conflicting expression conservation classes across tissues were excluded from the analysis. To compare to ECC values, the p-value of a two-sided Wilcoxon rank-sum test was estimated comparing the ECC values for genes in the “conserved” and “not conserved” categories (implemented using the scipy.stats.ranksums SciPy95 function).
Quantifying sequence dissimilarity using mean Hamming distance
For each group of orthologous yeast gene promoters (with ungapped alignments), the mean of Hamming distances between each pair of orthologous promoters across the 1,011 isolates was calculated.
Generation of CDC36 promoter strains by allele swapping
Strains with a restored Upc2p binding site in the CDC36 promoter region were obtained using a previously described CRISPR-Cas9 method103. Guide RNAs (gRNAs) were designed using the Benchling online tool (https://www.benchling.com/) and cloned in a pGZ110 derived plasmid104, using standard “Golden Gate Assembly”105. Plasmids carrying the gRNA and Cas9 gene were then co-transformed with a synthetic DNA fragment (ssODN) composed of a 100 bp sequence with perfect complementarity to the background promoter sequence (WE) but for the centrally-located targeted alleles that overlap the Upc2p binding site. Allele swapping was confirmed by Sanger sequencing (Macrogen, South Korea). Sequences were analyzed using the SGRP (Saccharomyces Genome Resequencing Project) BLAST server (http://www.moseslab.csb.utoronto.ca/sgrp/blast_new/) and MUSCLE tool in Geneious v10.1. All primers and ssODNs used are listed in Supplementary Table 2.
RNA extraction and qPCR of CDC36
Gene expression analysis was performed by qPCR from cultures growth in SD medium supplemented with uracil (0.02% p/v). Samples were grown until exponential phase (OD 0.6–0.8), collected by centrifugation and treated with 10 units of Zymolyase 20T (50mg/ml) for 30 min at 37°C. RNA was extracted using E.Z.N.A Total RNA kit I (OMEGA) according to manufacturers’ instructions. Genomic DNA traces were then removed by treating samples with DNase I (Promega). RNA concentrations were estimated using a Qubit system and verified by 1.5% agarose gel. RNA extractions were performed in three biological replicates.
cDNA was synthesized using 200 units of M-MLV Reverse transcriptase (Promega), 0.5 µg of Oligo (dT)15 primer and 1 µg of RNA in a final volume of 25 µL according to manufacturers’ instructions. qPCR reactions were carried out using Brilliant II SYBR® Green QPCR Master Mix (Agilent Technologies) in a final volume of 10 µL, containing 0.2 µM of each primer and 1 µL of the cDNA previously synthesized. qPCR reactions were carried out in three technical replicates per biological replicate using an Eco Real-Time PCR system (Illumina, Inc.) under the following conditions: 95°C for 15 min and 40 cycles at 95°C for 10 s and 58°C for 30 s. Primers used are listed in Supplementary Table 2. The relative expression of CDC36 was quantified using the 2(-ΔΔCt) approach106, and normalized with two housekeeping genes as previously described107, using the median Ct of the three technical replicates for each sample. The housekeeping genes ACT1 and RPN2 were used as previously described108.
Growth curves of CDC36 mutant and wild type alleles
Growth curves incorporating carbon source switching from glucose to galactose were generated as previously described109. Pre-cultures were grown in YNB containing 5% glucose medium at 30ºC for 24 h. Cultures were then diluted to an initial OD600nm of 0.1 in fresh YNB 5% glucose medium for an extra overnight growth. The next day, cultures were used to inoculate a 96-well plate with a final volume of 200μL YNB with 5% galactose with an initial OD600nm of 0.1. In parallel, a control plate containing YNB with 5% glucose was similarly inoculated. All experiments were performed in triplicate. OD600nm was monitored every 30 min using a Tecan Sunrise absorbance microplate reader (Tecan Group Ltd.). The kinetic parameters of lag phase, growth efficiency (ΔOD600 nm) and maximum specific growth rate (µmax) were determined as previously described110, fitting the curves with the Gompertz function using R version 3.3.2. All growth parameters are expressed as the ratio of growth within YNB+galactose to YNB+glucose to control for phenotypic variation that results from something other than the carbon source switch.
Fitness responsivity
The empirically-determined relationships between the expression levels to organismal fitness for each of 80 genes 11 were re-analyzed. Published expression-to-fitness curves in glucose media for each of 80 genes were obtained from the Supplementary Data of the original publication11. For each of these curves, the total variation (Extended Data Fig. 5) was calculated by partitioning the expression range into 36 regular intervals (as reported in the ‘impulse fit’ of the expression-to-fitness curves in the original publication11) and summing the absolute difference in fitness at the endpoints of each partition as follows , for each gene’s expression-to-fitness function, . The same qualitative relationship between a gene’s ECC and fitness responsivity as reported in other studies111–113 was observed, including LCB2 (ECC 2.15 and high fitness responsivity112) and MLS1 (ECC −1.32 and extremely low fitness responsivity113).
Mutational robustness
For every sequence, mutational robustness was defined as the fraction of sequences in its 3L mutational neighborhood that altered the expression by an amount less than ϵ, where ϵ is set at two times the standard deviation of expression variance across all genes with an ECC >0 (here, ϵ = 0.1616; ECC calculated using the 1,011 S. cerevisiae genomes, Extended Data Fig. 4c). Using different values for ϵ yielded very similar results.
The evolvability vector
To derive the evolvability vector for a given sequence, expression changes associated with single base changes in every possible position were sorted to obtain a monotonically increasing vector of length 3L for each sequence of length L (here, L=80; 3L=240; Fig. 4a, left, Methods). Formally, to compute an evolvability vector for a sequence s0, for each sequence si in the 3L mutational neighborhood of s0, the difference between the predicted expression of si and that of s0 : was calculated, where represents the predicted expression of the transformer model. The evolvability vector is defined as the vector D , sorted such that (i.e. di values are in ascending order).
Power law distribution analysis
The list of the absolute values of the evolvability vectors for all native sequences was used to define the distribution of the magnitude of the expression effect of mutations. The powerlaw114 Python package was used to determine whether the data fit a power law distribution. The ‘Fit’ function with an ‘xmin’ parameter of 0.5 was used to determine the exponent and the ‘distribution_compare’ function was used to determine the p-value for the fit (Extended Data Fig. 4d, Supplementary Fig. 2h).
Characterizing the archetypal evolvability space
The evolvability vectors for a new random sample of a million sequences were used as input to an autoencoder with an archetypal regularization constraint45 on the embedding layer. The autoencoder was trained using the AANet implementation made available with the publication45 with no noise added to the archetypal layer during training, a linear activation on the output layer, an equal weight of 1 on each of the loss terms (the mean squared error loss term along with the non-negativity and convexity constraints), a learning rate of 0.001, and a minibatch size of 4,096. The autoencoder accepts an evolvability vector (of length 240 for an 80bp sequence) as input to the first encoder layer, where each node in the input layer is connected to each node in the encoder layer (fully connected layer). Every layer in the autoencoder was fully connected. The encoder architecture used was [1024, 512, 256, 128, 64], where each entry corresponds to the number of nodes in the corresponding hidden layer and the decoder architecture was the encoder’s mirror image. The output layer was the same shape as input layer and each node in the last decoder layer was connected to each node in the output layer. To select the optimal number of archetypes, the autoencoder was first trained for a 1,000 minibatches separately for 1 to 9 archetypes. Following the recommended approach45 for picking the optimal number of archetypes, we used an elbow plot of mean squared error on the evolvability vectors (here, using native sequences) vs. the number of archetypes in the autoencoder (Extended Data Fig. 6a).
The autoencoder was then trained from scratch with 3 archetypes, using the full training data and parameters for 250,000 batches. Since this autoencoder aims to reconstruct the original evolvability vector for each sequence by learning feature representations after passing them through an information bottleneck, its reconstruction accuracy was first verified on the set of native yeast promoter sequences (Extended Data Fig. 6b, Pearson’s r = 0.992). To visualize the evolvability vectors corresponding to sequences in 2 dimensions (2D), the evolvability vectors corresponding to the three archetypes were first generated by decoding their archetypal latent space coordinates ((1,0,0), (0,1,0) and (0,0,1)) through the decoder, and MDS was performed on the decoded evolvability vectors of the archetypes. Then, as previously described45, the encoded evolvability vector of each new sequence was projected into the 2D MDS space by representing it as a mixture of the archetypes and interpolating them between the MDS coordinates of each archetype. For every sequence, the following equivalent representations can now be computed: (i) its evolvability vector, (ii) an archetypal triplet quantifying the similarity of its encoded (latent space) evolvability vector to the three archetypes and (iii) a two-dimensional multidimensional scaling (MDS) coordinate45 for visualizing the evolvability vectors. The representation of the evolvability vector for each sequence in this archetypal space is now bounded by a simplex (whose vertices correspond to the 3 evolvability archetypes). For each native and natural yeast promoter sequence from the sequence space, the archetypal triplet and MDS coordinates were inferred using its evolvability vector with this trained autoencoder. The MDS coordinates for the archetypes and the native yeast promoter sequences were used to generate the visualizations of the sequence space shown. This archetypal characterization of evolvability vectors allows the encoding and visualization of sequences by their evolvability in the context of a fitness landscape.
Visualizing promoter fitness landscapes
1000 random sequences were sampled and projected onto the MDS coordinate system for visualizing the sequence space described above. The expression level of each sequence was calculated using our model, and expression values were scaled so that the minimum was 0 and maximum was 1. Previously quantified expression-to-fitness relationships11 to compute fitness (fraction of wildtype growth rate) by using cubic spline interpolation (implemented using the scipy.interpolate.CubicSpline SciPy95 function) on the expression level after scaling the measured expression-to-fitness curves to have an expression range of 0 to 1. These fitness values were then used to generate the contour plots (implemented using the matplotlib.pyplot.tricontourf function; Fig. 4e, Extended Data Fig. 7) that visualize the fitness landscape in that gene’s promoter sequence space.
Extended Data
Supplementary Material
Acknowledgements
We thank Google TPU Research Cloud for TPU access, Leslie Gaffney for help with figure preparation, Broad Genomics Platform for sequencing work, Jan-Christian Hütter for advice on fitness responsivity, Jenna Pfiffner-Borges for help with RNA-seq, Ruby Yu, Byron Lee and Nima Jaberi for manuscript feedback and members of the Regev lab for discussions. E.D.V. was supported by the MIT Presidential Fellowship. C.G.D. was supported by a Canadian Institutes for Health Research Fellowship and the NIH (K99-HG009920–01). F.A.C. and J.M. were supported by ANID - Programa Iniciativa Científica Milenio - ICN17_022. Work was supported by the Klarman Cell Observatory and HHMI. A.R. was an Investigator of the HHMI.
Footnotes
Conflict of Interest statement
A.R. is a co-founder and equity holder of Celsius Therapeutics, an equity holder in Immunitas, and until July 31, 2020 was an S.A.B. member of ThermoFisher Scientific, Syros Pharmaceuticals, Neogene Therapeutics and Asimov. From August 1, 2020, A.R. is an employee of Genentech. The other authors declare no competing interests.
Code Availability
Code is available on GitHub at https://github.com/1edv/evolution and CodeOcean at https://codeocean.com/capsule/8020974/tree. A web app is available at https://1edv.github.io/evolution/.
Data Availability
Data generated for this study are available on NCBI’s GEO, accession numbers GSE163045 and GSE163866. All models and processed data are available on Zenodo at https://zenodo.org/record/4436477.
References
- 1.Wittkopp PJ & Kalay G Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence. Nat. Rev. Genet 13, 59–69 (2011). [DOI] [PubMed] [Google Scholar]
- 2.Hill MS, Vande Zande P & Wittkopp PJ Molecular and evolutionary processes generating variation in gene expression. Nature Reviews Genetics 1–13 (2020). [DOI] [PMC free article] [PubMed]
- 3.Fuqua T et al. Dense and pleiotropic regulatory information in a developmental enhancer. Nature 1–5 (2020). [DOI] [PMC free article] [PubMed]
- 4.de Visser JAGM & Krug J Empirical fitness landscapes and the predictability of evolution. Nat. Rev. Genet 15, 480–490 (2014). [DOI] [PubMed] [Google Scholar]
- 5.Kondrashov DA & Kondrashov FA Topological features of rugged fitness landscapes in sequence space. Trends in Genetics 31, 24–33 (2015). [DOI] [PubMed] [Google Scholar]
- 6.de Visser JAGM, Elena SF, Fragata I & Matuszewski S The utility of fitness landscapes and big data for predicting evolution. Heredity (Edinb) 121, 401–405 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Weirauch MT & Hughes TR Conserved expression without conserved regulatory sequence: the more things change, the more they stay the same. Trends Genet 26, 66–74 (2010). [DOI] [PubMed] [Google Scholar]
- 8.Orr HA The genetic theory of adaptation: a brief history. Nature Reviews Genetics 6, 119–127 (2005). [DOI] [PubMed] [Google Scholar]
- 9.Weinreich DM, Lan Y, Wylie CS & Heckendorn RB Should evolutionary geneticists worry about higher-order epistasis? Current Opinion in Genetics & Development 23, 700–707 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Venkataram S et al. Development of a Comprehensive Genotype-to-Fitness Map of Adaptation-Driving Mutations in Yeast. Cell 166, 1585–1596.e22 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Keren L et al. Massively Parallel Interrogation of the Effects of Gene Expression Levels on Fitness. Cell 166, 1282–1294.e18 (2016). [DOI] [PubMed] [Google Scholar]
- 12.Sarkisyan KS et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Ogden PJ, Kelsic ED, Sinai S & Church GM Comprehensive AAV capsid fitness landscape reveals a viral gene and enables machine-guided design. Science 366, 1139–1143 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Pitt JN & Ferré-D’Amaré AR Rapid construction of empirical RNA fitness landscapes. Science 330, 376–379 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Shultzaberger RK, Malashock DS, Kirsch JF & Eisen MB The Fitness Landscapes of cis-Acting Binding Sites in Different Promoter and Environmental Contexts. PLOS Genetics 6, e1001042 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Mustonen V, Kinney J, Callan CG & Lässig M Energy-dependent fitness: a quantitative model for the evolution of yeast transcription factor binding sites. Proc. Natl. Acad. Sci. U.S.A 105, 12376–12381 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Hartl DL What Can We Learn From Fitness Landscapes? Curr Opin Microbiol 0, 51–57 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Otwinowski J & Nemenman I Genotype to phenotype mapping and the fitness landscape of the E. coli lac promoter. PLoS ONE 8, e61570 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Sinai S & Kelsic ED A primer on model-guided exploration of fitness landscapes for biological sequence design. arXiv:2010.10614 [cs, q-bio] (2020).
- 20.Zhou J & Troyanskaya OG Predicting effects of noncoding variants with deep learning-based sequence model. Nature Methods 12, 931–934 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Avsec Ž et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet 53, 354–366 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Shrikumar A, Greenside P & Kundaje A Learning Important Features Through Propagating Activation Differences. arXiv [cs.CV] (2017).
- 23.Avsec Ž et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Fragata I, Blanckaert A, Louro MAD, Liberles DA & Bank C Evolution in the light of fitness landscape theory. Trends in Ecology & Evolution 34, 69–82 (2019). [DOI] [PubMed] [Google Scholar]
- 25.Payne JL & Wagner A The causes of evolvability and their evolution. Nature Reviews Genetics 20, 24–38 (2019). [DOI] [PubMed] [Google Scholar]
- 26.de Boer CG et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat. Biotechnol 38, 56–65 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Crocker J et al. Low affinity binding site clusters confer hox specificity and regulatory robustness. Cell 160, 191–203 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Habib N, Wapinski I, Margalit H, Regev A & Friedman N A functional selection model explains evolutionary robustness despite plasticity in regulatory networks. Mol. Syst. Biol 8, 619 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Gillespie JH Molecular Evolution Over the Mutational Landscape. Evolution 38, 1116–1129 (1984). [DOI] [PubMed] [Google Scholar]
- 30.Jerison ER & Desai MM Genomic investigations of evolutionary dynamics and epistasis in microbial evolution experiments. Curr. Opin. Genet. Dev 35, 33–39 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Sæther B-E & Engen S The concept of fitness in fluctuating environments. Trends Ecol. Evol 30, 273–281 (2015). [DOI] [PubMed] [Google Scholar]
- 32.Vaswani A et al. Attention is All you Need. in Advances in Neural Information Processing Systems 30 (eds. Guyon I et al.) 5998–6008 (Curran Associates, Inc., 2017). [Google Scholar]
- 33.Weirauch MT et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol 31, 126–134 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Yang N & Bielawski N Statistical methods for detecting molecular adaptation. Trends Ecol. Evol. (Amst.) 15, 496–503 (2000). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Moses AM Statistical tests for natural selection on regulatory regions based on the strength of transcription factor binding sites. BMC Evolutionary Biology 9, 286 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Rifkin SA, Houle D, Kim J & White KP A mutation accumulation assay reveals a broad capacity for rapid evolution of gene expression. Nature 438, 220–223 (2005). [DOI] [PubMed] [Google Scholar]
- 37.Peter J et al. Genome evolution across 1,011 Saccharomyces cerevisiae isolates. Nature 556, 339–344 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Erb I & van Nimwegen E Transcription factor binding site positioning in yeast: proximal promoter motifs characterize TATA-less promoters. PLoS One 6, e24279 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Gilad Y, Oshlack A & Rifkin SA Natural selection on gene expression. Trends Genet 22, 456–461 (2006). [DOI] [PubMed] [Google Scholar]
- 40.Alhusaini N & Coller J The deadenylase components Not2p, Not3p, and Not5p promote mRNA decapping. RNA 22, 709–721 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Yang J-R, Maclean CJ, Park C, Zhao H & Zhang J Intra and Interspecific Variations of Gene Expression Levels in Yeast Are Largely Neutral: (Nei Lecture, SMBE 2016, Gold Coast). Mol. Biol. Evol 34, 2125–2139 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Chen J et al. A quantitative framework for characterizing the evolutionary history of mammalian gene expression. Genome Res 29, 53–63 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Payne JL & Wagner A Mechanisms of mutational robustness in transcriptional regulation. Front. Genet 6, (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Shoval O et al. Evolutionary trade-offs, Pareto optimality, and the geometry of phenotype space. Science 336, 1157–1160 (2012). [DOI] [PubMed] [Google Scholar]
- 45.van Dijk D et al. Finding Archetypal Spaces Using Neural Networks in (IEEE, 2019). [Google Scholar]
- 46.He X, Duque TSPC & Sinha S Evolutionary origins of transcription factor binding site clusters. Mol Biol Evol 29, 1059–1070 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Cliften PF et al. Surveying Saccharomyces genomes to identify functional elements by comparative DNA sequence analysis. Genome Res 11, 1175–1186 (2001). [DOI] [PubMed] [Google Scholar]
- 48.Heinz S, Romanoski CE, Benner C & Glass CK The selection and function of cell type-specific enhancers. Nature Reviews Molecular Cell Biology 16, 144–154 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Lehner B Selection to minimise noise in living systems and its implications for the evolution of gene expression. Molecular Systems Biology 4, 170 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Metzger BPH, Yuan DC, Gruber JD, Duveau F & Wittkopp PJ Selection on noise constrains variation in a eukaryotic promoter. Nature 521, 344–347 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
Methods References
- 51.Kosuri S et al. Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proc. Natl. Acad. Sci. U. S. A 110, 14024–14029 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Shalem O et al. Systematic dissection of the sequence determinants of gene 3’ end mediated expression control. PLoS Genet 11, e1005147 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Kinney JB, Murugan A, Callan CG Jr & Cox EC Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl. Acad. Sci. U. S. A 107, 9158–9163 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Sharon E et al. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat. Biotechnol 30, 521–530 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Melnikov A et al. Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nature biotechnology 30, 271–277 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Kwasnieski JC, Mogno I, Myers CA, Corbo JC & Cohen BA Complex effects of nucleotide variants in a mammalian cis-regulatory element. Proceedings of the National Academy of Sciences of the United States of America 109, 19498–19503 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Kircher M et al. Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution. Nature Communications 10, 3583 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Townsley KG, Brennand KJ & Huckins LM Massively parallel techniques for cataloguing the regulome of the human brain. Nat. Neurosci 23, 1509–1521 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Renganaath K et al. Systematic identification of cis-regulatory variants that cause gene expression differences in a yeast cross. Elife 9, (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Langmead B & Salzberg SL Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357–359 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Alipanahi B, Delong A, Weirauch MT & Frey BJ Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology 33, 831–838 (2015). [DOI] [PubMed] [Google Scholar]
- 62.T, C. et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 15, (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Avsec Ž et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nature Biotechnology 37, 592–600 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Quang D & Xie X FactorNet: A deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods (San Diego, Calif.) 166, 40–47 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Shrikumar A, Greenside P & Kundaje A Reverse-complement parameter sharing improves deep learning models for genomics. bioRxiv 103663 (2017). [Google Scholar]
- 66.Morrow A et al. Convolutional Kitchen Sinks for Transcription Factor Binding Site Prediction. arXiv:1706.00125 [q-bio] (2017).
- 67.Kelley DR, Snoek J & Rinn JL Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res 26, 990–999 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Koo PK, Majdandzic A, Ploenzke M, Anand P & Paul SB Global importance analysis: An interpretability method to quantify importance of genomic features in deep neural networks. PLoS Comput. Biol 17, e1008925 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Quang D & Xie X DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res 44, e107 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Kingma DP & Ba J Adam: A Method for Stochastic Optimization. arXiv [cs.LG] (2014).
- 71.Abadi M et al. TensorFlow: A system for large-scale machine learning. arXiv:1605.08695 [cs] (2016).
- 72.Jouppi NP et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. arXiv:1704.04760 [cs] (2017).
- 73.Li J, Pu Y, Tang J, Zou Q & Guo F DeepATT: a hybrid category attention neural network for identifying functional effects of DNA sequences. Brief. Bioinform (2020) doi: 10.1093/bib/bbaa159. [DOI] [PubMed]
- 74.Ullah F & Ben-Hur A A self-attention model for inferring cooperativity between regulatory features. Nucleic Acids Res (2021) doi: 10.1093/nar/gkab349. [DOI] [PMC free article] [PubMed]
- 75.Clauwaert J, Menschaert G & Waegeman W Explainability in transformer models for functional genomics. Brief. Bioinform (2021) doi: 10.1093/bib/bbab060. [DOI] [PMC free article] [PubMed]
- 76.Hinton G & Tieleman T Lecture 6.5---RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning (2012).
- 77.Sinai S et al. AdaLead: A simple and robust adaptive greedy search algorithm for sequence design. arXiv:2010.02141 [cs, math, q-bio] (2020).
- 78.Linder J, Bogard N, Rosenberg AB & Seelig G A Generative Neural Network for Maximizing Fitness and Diversity of Synthetic DNA and Protein Sequences. Cell Systems 11, 49–62.e16 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Brookes David and Park Hahnbeom and Listgarten Jennifer. Conditioning by adaptive sampling for robust design. Proceedings of Machine Learning Research (2020).
- 80.Killoran N, Lee LJ, Delong A, Duvenaud D & Frey BJ Generating and designing DNA with deep generative models. arXiv:1712.06148 [cs, q-bio, stat] (2017).
- 81.Fortin F-A, Rainville F-MD, Gardner M-A, Parizeau M & Gagné C DEAP: Evolutionary Algorithms Made Easy. Journal of Machine Learning Research 13, 2171–2175 (2012). [Google Scholar]
- 82.Jaeger SA et al. Conservation and regulatory associations of a wide affinity range of mouse transcription factor binding sites. Genomics 95, 185–195 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83.Tanay A Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res 16, 962–972 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Sniegowski PD & Gerrish PJ Beneficial mutations and the dynamics of adaptation in asexual populations. Philos. Trans. R. Soc. Lond. B Biol. Sci 365, 1255–1263 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 85.Szendro IG, Franke J, Visser J. A. G. M. de & Krug J Predictability of evolution depends nonmonotonically on population size. PNAS 110, 571–576 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86.Orr HA The Population Genetics of Adaptation: The Adaptation of Dna Sequences. Evolution 56, 1317–1330 (2002). [DOI] [PubMed] [Google Scholar]
- 87.Bailey TL DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics (Oxford, England) 27, 1653–1659 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88.de Boer CG & Hughes TR YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities. Nucleic acids research 40, D169–79 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 89.Kent WJ BLAT—The BLAST-Like Alignment Tool. Genome Res 12, 656–664 (2002). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 90.Cherry JM et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res 40, D700–D705 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 91.Smith JD, McManus KF & Fraser HB A novel test for selection on cis-regulatory elements reveals positive and negative selection acting on mammalian transcriptional enhancers. Mol. Biol. Evol 30, 2509–2518 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92.Liu J & Robinson-Rechavi M Robust inference of positive selection on regulatory sequences in the human brain. Sci Adv 6, (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 93.Rice DP & Townsend JP A test for selection employing quantitative trait locus and mutation accumulation data. Genetics 190, 1533–1545 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 94.Denver DR, Morris K, Lynch M & Thomas WK High mutation rate and predominance of insertions in the Caenorhabditis elegans nuclear genome. Nature 430, 679–682 (2004). [DOI] [PubMed] [Google Scholar]
- 95.Virtanen P et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods 17, 261–272 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 96.Thompson DA et al. Evolutionary principles of modular gene regulation in yeasts. eLife 2, e00603 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 97.Yassour M et al. Strand-specific RNA sequencing reveals extensive regulated long antisense transcripts that are conserved across yeast species. Genome Biology 11, R87 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 98.Grabherr MG et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29, 644–652 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 99.Wu TD & Watanabe CK GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859–1875 (2005). [DOI] [PubMed] [Google Scholar]
- 100.Wapinski I, Pfeffer A, Friedman N & Regev A Natural history and evolutionary principles of gene duplication in fungi. Nature 449, 54–61 (2007). [DOI] [PubMed] [Google Scholar]
- 101.Li B & Dewey CN RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 102.Yates AD et al. Ensembl 2020. Nucleic Acids Res 48, D682–D688 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 103.DiCarlo JE et al. Genome engineering in Saccharomyces cerevisiae using CRISPR-Cas systems. Nucleic Acids Res 41, 4336–4343 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 104.Fleiss A et al. Reshuffling yeast chromosomes with CRISPR/Cas9. PLoS Genet 15, e1008332 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 105.Horwitz AA et al. Efficient Multiplexed Integration of Synergistic Alleles and Metabolic Pathways in Yeasts via CRISPR-Cas. Cell Syst 1, 88–96 (2015). [DOI] [PubMed] [Google Scholar]
- 106.Livak KJ & Schmittgen TD Analysis of relative gene expression data using real-time quantitative PCR and the 2(-Delta Delta C(T)) Method. Methods 25, 402–408 (2001). [DOI] [PubMed] [Google Scholar]
- 107.Vandesompele J et al. Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biol 3, RESEARCH0034 (2002). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 108.Teste M-A, Duquenne M, François JM & Parrou J-L Validation of reference genes for quantitative expression analysis by real-time RT-PCR in Saccharomyces cerevisiae. BMC Mol. Biol 10, 99 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 109.Mardones W et al. Rapid selection response to ethanol in Saccharomyces eubayanus emulates the domestication process under brewing conditions. Microb. Biotechnol (2021) doi: 10.1111/1751-7915.13803. [DOI] [PMC free article] [PubMed]
- 110.Ibstedt S et al. Concerted evolution of life stage performances signals recent selection on yeast nitrogen use. Mol. Biol. Evol 32, 153–161 (2015). [DOI] [PubMed] [Google Scholar]
- 111.Rich MS et al. Comprehensive Analysis of the SUL1 Promoter of Saccharomyces cerevisiae. Genetics 203, 191–202 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 112.Rest JS et al. Nonlinear fitness consequences of variation in expression level of a eukaryotic gene. Mol. Biol. Evol 30, 448–456 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 113.Bergen AC, Olsen GM & Fay JC Divergent MLS1 Promoters Lie on a Fitness Plateau for Gene Expression. Mol. Biol. Evol 33, 1270–1279 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 114.Alstott J, Bullmore E & Plenz D Powerlaw: a Python package for analysis of heavy-tailed distributions. PLoS One 9, e85777 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Data generated for this study are available on NCBI’s GEO, accession numbers GSE163045 and GSE163866. All models and processed data are available on Zenodo at https://zenodo.org/record/4436477.