Heritable variation in gene expression is common within and between species. This variation arises from mutations that alter the form or function of molecular gene regulatory networks that are then filtered by natural selection. High-throughput methods for introducing mutations and characterizing their cis- and trans-regulatory effects on gene expression (particularly, transcription) are revealing how different molecular mechanisms generate regulatory variation, while studies comparing these mutational effects to variation seen in the wild are teasing apart the role of neutral and non-neutral evolutionary processes. This integration of molecular and evolutionary biology allows us to understand how the variation in gene expression we see today came to be and to predict how it is most likely to evolve in the future.
Keywords: gene regulatory network, transcription, evolution, mutation, cis-regulation, trans-regulation
The regulation of gene expression is a critical step in translating genotypes into phenotypes. Variation in this regulation is common within and between species1, and contributes to trait diversity. For example, changes in the regulation of gene expression have been shown to contribute to divergent pigmentation in plants and animals2,3, polymorphic body size in mice4, sporulation rate in domesticated yeast5, and many other morphological, physiological, and behavioural traits6,7, including disease states in humans8. Understanding how regulatory variation arises and evolves is thus critical for understanding many aspects of biology.
Genetic variation that affects the activity of regulatory networks underlies variation in gene expression. These networks include interactions among proteins, RNAs, and DNA sequences. Transcription factor proteins and DNA sequences such as enhancers and promoters are most often considered to define the structure of gene regulatory networks9,10, but protein-protein interactions, signaling pathways, and even metabolic states can also impact their activity11. Mutations that alter any of these elements can give rise to variation in gene expression. Such mutations can be classified as either cis-acting or trans-acting12: cis-acting mutations alter expression of a gene located on the same chromosome and tend to be located close to the affected gene, whereas trans-regulatory mutations have effects on gene expression that are mediated by diffusible molecules (such as RNAs and proteins) and can be located anywhere in the genome. Both types of mutations contribute to variation in gene expression, but differences in their molecular mechanisms suggest that they might contribute unequally to regulatory variation over evolutionary time.
Genomic studies describing variation in gene expression and the relative contributions of cis- and trans-acting variants have now been performed for diverse plant, animal, and microbial species13. As with all traits, this variation reflects the introduction of new genetic variants by mutation, the filtering of these variants by natural selection, and the chance survival of variants mediated by genetic drift. The extent to which each of these processes shapes the variation we see in wild populations, however, remains difficult to discern. For example, if one gene shows more variation in its expression than another, it might be because expression of the first gene is under less selective constraint or because a greater fraction of new mutations alters its expression (among other possibilities). Studies investigating the role of selection in shaping regulatory variation have thus far relied heavily on assumptions about the effects of new mutations because little empirical data was available14–17. However, this knowledge gap is beginning to close as recent advances in DNA synthesis, genome editing, and high-throughput expression analysis allow regulatory mutations to be generated and characterized on a large scale18.
Here, we examine our current understanding of the molecular and evolutionary processes generating variation in gene expression. We focus on variation in RNA expression because this is where the most data are available; quantifying variation in protein expression levels remains much more technically challenging. We begin by briefly reviewing studies describing the relative contributions of cis- and trans-regulatory variation to variation in gene expression. We then discuss the molecular sources of this regulatory variation, including studies describing the effects of mutations in these sequences as well as their contributions to expression differences within and between species. Finally, we close by showing how contrasting the effects of new mutations and genetic variants segregating in natural populations reveals the evolutionary processes responsible for the evolution of gene expression.
Partitioning cis- and trans-regulatory variation
Distinguishing between cis- and trans-regulatory variation reveals the relationship between mutations and their effects on gene expression. Two general strategies have primarily been used to disentangle the effects of cis- and trans-regulatory variants on a genomic scale. The first approach uses allele-specific expression (ASE) in F1 hybrids to compare activity of cis-regulatory alleles in a common trans-regulatory background to expression in the parents of the F1 hybrid19. The second strategy uses statistical associations between genetic variants and gene expression to identify quantitative trait loci affecting gene expression (eQTL)20,21. These two approaches provide complementary information about cis- and trans-regulatory variation, with the first capturing the net effect of all cis- and trans-regulatory variants, and the second providing information about the effects of individual loci.
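The logic of the ASE comparison can be illustrated with a minimal sketch. It assumes a commonly used formulation in which the parental expression difference reflects the sum of cis and trans effects on a log2 scale, while the allelic difference within the F1 hybrid reflects cis effects alone; the function name and the expression values are hypothetical, and real analyses add replication and statistical testing.

```python
import math

def partition_cis_trans(parent1, parent2, hybrid_allele1, hybrid_allele2):
    """Return (total, cis, trans) log2 expression differences for one gene.

    Inputs are normalized expression levels (hypothetical values). The
    subtraction assumes cis and trans effects are additive on the log2 scale.
    """
    # Difference between the parents: combined cis and trans effects
    total = math.log2(parent1 / parent2)
    # Difference between alleles in the hybrid: both alleles share one
    # trans-regulatory background, so only cis effects remain
    cis = math.log2(hybrid_allele1 / hybrid_allele2)
    # Remainder attributed to trans-regulatory differences
    trans = total - cis
    return total, cis, trans

# Hypothetical gene: four-fold parental difference, two-fold allelic difference
total, cis, trans = partition_cis_trans(400.0, 100.0, 200.0, 100.0)
print(f"total={total:.2f} cis={cis:.2f} trans={trans:.2f}")
```

In this hypothetical case, half of the log2 expression difference between the parents would be attributed to cis-regulatory divergence and half to trans-regulatory divergence.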
Studies using ASE to estimate the relative contributions of cis- and trans-regulatory variants to variation in gene expression have been conducted in a variety of taxa, including plants22–25, yeast26–29, mice30,31, birds32,33, wasps34, and flies35–38. These studies include analysis of gene expression among individuals from outbred populations, between more isolated strains of the same species, and between species, each of which captures the evolution of gene expression at a different stage in the evolutionary process. Within species, trans-regulatory variants seem to contribute more to variation in gene expression than cis-regulatory variants13,28,29,39,40. This pattern has been suggested to be due to a larger mutational target size for trans-regulatory variants41: that is, there are more places in the genome where a mutation can affect a gene’s expression in trans than in cis. Trans-acting variants are also often assumed to affect expression of more genes on average than cis-acting variants. However, cis-regulatory variants often make similar24,37,38,42 or greater24,31,35 contributions to gene expression divergence between species. Studies directly comparing the relative contributions of cis- and trans-regulatory variants to expression divergence suggest that the relative contribution of cis-regulatory variants increases with divergence time29,37 (Figure 1A, B). This increasing cis-regulatory contribution can be explained by cis-regulatory variants being more beneficial28,43 and/or less deleterious39 than trans-regulatory variants, which might result from differences in their average pleiotropy, as discussed in more detail below.
Figure 1: cis- and trans-regulatory contributions to expression differences between and within species.
(a,b) An analysis of allele-specific expression in hybrid yeast (Saccharomyces) species with a range of divergence times (a, branch lengths reflect relative divergence times) showed increasing contributions of cis-regulatory variation to expression differences with increasing divergence time (b, notches in the boxplot indicate 95% CI of the median). (c,d) A highly-powered study of eQTL in Saccharomyces cerevisiae shows how the number of eQTL affecting expression varies among genes (c) and that putatively cis-acting eQTL tend to have larger effects than trans-regulatory eQTL (d). Panel (b) reproduced with permission from Coolon et al.37, and panels (c) and (d) reproduced with permission from Albert et al.47.
Studies identifying eQTL contributing to variation in gene expression have been conducted in a similarly diverse array of taxa12,44–46. Data from such studies provide insight into the number, location, and effects of regulatory variants within the genome and have shown that variation in gene expression is typically polygenic, with multiple variants contributing to variation in expression of most genes. For example, a study of the baker’s yeast, Saccharomyces cerevisiae, with 90% power to identify eQTL explaining 2.5% or more of the variation in a gene’s expression, found a median of 6 eQTL affecting expression of individual genes, with a maximum of 21 eQTL47 (Figure 1C). eQTL often span relatively large genomic regions and may contain multiple genetic variants, making it difficult to identify causal variants. Approaches that increase the number of recombination breakpoints can be used to obtain higher resolution48. For example, eQTL mapping experiments that incorporate more than one generation of recombination to break up linked sites, followed by bulk segregant analysis of individuals with extreme phenotypes49, have found even more eQTL, with over 100 eQTL affecting expression of a single gene (TDH3) in S. cerevisiae50.
eQTL located close to the affected gene (that is, proximal) are often considered cis-acting whereas eQTL located further from the affected gene (that is, distal) are often considered trans-acting20. Consistent with this assumption, proximal eQTL often have allele-specific effects on gene expression51. Indeed, the largest study of eQTL to date, which was conducted by the Genotype-Tissue Expression (GTEx) consortium and surveyed gene expression in cells derived from 49 tissues from up to 838 humans, has shown a strong correlation between the estimated effect of eQTL designated as cis-acting and allele-specific measures of expression in heterozygous individuals52. Several eQTL studies have reported that the majority of heritable expression variation is explained by trans-acting eQTL53–55, some of which affect the expression of many genes and are known as “hotspots”47,56–58. By contrast, the GTEx study detected at least one cis-acting eQTL for nearly 95% of protein coding genes, whereas trans-acting eQTL were detected for only 121 protein coding genes. The number of individuals surveyed for each tissue was a strong predictor of the number of trans-acting eQTL detected, however, underscoring the importance of taking statistical power into account when comparing the number of trans-acting eQTL reported among studies52. The unequal power for detecting cis- and trans-regulatory variants must also be considered when comparing eQTL: systematically testing for trans-regulatory variants requires many more statistical tests and thus a greater multiple testing burden than cis-regulatory variants. For this reason, some eQTL studies have focused solely on identifying cis-eQTL48,59.
Relative effect sizes of putatively cis- and trans-eQTL can be more fairly compared. Such comparisons tend to show that cis-eQTL have larger effects on gene expression than trans-eQTL46,57. For example, in the GTEx study, more cis- than trans-acting eQTL caused a two-fold or greater change in gene expression52. Similarly, in a recent, highly powered eQTL mapping study between two strains of S. cerevisiae, the average cis-eQTL also explained more of the expression variation than the average trans-eQTL47. But genes are often regulated by multiple trans-regulatory variants, and sets of trans-eQTL affecting expression of the same gene tend to explain more of that gene’s expression variation than its cis-eQTL47,53,54 (Figure 1D). This observation is consistent with the greater combined contribution of trans-regulatory variation to polymorphic gene expression inferred using ASE.
Although ASE and eQTL studies reveal the relative contributions of cis- and trans-regulatory variation, they provide little insight into the specific genetic changes and molecular mechanisms altered by this variation. Only when such studies reach single variant resolution can they provide this type of insight, which is necessary for a complete understanding of why the patterns of regulatory variation we see today exist60. In the next two sections, we examine the molecular processes that give rise to cis- and trans-regulatory variation in more detail, highlighting studies examining the impact of mutations on these sequences as well as those that investigate their contribution to the evolution of gene expression.
Mechanisms generating cis-regulatory variation
cis-regulatory variation arises from genetic changes affecting sequences controlling expression of a particular allele of a gene. These sequences include the core promoter and enhancers of the gene, which both contain binding sites for transcription factors, chromatin structure influencing the accessibility of DNA to transcription factors, and sequences in the RNA transcript that affect its structure, stability, or translation. Below we discuss each of these components as a source of cis-regulatory variation.
Core promoters
At the most proximal level, a gene’s expression is controlled by its core promoter sequence, which contains binding sites for the general transcription factors necessary for transcription (Figure 2). Core promoter sequences typically lie close to the transcription start site, for example within 300 bp in humans61. Some of these core promoters contain discrete sequences with consistent positioning such as the TATA box or the downstream core promoter element, whereas others are enriched for sequence motifs such as CpG islands that are distributed over a broader region61,62.
Figure 2. Sources of cis-regulatory variation in eukaryotes.
Mutations (indicated with lightning bolts) affecting the core promoter (including in motifs such as the TATA box used to assemble the transcription machinery activating RNA polymerase), enhancers (whose functional units are transcription factor binding sites (TFBS)), chromatin accessibility (altered by nucleosome placement and stability) can have cis-regulatory effects on gene expression. Mutations that affect the splicing, stability, and/or translation of mRNA in an allele-specific manner can also be sources of cis-regulatory variation.
High-throughput mutagenesis studies assaying the effects of thousands of single-nucleotide changes on activity of core promoters show how variation in these sequences might contribute to regulatory variation within and among species. One of the first such studies63 used a massively parallel reporter assay (Box 3) to assess the impact of cis-regulatory mutations in core promoters from bacteriophage and humans, with activity determined using in vitro transcription assays. The largest effect mutations were located within TATA boxes and initiator regions overlapping the transcription start site. Outside of these motifs, most mutations had no statistically significant effect. However, a more recent, more highly-powered, study of core promoters in humans assayed the activity of various promoter alleles after integration into the genome of a human cell line and found that sequences outside of these key regions can also harbor genetic variation impacting promoter activity64. Studies of mutations in core promoters of the baker’s yeast Saccharomyces cerevisiae have also described a broader distribution of mutations within the promoter that have significant effects65.
Box 3. Surveying effects of cis-regulatory mutations.
Determining the distribution of mutational effects for a cis-regulatory sequence requires generating many alleles of the cis-regulatory element (ideally with each allele carrying a single mutation) and then assaying the ability of each allele to drive gene expression in a cell. Mutant alleles can be generated by programmable DNA synthesis on microarrays179, synthesis of DNA fragments with degenerate positions, error-prone PCR, or site-directed mutagenesis. After cloning these fragments upstream of a reporter gene or DNA barcode, and introducing these alleles into a cell (either in cell culture or by injecting into living organisms), expression of the reporter gene or barcode is measured. If the reporter gene is fluorescent, expression can be measured using flow cytometry or microscopy. If a barcode is used, expression is quantified based on the number of copies of each barcode observed in an RNA-seq experiment180.
Experiments coupling the high-throughput production of mutant alleles with a high-throughput readout of expression using barcodes are often referred to as massively parallel reporter assays181. Briefly, as shown in the figure, a library of regulatory element alleles is synthesized on an array, these DNA sequences are integrated into plasmids bearing unique DNA barcodes, and then these plasmids are transformed into cells. Finally, RNA-seq is used to measure expression of the barcode driven by each allele of the regulatory element. Using this technique, thousands of mutant alleles for one or multiple cis-regulatory elements can be assayed simultaneously. However, because alleles are not integrated into the genome, this experiment might not accurately predict the effects of cis-regulatory mutations in their native genomic contexts63,85. By contrast, studies using reporter genes are more likely to integrate cis-regulatory alleles into the genome and tend to have greater power to detect small changes in expression, but typically survey fewer cis-regulatory elements and mutations. Reporter genes that can be assayed in many single cells also make it easier to examine the impact of mutations on expression noise88. The next frontiers for this work are increasing the scale of reporter-gene experiments, increasing the sensitivity of single-cell barcoding strategies, and adding spatial information for expression in multicellular organisms182.
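The barcode-based readout can be sketched as follows. This is an illustrative calculation only: it assumes that each allele is tagged by one or more barcodes, that RNA barcode counts are normalized to the abundance of the same barcode in the plasmid DNA library, and that a pseudocount is added to avoid division by zero. The function name, barcodes, and counts are hypothetical and do not correspond to any specific published pipeline.

```python
import math
from collections import defaultdict
from statistics import median

def allele_activity(barcode_to_allele, rna_counts, dna_counts, pseudocount=1.0):
    """Median log2(RNA/DNA) ratio across the barcodes tagging each allele."""
    ratios = defaultdict(list)
    for barcode, allele in barcode_to_allele.items():
        # Pseudocount (an assumption) keeps ratios finite for unobserved barcodes
        rna = rna_counts.get(barcode, 0) + pseudocount
        dna = dna_counts.get(barcode, 0) + pseudocount
        ratios[allele].append(math.log2(rna / dna))
    # Summarize each allele by the median over its barcodes
    return {allele: median(values) for allele, values in ratios.items()}

# Hypothetical library: two barcodes for the reference allele, one for a mutant
barcode_to_allele = {"AAC": "wild_type", "GTT": "wild_type", "CCA": "tata_mutant"}
rna_counts = {"AAC": 799, "GTT": 399, "CCA": 49}
dna_counts = {"AAC": 199, "GTT": 99, "CCA": 99}

activity = allele_activity(barcode_to_allele, rna_counts, dna_counts)
# Mutational effect expressed relative to the reference allele (log2 scale)
effect = activity["tata_mutant"] - activity["wild_type"]
print(activity, effect)
```

Summarizing by the median over barcodes is one possible design choice that limits the influence of any single poorly sampled barcode; other summaries could be substituted.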
Despite the potential for core promoters to contribute to expression divergence, key elements of their sequence61, histone marks66, and function65 are often highly conserved among species. This conservation is presumably driven by the requirement for a functional promoter to express a gene as well as the strong functional constraints on proteins that bind to these sequences because they regulate so many different genes. Indeed, sequences within promoters that serve as binding sites for general transcription factors, such as TATA boxes, are the most highly conserved portions of mammalian core promoters61. However, a comparison of core promoter sequences between human and rhesus macaque suggested that core promoters for a small number of genes might be diverging due to positive selection67, and other work has shown that the gain and loss of core promoters contributes to expression divergence between mouse and human68. Furthermore, even if variation in the core promoter itself is not the source of expression divergence, the structure of the core promoter can still influence expression divergence. For example, the presence of a TATA box69,70, nucleosome positioning in the core promoter71, and tandem repeats in the core promoter sequence have all been shown to correlate with expression divergence in yeast72.
Compared to core promoters, enhancers are typically located further from the transcription start site in either upstream (5’), downstream (3’) or intronic regions73 (Figure 2) and seem to more often be the source of cis-regulatory variation affecting gene expression74–76. Because enhancers regulate gene expression in a more time-, tissue-, or environment-specific manner than core promoters, they are expected to be subject to less functional constraint due to pleiotropy77 and thus to be more evolvable78. Indeed, histone marks commonly associated with enhancers show greater divergence among mammalian species than histone marks associated with core promoters66. Although single-celled organisms such as S. cerevisiae lack enhancers, they have upstream activating and repressing sequences that often work in a similarly context-dependent manner79.
Transcription factor binding sites
The primary functional units within all of these cis-regulatory DNA sequences are binding sites for transcription factors, which can activate or repress transcription80. These sequences are short, degenerate, and able to evolve relatively quickly, even from random sequences81,82. Mutations that change the identity, affinity, orientation, number, and/or spacing of transcription factor binding sites (TFBSs) can alter cis-regulatory activity75,83,84. Large-scale mutagenesis studies of enhancers and other similar cis-regulatory elements have shown that although many mutations in these sequences can alter gene expression, mutations in TFBSs tend to have the largest effects85–88. Although TFBSs are often among the most highly conserved sequences within an enhancer89–92, they can also harbor genetic changes responsible for variation in gene expression within93,94 and between species95–97. However, in most cases where functional changes have been mapped to enhancers or similar cis-regulatory sequences, the specific genetic changes responsible for altering their function have not yet been identified6,98–100.
Chromatin accessibility
For a TFBS to regulate expression of a gene, the transcription factor it binds must be able to access the DNA sequence. In eukaryotes, DNA is packaged into chromatin by wrapping it around a complex of histone proteins known as a nucleosome, which can interfere with this access (Figure 2). Competition between nucleosomes and transcription factors for interactions with cis-regulatory DNA sequences can thus affect gene expression101–103, making genetic differences affecting chromatin structure another potential source of cis-regulatory variation24. Indeed, different patterns of nucleosome positioning at promoters have been shown to correlate with expression plasticity, species-level expression divergence, and the effects of new mutations on gene expression71,104, indicating that the pattern of nucleosome occupancy and stability at the promoter could play an important role in shaping evolutionary trajectories72.
Direct evidence of changes in chromatin structure contributing to the evolution of gene expression remains scarce, but is starting to accumulate. For example, in flies, combining information about chromatin accessibility and TFBSs explained expression divergence between Drosophila melanogaster and Drosophila virilis better than considering TFBSs alone105. In yeast, divergent chromatin structure has also been shown to correlate with divergent gene expression106,107, but most differences in nucleosome positioning between species are outside of regulatory regions and do not correlate with expression divergence108. However, in at least some cases, changes in chromatin structure seem to have been offset by compensatory changes in TFBSs exposed by the change in nucleosome position108,109.
Post-transcriptional sources of cis-regulatory variation
Although core promoters, enhancers, and chromatin accessibility are the most often discussed sources of cis-regulatory variation, they are not the only ways by which allele-specific variation in gene expression arises110. For example, variation in splice sites can have allele-specific effects on splicing of mRNA111–114; variation in polyadenylation signals can alter mRNA stability, translation, and location within the cell115; and variation in the 3’ UTR can affect mRNA degradation rates116 as well as regulation by microRNAs117. Sequence variation within the mRNA can also affect ribosome occupancy and translation efficiency118. Future work focusing on these post-transcriptional mechanisms is needed to more fully evaluate their relative contributions to regulatory evolution.
Mechanisms generating trans-regulatory variation
Whereas cis-regulatory variants tend to lie near the affected gene, trans-regulatory variants affecting a gene’s expression can be located virtually anywhere in the genome. These potential sites of trans-regulatory variants include both coding and non-coding sequences that affect expression or activity of gene products that regulate the focal gene’s expression either directly (by binding to its cis-acting sequences) or indirectly (by influencing the activity of direct regulators)58 (Figure 3). This large potential target size for trans-regulatory variants makes it difficult to interrogate them by targeted analysis of candidate regions. Rather, genome-wide mutagenesis and mapping strategies are needed to introduce and characterize trans-regulatory variants, often requiring follow-up experiments to separate the effects of causal variants from linked loci11,47.
Figure 3. Sources of trans-regulatory variation.
Mutations (indicated by lightning bolts) that can affect expression of a gene via diffusible molecules are trans-acting. These mutations can occur in non-coding or coding sequences of transcription factors, cellular sensors, transporters, and other molecules that influence transcription of many genes via effects on the many interconnected cellular networks.
Coding and non-coding sequences
Although the effects of trans-regulatory variants are mediated by diffusible molecules such as RNAs or proteins, studies of regulatory variation segregating in humans suggest that most trans-acting variants are not located within the sequences encoding these molecules114. Instead, in large-scale genome-wide association studies (Box 2), the majority of trans-regulatory variants have been found in non-coding, putatively cis-regulatory sequences controlling the gene’s expression52,114,119,120. By changing expression of the gene they affect in cis, such variants can impact the expression of other genes in trans53,56,114,121. For example, a cis-acting eQTL located near the gene encoding lysozyme (an enzyme that breaks down bacterial cell walls) has been shown to also act as a trans-acting eQTL for expression of other genes in monocytes122. Similarly, a cis-acting eQTL near the transcription factor KLF14, which regulates expression of genes in adipose tissue, explains trans-acting effects observed on expression of other genes123.
Box 2. Using genetic associations to localize cis- and trans-regulatory variants.
Specific genetic changes impacting gene expression can be localized within the genome using genetic mapping approaches with gene expression phenotypes20,21. These strategies rely on statistical associations with the effects of variants in different parts of the genome separated from each other by recombination. This recombination can come from two (or more) parental strains (e.g. P1 & P2 in figure) being crossed in a controlled manner (QTL mapping) to produce heterozygous F1 progeny which are then further crossed to produce a segregant panel. Alternatively, instead of experimentally generating recombinants, and thus capturing allelic and phenotypic variation between two strains, one can instead rely on existing genetic diversity within a population sample and perform a Genome Wide Association Study (GWAS, see figure). In each case, individuals within the segregant panel or population sample are genotyped and phenotyped allowing the detection of statistical associations between genetic variants and quantitative traits (in this case gene expression). Variants with statistically significant effects are called expression quantitative trait loci (eQTL). eQTL studies have been used to provide insight into the relative contributions of cis- and trans-regulatory variants to expression variation by designating each eQTL as (putatively) cis- or trans-acting based on its physical proximity to the gene whose expression it affects. Thus, associated variants proximal to the affected gene are commonly considered cis-eQTL and associated variants outside of a given cis- window are considered trans-eQTL. While this assumption often holds, it is possible for proximal variants to regulate the affected gene through a diffusible product (such as an RNA or protein) and for cis- acting variants to be located in distal enhancers, far from the gene they regulate. 
Because tests for cis-eQTL are typically restricted to variants in a small region of the genome close to the focal gene, and tests for trans-eQTL include all variants outside of this putatively cis-acting region, there is a much larger multiple testing burden, and thus lower statistical power, for identifying trans-eQTL. Despite these limitations, eQTL mapping is currently the best approach available for localizing regulatory variants within the genome.
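The proximity-based designation and the resulting difference in multiple testing burden can be illustrated with a short sketch. The window size, the variant counts, and the use of a Bonferroni correction are all assumptions chosen for simplicity; published eQTL studies use a range of window definitions and typically control the false discovery rate with more sophisticated procedures.

```python
def classify_eqtl(variant_chrom, variant_pos, gene_chrom, gene_tss,
                  cis_window=1_000_000):
    """Label an association as putatively cis (proximal) or trans (distal).

    cis_window is a hypothetical distance threshold in base pairs.
    """
    same_chrom = variant_chrom == gene_chrom
    if same_chrom and abs(variant_pos - gene_tss) <= cis_window:
        return "cis"
    return "trans"

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value cutoff keeping the family-wise error rate at alpha."""
    return alpha / n_tests

# Hypothetical numbers of variants tested for a single focal gene
n_variants_genome = 5_000_000
n_variants_window = 5_000

cis_cutoff = bonferroni_threshold(n_variants_window)
trans_cutoff = bonferroni_threshold(n_variants_genome - n_variants_window)

print(classify_eqtl("chr1", 1_200_000, "chr1", 1_000_000))  # proximal variant
print(classify_eqtl("chr2", 1_200_000, "chr1", 1_000_000))  # distal variant
print(cis_cutoff, trans_cutoff)
```

With these hypothetical counts, a trans association must reach a p-value roughly a thousand times smaller than a cis association to be declared significant, which is one way to see why trans-eQTL of modest effect are harder to detect.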
However, studies of the baker’s yeast S. cerevisiae suggest that this species might have a different distribution of trans-regulatory variants in coding and non-coding sequences. As in humans, hotspot genes with trans-regulatory eQTL affecting expression of many genes are more likely to have a local, putatively cis-acting eQTL than expected by chance47, but the functional trans-regulatory variants mapped and validated in S. cerevisiae so far have primarily, although not exclusively, been in coding regions20,47,56,58,124,125. S. cerevisiae might have a higher proportion of trans-regulatory variants in coding sequences than humans because so much less of its genome is non-coding (27% in S. cerevisiae vs 97% in humans126); however, the higher proportion of coding variants might also be a consequence of often using a lab-adapted strain that carries many variants absent from wild populations127. Determining the true relative contributions of coding and non-coding variants to trans-regulatory variation in yeast (and other species) will require much more extensive mapping and functional testing of variants from natural populations.
If trans-regulatory variants generally do map to non-coding sequences more often than coding sequences, it might be because mutations in non-coding sequences tend to be less pleiotropic. For example, non-coding mutations that affect activity of a tissue-specific enhancer are expected to impact fewer traits than coding mutations altering the same gene’s protein sequence everywhere it is expressed16,76,78,128. Indeed, most trans-acting eQTL in human non-coding sequences seem more likely to affect enhancers than core promoters114, and often have tissue-specific effects53,114. Because mutations that are more pleiotropic are expected to typically be more deleterious than less pleiotropic mutations129, coding mutations might be selected against more strongly than non-coding mutations, reducing their frequency in natural populations. However, this paradigm is challenged by data showing that cis-regulatory sequences are more pleiotropic130, and protein sequences more modular131,132, than generally appreciated. Indeed, a recent study has shown how modularity in the yeast MATalpha2 transcription factor protein facilitated its divergence, which was then followed by changes in cis-regulatory, non-coding sequences of the genes it regulates133.
Transcription factors
Transcription factors (TFs) are proteins that bind to short sequences within cis-acting promoters and enhancers to regulate expression of a gene. They are often considered the most likely source of trans-regulatory variation, especially for hotspot eQTL, because most TFs regulate expression of many target genes134–138. Indeed, TFs do often seem to be responsible for hotspot eQTL in both humans120,139,140 and S. cerevisiae47,141. However, the ability of TFs to affect expression of multiple downstream target genes also results in functional constraint on their variation. Indeed, their protein coding sequences, DNA binding specificities, and general physiological roles are often conserved over long evolutionary timescales142. Despite these general trends of conservation, TFs can and do diverge in function, as changes in TF protein sequences, including those that affect their DNA binding specificity, have been reported for TFs controlling mating type in yeast143,144, flower development and cell division in plants145, and body patterning in insects146,147, among others.
Sources of trans-regulatory variation other than transcription factors
Variants affecting genes not encoding TFs are also important sources of trans-regulatory variation. For example, chromatin regulators can have widespread effects on gene expression148, and an eQTL study in S. cerevisiae suggests that genes encoding these types of proteins harbor trans-acting eQTL affecting expression of many genes149. Functional studies in S. cerevisiae have also demonstrated trans-regulatory effects of variants in co-factors that modulate the activity of TFs150 as well as genes that influence metabolism such as the glucose receptor RGT258 and a membrane protein, SSY1, that senses the concentration of extracellular amino acids151. In mice, trans-eQTL have also been shown to map to genes that do not encode TFs, such as the Slco1a6 gene, in which a genetic variant was shown to alter expression of many genes by altering the transport of bile acids in pancreatic islets152. The diverse sources of trans-regulatory variation illustrated by these and other studies result from the interconnectedness of transcriptional, structural, signaling, and metabolic networks, and underscore the challenge of predicting and identifying trans-regulatory variants with our current understanding of systems biology11. They are also consistent with the proposed ‘omnigenic’ model of heritability, in which every gene expressed has the potential to influence every trait153. Ultimately, more functional tests of candidate trans-regulatory variants will be needed to fully understand the sources of trans-regulatory variation.
Surveying the effects of trans-regulatory mutations
Targeted mutagenesis strategies like those used to elucidate the effects of cis-regulatory mutations cannot be used for unbiased surveys of trans-regulatory mutations because trans-regulatory mutations can be located anywhere in the genome. trans-regulatory mutations are thus best surveyed by introducing mutations randomly throughout the genome and measuring their effects on gene expression. Two general strategies have been used to isolate the mutations needed to characterize the effects of trans-regulatory mutations: mutation accumulation and random mutagenesis (Box 4). Neither of these approaches distinguishes between mutations that act in cis or trans, but the vast majority of randomly introduced mutations affecting expression of a focal gene are expected to act in trans41, suggesting that cis-regulatory mutations captured in these studies are negligible. Indeed, studies of the TDH3 gene in S. cerevisiae have estimated that a random mutation is at least 265 times more likely to affect expression of this gene in trans than in cis154,155.
Box 4. Surveying effects of trans-regulatory mutations.
Because a trans-acting mutation can reside virtually anywhere in the genome, effects of trans-regulatory mutations are most efficiently surveyed by examining the effects of mutations introduced randomly genome-wide. Such mutations are generally collected using one of two strategies: (1) mutation accumulation or (2) random mutagenesis. With either strategy, effects of the mutations captured can be assayed for single genes using reporter genes or for the entire genome using RNA-seq.
Mutation accumulation studies collect spontaneous mutations arising over many generations in the near absence of natural selection156,157. Multiple independent lines are initiated from a single starting population (highly inbred, if not isogenic) and propagated with bottlenecks of 1 asexual or 2 sexual individuals each generation (see figure). These extreme bottlenecks allow selection to remove only lethal or sterile mutations. This strategy captures the full range of spontaneous mutations, but requires many generations of mutation accumulation to capture even a small number of mutations given that per base mutation rates are typically in the range of 10⁻⁸ to 10⁻¹⁰ per generation183. Consequently, mutation accumulation experiments tend to provide only sparse sampling of trans-regulatory mutations affecting expression of any given gene.
By contrast, random mutagenesis can introduce tens to hundreds of new mutations per cell in a single generation184,185. These mutations can be introduced by using chemical mutagenesis, DNA repair deficient strains, or activation of transposons. Mutations introduced by these methods, however, reflect only a subset of the types of mutations that arise spontaneously. For example, ethyl methanesulfonate (EMS), perhaps the most widely-used chemical mutagen, introduces almost exclusively G-to-A and C-to-T transitions186. Random mutagenesis approaches are thus an important complement to, rather than a replacement for, studying the effects of spontaneous mutations.
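The sparse sampling achieved by mutation accumulation follows from simple arithmetic: the expected number of new mutations per line is roughly the per-base mutation rate multiplied by the genome size and the number of generations. The minimal Python sketch below illustrates this under stated assumptions. The function name, the haploid genome size of about 1.2 × 10⁷ bp (roughly that of S. cerevisiae), and the choice of 4,000 generations are illustrative values, not parameters taken from any cited study.

```python
def expected_mutations(per_base_rate, genome_size_bp, generations):
    """Expected number of new mutations accumulated in one line.

    Assumes a constant per-base mutation rate, a haploid genome, and no
    selection (every mutation that arises is retained).
    """
    return per_base_rate * genome_size_bp * generations


# Illustrative (assumed) values: ~12 Mb haploid genome, 4,000 generations.
GENOME_SIZE_BP = 1.2e7
GENERATIONS = 4000

for rate in (1e-8, 1e-9, 1e-10):
    n = expected_mutations(rate, GENOME_SIZE_BP, GENERATIONS)
    print(f"rate={rate:.0e} per base per generation -> ~{n:.1f} mutations per line")
```

Under these assumptions, even thousands of generations yield only a handful to a few hundred mutations per line, and only a fraction of those would be expected to affect expression of any one focal gene.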
Mutation accumulation studies typically summarize the effects of new mutations on gene expression by estimating the mutational variance (Vm), which describes the increase in expression variance caused by new mutations each generation156,157. This parameter has been estimated genome-wide for two Drosophila species158–160, S. cerevisiae70 and the nematode Caenorhabditis elegans161,162. These data suggest that new mutations often have widespread effects on gene expression. For example, a 200 generation mutation accumulation experiment in D. melanogaster examined about 360 mutations in each of 12 independent strains and found that ~39% of genes showed significant expression variance among the mutation accumulation lines158. About one third of the genes in S. cerevisiae were also found to have significant expression variance among 4 independent lines from a mutation accumulation study lasting 4000 generations70. In general, mutation accumulation studies suggest that many mutations affect expression of multiple genes158–161, consistent with them often having trans-regulatory effects.
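One way to think about Vm, sketched here in Python, is as the variance in expression among mutation accumulation lines divided by the number of generations of accumulation and a scaling factor. The estimator below is a simplified illustration, not the procedure used in the cited studies: it assumes that variance among line means grows linearly with time, ignores measurement error, and uses a hypothetical `scale` parameter (2 is the textbook neutral expectation for diploid lines; other values can be supplied for other designs). The expression values are made up.

```python
from statistics import variance


def estimate_vm(line_means, generations, scale=2.0):
    """Rough mutational variance: among-line variance per generation.

    line_means  -- mean expression of one gene in each independent line
    generations -- generations of mutation accumulation
    scale       -- assumed factor relating among-line variance to Vm
                   (2.0 is the textbook neutral expectation for diploid lines)
    """
    if len(line_means) < 2:
        raise ValueError("need at least two lines to estimate a variance")
    among_line_variance = variance(line_means)  # sample variance (n - 1)
    return among_line_variance / (scale * generations)


# Hypothetical log2 expression of one gene in four lines after 200 generations.
example_lines = [10.0, 10.4, 9.6, 10.0]
print(estimate_vm(example_lines, generations=200))
```

With only a few lines, as in several of the studies described here, such an estimate is necessarily imprecise for any single gene.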
Mutagenesis studies that specifically examine a set of mutations affecting expression of a single gene are an important complement to mutation accumulation studies because they provide much deeper sampling of trans-regulatory mutations affecting the gene’s expression. (Mutation accumulation lines generally recover only a few mutations affecting expression of any particular gene.) Thus far, this mutagenesis approach has been used most extensively to study the distribution of mutational effects for trans-regulatory mutations altering expression driven by the promoter of the S. cerevisiae TDH3 gene154,155. These studies have shown, for example, that even though TDH3 is one of the most highly expressed genes in the genome, mutations increasing its expression are at least as common as mutations decreasing its expression. Using this same approach to characterize the effects of thousands of mutations on expression driven by promoters from 9 other S. cerevisiae genes showed how gene-specific distributions of mutational effects can differ in terms of skew, kurtosis, and dispersion, none of which are captured by Vm163.
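Because Vm is a single variance-like number, two genes can share the same variance in mutational effects while differing in the shape of their distributions. The following minimal Python sketch computes moment-based summaries for a set of effects. The function name and the example values are hypothetical, and the estimators shown (population moments, excess kurtosis) are one common choice among several.

```python
def shape_summary(effects):
    """Moment-based summaries of a distribution of mutational effects."""
    n = len(effects)
    mean = sum(effects) / n
    # Central moments (population form, dividing by n).
    m2 = sum((x - mean) ** 2 for x in effects) / n
    m3 = sum((x - mean) ** 3 for x in effects) / n
    m4 = sum((x - mean) ** 4 for x in effects) / n
    if m2 == 0:
        raise ValueError("all effects are identical; shape is undefined")
    return {
        "mean": mean,
        "variance": m2,                       # dispersion
        "skew": m3 / m2 ** 1.5,               # asymmetry about the mean
        "excess_kurtosis": m4 / m2 ** 2 - 3,  # tail weight relative to a normal
    }


# Hypothetical effects for two genes with the same variance (4.0).
symmetric = [-2.0, -2.0, 2.0, 2.0]
right_skewed = [-1.0, -1.0, -1.0, -1.0, 4.0]

print(shape_summary(symmetric))
print(shape_summary(right_skewed))
```

The two toy distributions have identical variance but differ in skew and kurtosis, which is the kind of difference a variance-only summary cannot reveal.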
These more focused studies of predominantly trans-acting mutations affecting expression of a particular gene also allow direct comparisons between the effects of cis- and trans-regulatory mutations affecting expression of the same gene. For example, a study comparing the effects of 235 cis-regulatory mutations in the S. cerevisiae TDH3 promoter to the effects of ~47,000 mutations spread throughout the genome showed that cis-regulatory mutations tended to have larger average effects on expression driven by the TDH3 promoter than trans-regulatory mutations154. These cis-regulatory mutations were also more likely than trans-regulatory mutations to decrease expression of this gene154 and to have dominant effects in diploid cells155,164. To the best of our knowledge, TDH3 is the only gene for which such comparable information on cis- and trans-regulatory mutations currently exists; however, if other genes show similar trends, these differences between cis- and trans-regulatory mutations, combined with the expected differences in pleiotropy described above, might explain the unequal contributions of cis- and trans-regulatory variants to the evolution of gene expression.
Mechanisms of evolutionary change
Understanding how new mutations generate variation in gene expression is critical for understanding how gene expression evolves because it allows us to predict how much variation in gene expression we should see after different amounts of evolutionary time due to neutral processes alone. That is, when a gene’s expression is evolving neutrally, mutations introduce new variants that can affect its expression and genetic drift fixes and eliminates these variants by chance, effectively sampling randomly from the distribution of mutational effects. However, when natural selection is acting on a gene’s expression, some regulatory variants are more likely to fix or be eliminated than others based on their effects, causing the distribution of mutational effects to differ from the distribution of effects observed for polymorphisms segregating within a species or divergent sites that differ between species (Figure 4). Comparing the effects of mutations to the effects of polymorphic and/or divergent sites is thus a powerful way to infer the effects of natural selection165. This general strategy has been used to infer the role of selection in generating variation in gene expression within and between species, first using mutational effects inferred from mutation accumulation studies158,161 and more recently using mutational effects derived from studies interrogating cis- and trans-regulatory mutations affecting expression of a particular gene more deeply50,88,166.
Figure 4: Using mutational effects to infer the action of natural selection.
Distinguishing between neutral and adaptive explanations for gene expression variation can be achieved by contrasting the effects of mutations (red, which shows the amount of expression variation expected to result from the accumulation of mutations in the absence of selection) and polymorphisms (blue, which shows expression variation affected by both neutral processes and selection). Dashed lines represent an effect size of zero (that is, no change in expression). If a gene’s expression is evolving neutrally (left panels), the effects of polymorphisms are expected to be consistent with a random sampling of effects from the mutational distribution: there should be no statistically significant difference between the distributions of effects for mutations (red) and polymorphisms (blue). By contrast, if expression of a gene is under stabilizing or directional selection, for example, the distribution of effects for polymorphisms will have lower variance than the distribution of mutational effects. The example shown here (right panels) is consistent with stabilizing selection, which maintains expression at its current level (that is, selection disfavors variants that either decrease or increase expression). Directional selection would also shift the mean effect of polymorphisms to higher or lower expression than the mean effect of mutations.
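The comparison described in this figure can be sketched as a permutation test on the difference in variance between mutational effects and polymorphism effects. This Python example is a minimal illustration under stated assumptions, not the statistical approach of any cited study: the function name, the one-sided variance statistic, the number of permutations, and the toy effect sizes are all hypothetical choices.

```python
import random
from statistics import variance


def variance_permutation_test(mutation_effects, polymorphism_effects,
                              n_permutations=2000, seed=0):
    """One-sided permutation test: is the variance of polymorphism effects
    lower than the variance of mutational effects?

    Returns (observed_difference, p_value), where the difference is
    var(mutations) - var(polymorphisms).
    """
    rng = random.Random(seed)
    observed = variance(mutation_effects) - variance(polymorphism_effects)
    pooled = list(mutation_effects) + list(polymorphism_effects)
    n_mut = len(mutation_effects)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = variance(pooled[:n_mut]) - variance(pooled[n_mut:])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the p-value strictly above zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Toy data: broad mutational effects, narrow polymorphism effects.
mutations = [i / 10 for i in range(-10, 11)]
polymorphisms = [i / 100 for i in range(-10, 11)]
diff, p = variance_permutation_test(mutations, polymorphisms)
print(f"variance difference = {diff:.3f}, p = {p:.4f}")
```

A small p-value indicates that polymorphisms are less variable than expected from random sampling of the mutational distribution. Detecting a shift in the mean, as expected under directional selection, would need a separate statistic.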
As described above, mutation accumulation studies typically measure the effects of mutations on gene expression in terms of mutational variance, Vm. For Drosophila spp., this estimate of how expression variance increases each generation was used to calculate the variance in gene expression expected to evolve under mutation-drift equilibrium for three pairs of Drosophila species158. Comparing the observed expression differences between these three pairs of species to this neutral expectation showed the expression divergence was substantially lower than predicted by the neutral model, suggesting that stabilizing selection had acted to reduce variation in gene expression levels158. A study comparing Vm for D. melanogaster to expression variation among strains of D. melanogaster reached the same conclusion160. Similarly, Vm estimated from four mutation accumulation lines of C. elegans maintained for 280 generations predicted more expression variance than was observed among five recently isolated lines of C. elegans separated by many thousands of generations161. These findings, combined with other types of analyses, have led to the prevailing view that stabilizing selection typically constrains variation in gene expression on a genomic scale17,167.
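A back-of-the-envelope version of this comparison can be written in Python. The sketch assumes a textbook neutral expectation in which the variance among lineage means grows as 2·Vm·t after t generations of divergence. The cited studies may have used more elaborate models, and the function names and all numerical values below are hypothetical.

```python
def neutral_divergence_variance(vm, generations, scale=2.0):
    """Expected variance among lineage means under mutation and drift alone.

    Assumes variance accumulates linearly as scale * Vm * t, with scale = 2
    (a textbook neutral expectation; an assumption here).
    """
    return scale * vm * generations


def observed_to_neutral_ratio(observed_variance, vm, generations):
    """A ratio well below 1 is the pattern interpreted as stabilizing selection."""
    return observed_variance / neutral_divergence_variance(vm, generations)


# Hypothetical values for one gene.
vm = 1e-5        # mutational variance per generation
t = 1_000_000    # generations of divergence
observed = 0.5   # observed variance among lineage means

print(neutral_divergence_variance(vm, t))          # expected under neutrality
print(observed_to_neutral_ratio(observed, vm, t))  # observed relative to neutral
```

In this toy case the observed variance is a small fraction of the neutral expectation, which mirrors the qualitative result reported for Drosophila and C. elegans.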
Gene-specific distributions of mutational effects are beginning to refine these analyses, allowing more specific questions to be addressed about the impact of selection on variation in gene expression. For example, effects of mutations in two human enhancers and one mouse enhancer assayed in mice85 were used to predict the effects of divergent sites in other rodent and primate lineages, showing evidence of different types of selection acting on each enhancer168. More direct comparisons between the effects of mutations and polymorphisms assayed in their native species have been performed for the S. cerevisiae TDH3 gene. Specifically, effects of cis-regulatory mutations in the TDH3 promoter on both gene expression level and gene expression noise were compared to the effects of polymorphisms in the TDH3 promoter observed among 85 strains of S. cerevisiae88. These data showed no evidence of selection acting on mean expression level, but did show evidence of stabilizing selection constraining expression noise88. Comparing the effects of these cis-regulatory mutations and polymorphisms in multiple environments also showed evidence of stabilizing selection acting to maintain a particular degree of expression plasticity for TDH3166. Finally, evidence of stabilizing selection was also seen when the effects of trans-regulatory mutations determined using mutagenesis were compared to the effects of polymorphisms affecting TDH3 expression inferred from eQTL mapping50.
Future directions
Molecular biology explains how new mutations give rise to variation in gene expression whereas population genetics explains how these new mutations might contribute to evolutionary divergence once they arise. We believe that both perspectives must be considered together to understand why we see the expression variation we see in the wild. Moving forward, we think it is important for the field to grow in at least three critical directions.
First, we think that more gene-specific distributions of mutational effects are needed for cis- and trans-regulatory mutations. Such work is required because new mutations are expected to have effects on gene expression that vary from gene to gene and between cis- and trans-acting mutations, but we have only begun to discover the range of these differences and do not yet know which properties of mutational effects are most important for accurately predicting polymorphism and divergence. New techniques such as saturation mutagenesis of regulatory elements169 and massively parallel genome editing to functionally validate trans-regulatory variants170 are making collection of such data more feasible at the scales necessary to answer these questions.
Despite these advances, it will likely never be practical to survey all genes and regulatory elements in all species. Consequently, the second critical direction is to understand how properties of regulatory networks shape distributions of mutational effects. We anticipate that such properties exist because the effects of new mutations on gene expression are determined by how they impact the structure and function of regulatory networks9. Indeed, a study comparing patterns of expression polymorphism and divergence to regulatory network structure in Drosophila spp. found that genes regulated by a greater number of transcription factors were less likely to show variation in expression within and between species, presumably because the coordinate control of gene expression by sets of regulators tends to buffer the effects of mutations impacting activity of individual regulators171. This pattern might not be general though, as it was not observed among yeast species172 and no relationship was detected between loci harboring eQTL hotspots and network connectivity in S. cerevisiae47. Many questions remain, however, about the form and function of regulatory networks that might obscure these relationships11. The context-dependency of regulatory networks further adds to this challenge, as regulatory networks are expected to differ between cell types, genetic backgrounds, sexes, and environments. Yet here too, technical advances such as single-cell RNA-sequencing hold great promise for elucidating temporal and tissue-specific regulatory networks, and how they are impacted by new mutations173.
Once the effects of new mutations on gene expression are known or can be predicted, a third challenge is linking the changes in gene expression caused by these mutations to fitness and using the existing theoretical framework of population genetics to predict the evolutionary fate of different types of regulatory mutations. Fitness curves describing the relationship between expression of a gene and relative fitness are available for a few genes in S. cerevisiae174–176, but remain unknown for most genes in most species. Filling this knowledge gap will require more efficient ways to both modify gene expression and quantify fitness in many species. Despite this challenge, such data are key for connecting the too often disparate fields of molecular and evolutionary biology, which is essential for understanding the biological world as it exists now and how it is most likely to be in the future.
Box 1. Using allele-specific expression to disentangle cis- and trans-regulatory variation.
By definition, cis-regulatory variants have an allele-specific effect on gene expression, with a cis-regulatory variant altering expression of only the transcribed sequence located on the same chromosome. Consequently, when expression of two alleles of the same gene is compared in a single trans-regulatory environment, as is the case for two alleles within an F1 hybrid, differences in the abundance of RNA transcripts produced from the two alleles capture their relative cis-regulatory activity177. Comparing this relative cis-regulatory activity in F1 hybrids to the relative expression of the same alleles in the parental genotypes (P1 & P2) crossed to produce the F1 hybrid allows the effects of trans-regulatory variation to also be inferred19. Thus, using this approach (see figure), cis effects are detected when there is a significant difference in expression between the two alleles in the F1 hybrid (quantity H1), and trans effects are detected when the ratios of allelic expression in the parental (P1) and hybrid strains (H1) differ (P1 ≠ H1). With the advent of RNA-seq, allele-specific expression can be quantified genome-wide and the relative contribution of cis- and trans-regulatory variation to differences in gene expression assessed on a gene-by-gene basis. This general strategy can be used to characterize regulatory variation both within and between species, as long as there is allelic variation and the two parental genotypes can produce viable F1 hybrids.
The most significant limitation of this approach is that it is blind to the identity and genomic location of the cis- and trans-regulatory variants causing the observed regulatory effects. In addition, tests for cis-regulatory variation are typically more highly powered than tests for trans-regulatory variation because the former relies only on the measurements of allele-specific expression in the F1 hybrids whereas the latter compares this expression ratio in the hybrids to that between the parental genotypes. Thus, the number of parameters that can vary across biological replicates is higher when testing the effects of trans- than cis-regulatory variation. Care must also be taken to ensure independent estimates of the effects of cis- and trans-regulatory variation when testing for evidence of compensatory evolution28,178.
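The logic of Box 1 can be summarized with log ratios: the allelic ratio in the F1 hybrid estimates the cis effect, and the difference between the parental ratio and the hybrid ratio estimates the trans effect. The Python sketch below computes these point estimates for one gene. It is a simplified illustration: the function and variable names are hypothetical, read counts are assumed to be already normalized and non-zero, and the statistical tests on biological replicates that a real analysis would require are omitted.

```python
from math import log2


def cis_trans_effects(parent1, parent2, hybrid_allele1, hybrid_allele2):
    """Point estimates of cis and trans effects for one gene (log2 scale).

    parent1, parent2               -- expression of the gene in each parent
    hybrid_allele1, hybrid_allele2 -- allele-specific expression in the F1
    """
    total = log2(parent1 / parent2)               # total expression difference
    cis = log2(hybrid_allele1 / hybrid_allele2)   # shared trans environment
    trans = total - cis                           # remainder attributed to trans
    return {"total": total, "cis": cis, "trans": trans}


# Hypothetical normalized counts: 4-fold parental difference, 2-fold in the F1.
print(cis_trans_effects(400, 100, 200, 100))
# -> {'total': 2.0, 'cis': 1.0, 'trans': 1.0}
```

Because the trans estimate is computed from both the parental and the hybrid ratios, it inherits measurement error from both, which is consistent with the lower power of tests for trans-regulatory variation noted above.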
We thank members of the Wittkopp laboratory for helpful discussions during the drafting of the manuscript. Support for this work was provided by the John Simon Guggenheim Memorial Foundation, Alexander von Humboldt Foundation, National Science Foundation (DEB-1911322), and National Institutes of Health (R35GM118073) to P.J.W. and the National Institutes of Health Training Grant T32GM007544 to P.V.Z.
- Pleiotropy
The phenomenon whereby a single genetic variant affects multiple independent traits.
- Genetic drift
Variation in allele frequencies caused by random sampling of individuals.
- Bulk segregant analysis
A technique used to associate genetic markers with trait variation by contrasting allele frequencies between two groups of individuals defined by differences in trait values.
- TATA box
An element of some promoter sequences that serves as a binding site for certain general transcription factors and is rich in T/A nucleotides.
- Core promoter element
Functional sequences proximal to the transcription start site that are sufficient to initiate transcription.
- CpG island
A region of the genome containing a large number of CpG dinucleotide repeats, found in the promoters of many mammalian genes.
- Initiator region
An element of core promoter sequences downstream of the TATA box which overlaps with the transcription start site.
- Gene expression noise
The variability of expression level among genetically identical cells in the same environment.
- Skew
A measure of the asymmetry of a distribution about its mean.
- Kurtosis
A measure of how much weight is concentrated in the tails of a distribution, relative to its center.
- Dispersion
The extent to which a set of values is clustered or dispersed, often measured by the variance or standard deviation of a distribution.
References cited
