Molecular Biology and Evolution. 2017 May 29;34(9):2125–2139. doi: 10.1093/molbev/msx171

Intra and Interspecific Variations of Gene Expression Levels in Yeast Are Largely Neutral: (Nei Lecture, SMBE 2016, Gold Coast)

Jian-Rong Yang 1, Calum J Maclean 1, Chungoo Park 1,‡, Huabin Zhao 1,§, Jianzhi Zhang 1,*
PMCID: PMC5850415  PMID: 28575451

Abstract

It is commonly, although not universally, accepted that most intra and interspecific genome sequence variations are more or less neutral, whereas a large fraction of organism-level phenotypic variations are adaptive. Gene expression levels are molecular phenotypes that bridge the gap between genotypes and corresponding organism-level phenotypes. Yet, it is unknown whether natural variations in gene expression levels are mostly neutral or adaptive. Here we address this fundamental question by genome-wide profiling and comparison of gene expression levels in nine yeast strains belonging to three closely related Saccharomyces species and originating from five different ecological environments. We find that the transcriptome-based clustering of the nine strains approximates the genome sequence-based phylogeny irrespective of their ecological environments. Remarkably, only ∼0.5% of genes exhibit similar expression levels among strains from a common ecological environment, no greater than that among strains with comparable phylogenetic relationships but different environments. These and other observations strongly suggest that most intra and interspecific variations in yeast gene expression levels result from the accumulation of random mutations rather than environmental adaptations. This finding has profound implications for understanding the driving force of gene expression evolution, genetic basis of phenotypic adaptation, and general role of stochasticity in evolution.

Keywords: evolution, Saccharomyces, adaptation, genetic drift, transcriptome

Introduction

Evolutionary biology began from studies of organism-level phenotypes such as morphological, physiological, and behavioral traits. Darwin proposed that these variations, be they intra or interspecific, can primarily be explained by the adaptation of organisms to their respective environments (Darwin 1859), a view largely shared by modern biologists (Endler 1986; Mayr 2001; Futuyma 2013) (but see Nei 2013). However, at the genotype level, molecular evolutionists generally agree that most intra and interspecific variations in DNA sequences are more or less neutral (Kimura 1968; Nei 1987; Lynch 2007). This contrast between genotypic and phenotypic evolution is not logically inconsistent, because a genotypic change need not result in a phenotypic change or one with an appreciable fitness effect, even though a stably inherited phenotypic difference always requires a genotypic change.

If a genotypic change has a potential phenotypic effect, the realization of this potential usually requires gene expression, regardless of whether the genotypic change occurs in a coding region or a noncoding regulatory region. In other words, gene expression is usually the necessary bridge between genotypes and their corresponding organismal phenotypes. In this context, one ponders whether gene expression level, a molecular phenotype, is more like (organismal) phenotypes or (molecular) genotypes in its evolutionary pattern and mechanism. Specifically, we ask whether variations in gene expression levels within and between species result largely from neutral or adaptive evolution. Unfortunately, there is no unequivocal theoretical answer to this question, because an expression-level change may or may not appreciably impact the organismal phenotype and fitness. The available empirical data in the literature do not provide a clear answer either. For instance, although many presumably adaptive morphological variations in nature have been found to be caused by gene expression changes (Carroll 2008; Stern and Orgogozo 2008), this fact at most suggests that a large fraction of morphological adaptations are due to expression changes, but not that a large fraction of expression differences are adaptive. A neutral model of transcriptome evolution was previously proposed, on the basis of, among other things, an approximately constant rate of transcriptome evolution (Khaitovich et al. 2004). This observation could, however, be an artifact of using a human microarray to measure gene expression levels in other primates (Gilad et al. 2005). Furthermore, due to the lack of specific predictions of the adaptive hypothesis in the primate study, adaptation and neutrality could not be unambiguously distinguished.

To address whether variations in gene expression levels within and between species are largely neutral or adaptive, we compared the transcriptomes of nine yeast strains isolated from five different ecological environments. These strains belong to three closely related species separated from one another ≤20 million years ago (Dujon 2006). Importantly, we selected strains such that their genomic phylogenetic relationships mismatch their relationships of environmental origins. That is, some strains are phylogenetically relatively distant from one another but have similar ecological environments, whereas others are phylogenetically relatively close to one another but live in ecologically distinct environments. If gene expression variations among these strains result from the accumulation of neutral mutations, the relationships of their transcriptomes should mimic the genome-based phylogenetic tree. On the contrary, if the expression variations among these strains are largely shaped by adaptations to their respective environments, their transcriptomes should cluster according to their ecological environments, at least under simple models (see Discussion for justification). Thus, our design allows a distinction between the neutral and adaptive hypotheses. We quantified the yeast transcriptomes by RNA sequencing (RNA-seq) and discovered that these transcriptomes cluster more or less according to the genome-based phylogeny rather than their ecological environments. Only a tiny fraction of yeast genes exhibit similar expression levels among strains with a common ecological environment, no greater than that among strains with comparable phylogenetic relationships but different environments. Our findings strongly suggest that the vast majority of yeast gene expression variations result from neutral rather than adaptive evolution.

Results

The Transcriptome Tree Approximates the Genome Tree

To study the driving force of gene expression evolution, we chose nine yeast strains belonging to three Saccharomyces species (sister species S. cerevisiae and S. paradoxus, plus their outgroup S. mikatae) and originating from five different ecological environments (fig. 1A; see also supplementary table S1, Supplementary Material online). The natural habitat of yeast is thought to be the sap and bark of oak trees and adjacent soils (Sniegowski et al. 2002). Five of our nine strains were collected from this ecological environment and are referred to as wild strains. They include two S. cerevisiae strains, two S. paradoxus strains, and one S. mikatae strain. Four additional S. cerevisiae strains were, respectively, isolated from four other ecological environments and are referred to as nonwild strains.

Fig. 1.


Phylogenetic trees of the nine Saccharomyces yeast strains constructed using genome sequence, gene expression, and morphology data, respectively. The three species are indicated by different colors, while the ecological environments where the strains were isolated are shown by different symbols. (A) The genome tree of the nine strains based on the alignment of the coding sequences of 4,325 genes. Bootstrap percentages estimated from 1,000 replications are shown on interior branches. Asterisks indicate > 99.5% bootstrap support. The scale bar shows 0.01 nucleotide substitutions per site. (B–D) The transcriptome tree of the nine strains based on standardized Euclidian distances in gene expression levels of all 4,325 genes (B), the 75% most highly expressed genes (C), and the 50% most highly expressed genes (D). Bootstrap percentages estimated from 10,000 replications are shown on interior branches. Asterisks indicate > 99.5% bootstrap support. The scale bar shows 0.1 unit of the standardized Euclidian distance per gene. (E) The morphology tree of nine strains based on standardized Euclidian distances in 219 morphological traits. Strains IFO1804, RM11, CLIB219, Y12, and YPS163 are used as proxies of N44, BC187, DBVPG6040, Y9, and YPS606, respectively. Bootstrap percentages estimated from 10,000 replications are shown on interior branches. Asterisks indicate > 99.5% bootstrap support. The scale bar shows 0.1 unit of the standardized Euclidian distance per trait. (F) Frequency distributions of topological distances (dT) between the genome tree and random tree topologies (grey), bootstrapped transcriptome trees with all genes (brown), bootstrapped transcriptome trees with the 75% most highly expressed genes (dark purple), bootstrapped transcriptome trees with the 50% most highly expressed genes (light purple), and bootstrapped morphology trees (blue), respectively. Each distribution is based on 10,000 random trees or bootstrapped trees.
Arrows indicate the observed dT between the genome tree and various other trees based on the original (rather than bootstrapped) data. P value shows the probability with which the dT between the genome tree and a random tree topology is equal to or smaller than the observed dT between the genome tree and the tree being compared. (G) Frequency distributions of topological distances (dT) between the potential environment tree and bootstrapped genome trees (yellow), bootstrapped transcriptome trees with all genes (brown), bootstrapped transcriptome trees with the 75% most highly expressed genes (dark purple), bootstrapped transcriptome trees with the 50% most highly expressed genes (light purple), gene expression trees based on 533 individual GO categories (dark green), and bootstrapped morphology trees (blue), respectively, as well as frequency distributions of dT between three control environment trees and 533 GO-based gene expression trees, respectively (light green). The dT between a tree and the potential environment tree is defined by the minimal topological distance between the tree and any tree containing a monophyly of the five wild strains. In the three control environment trees, one or both S. cerevisiae wild strains in the aforementioned monophyly are swapped with their sister strains in the genome tree. Each distribution except for the bootstrapped genome trees (1,000 replications) and GO-based trees (533 GO categories) is derived from 10,000 bootstrapped trees. Arrows indicate the observed dT between the potential environment tree and various other trees based on the original (rather than bootstrapped) data. The P value is from a Z-test of the null hypothesis that the mean dT between 10,000 bootstrapped morphology trees and the potential environment tree is equal to or larger than that between the bootstrapped trees being compared and the potential environment tree.

On the basis of the available genome sequences of these nine strains (see Materials and Methods), we built a multiple sequence alignment of the coding sequences of each of 4,325 genes that have fully sequenced and reliable one-to-one orthologs across the nine strains. We concatenated the aligned sequences after removing all gaps and used them to reconstruct a neighbor-joining (NJ) tree (fig. 1A), which will be referred to as the genome tree. This tree is highly resolved, with all nodes having > 99.5% bootstrap support. As expected, the nine strains are clustered in the genome tree by species identity rather than ecological environment. Furthermore, within the S. cerevisiae clade, the two wild strains each have a nonwild sister strain. If gene expression variations among the nine strains are in large part due to the accumulation of neutral mutations, the transcriptomes of these strains should cluster in a tree similar to the genome tree. If gene expression variations among the strains are largely caused by environmental adaptations, the five wild strains should cluster together in the transcriptome tree, in contrast to the genome tree.

To distinguish between the above two hypotheses, we quantified the genome-wide gene expression levels of the nine strains in the same growth medium and growth phase such that the revealed expression variations reflect genetic differences rather than phenotypic plasticity. The synthetic medium used mimics oak exudate (Murphy et al. 2006), rendering potential expression adaptations of the wild strains readily detectable. We used diploid strains, as yeast is naturally homothallic (Sniegowski et al. 2002; Johnson et al. 2004; Wang et al. 2012) and therefore usually diploid because gametes can switch mating type after dividing and mate with their daughter cells. RNAs were extracted from one exponentially dividing culture of each of the nine strains and a replicate culture of the S. cerevisiae wild strain YPS606 during exponential growth, and the standard Illumina-based RNA-seq was performed. A total of ∼744 million 52-nt single-end reads were obtained (supplementary table S1, Supplementary Material online); these reads were mapped to their respective genomes and used to estimate gene expression levels. Our experiments generated highly reproducible results, because gene expression levels quantified in the two biological replicates of strain YPS606 have a Pearson correlation coefficient of 0.984 (P < 10⁻³⁰⁰; supplementary fig. S1, Supplementary Material online). These two replicates were subsequently pooled in our analysis unless otherwise noted.

Our transcriptome analysis focused on the same 4,325 genes used to reconstruct the genome tree, each of which contains at least one read in at least one of the nine strains. Using the gene expression levels measured by RPKM (reads per kilobase per million mapped reads), we calculated the standardized Euclidian distance per gene between each pair of the nine strains (see Materials and Methods) and then built an NJ tree, referred to as the transcriptome tree (fig. 1B). The overall topology of the transcriptome tree is similar but not identical to that of the genome tree. Importantly, we found the nine strains to cluster in the transcriptome tree by species identity rather than ecological environment. Although the transcriptome tree shows a different topology from that of the genome tree for the six S. cerevisiae strains, the two S. cerevisiae wild strains are not clustered in the transcriptome tree (fig. 1B). Almost identical tree topologies were obtained when TPM (transcripts per million) (Wagner et al. 2012) or quantile-normalized gene expression levels (Ritchie et al. 2015) were used (supplementary fig. S2A and B, Supplementary Material online). Thus, the intraspecific topological differences between the transcriptome tree and genome tree do not appear to reflect environmental adaptations in gene expression evolution. When we separately analyzed the RNA-seq data from the two biological replicates of the YPS606 strain, the two replicates cluster in the transcriptome tree, as expected (supplementary fig. S2C–E, Supplementary Material online).
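To make the distance calculation concrete, the following minimal Python sketch computes RPKM from read counts and then a standardized distance between strains. The exact formula is given in Materials and Methods and is not reproduced here; this sketch assumes that each gene is scaled to zero mean and unit standard deviation across strains and that the "per gene" distance is the root-mean-square difference of the scaled values, which for a single gene reduces to the absolute difference in standardized expression. All function names and the toy data are hypothetical.

```python
import math

def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

def standardize(values):
    """Scale one gene's expression levels across strains to mean 0 and SD 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if sd == 0:
        return [0.0] * n  # a gene with no variation carries no signal
    return [(v - mean) / sd for v in values]

def standardized_distance_matrix(expr):
    """expr maps strain -> list of expression levels (same gene order for all).

    Returns a dict mapping (strain_a, strain_b) -> distance per gene.
    ASSUMPTION: distance = sqrt(sum of squared z-score differences / n_genes).
    """
    strains = sorted(expr)
    n_genes = len(expr[strains[0]])
    z = {s: [] for s in strains}
    for g in range(n_genes):
        scaled = standardize([expr[s][g] for s in strains])
        for s, value in zip(strains, scaled):
            z[s].append(value)
    dist = {}
    for i, a in enumerate(strains):
        for b in strains[i + 1:]:
            sq = sum((x - y) ** 2 for x, y in zip(z[a], z[b]))
            dist[(a, b)] = math.sqrt(sq / n_genes)
    return dist

# Toy data: three hypothetical strains, two genes.
toy_expr = {"A": [1.0, 10.0], "B": [2.0, 20.0], "C": [3.0, 30.0]}
toy_dist = standardized_distance_matrix(toy_expr)
```

The resulting pairwise distances could then be passed to any neighbor-joining implementation to obtain a tree analogous to the transcriptome tree.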

Despite the high sequencing depths of our RNA-seq experiments, expression level estimates of lowly expressed genes are still less reliable than those of other genes. Because the standardized Euclidian distance gives equal weights to all genes, the inclusion of lowly expressed genes in the analysis increases the sampling error of the transcriptome tree. We thus repeated the above analysis using only the 75% (fig. 1C) or 50% (fig. 1D) most highly expressed genes after ranking all genes by their mean expression levels across all strains. Interestingly, using these relatively highly expressed genes renders the transcriptome trees even more similar to the genome tree. Specifically, they recovered the genome-tree-based relationships among some S. cerevisiae strains. This is not unexpected, given that 1) the differences between the transcriptome tree made using all genes and the genome tree seem random and 2) excluding lowly expressed genes should increase the signal to noise ratio for making the transcriptome tree.

To quantify the topological differences between the transcriptome tree made using all genes and the genome tree, we measured their topological distance (dT). The dT between two unrooted trees is twice the number of interior branches at which taxon partition is different between the two trees compared (see Materials and Methods). We found that dT = 8 between the genome tree and transcriptome tree. In comparison, we generated 10,000 random tree topologies among the nine strains and calculated their dT from the genome tree. We found that only 4.2% of these random topologies have a dT ≤ 8 (fig. 1F), suggesting that the small dT between the transcriptome tree and genome tree is unlikely to have been caused by chance. The dT values from the genome tree reduce to 6 and 4 for the transcriptome trees built using the 75% and 50% most highly expressed genes, respectively, and these dT values are again significantly smaller than expected by chance (P = 0.006 and 0.001, respectively; fig. 1F).
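As an illustration of the topological distance, the following minimal Python sketch computes dT for two unrooted trees written as nested tuples. Each interior branch is represented by the taxon partition it induces, and dT is the number of partitions found in one tree but not the other, which equals twice the number of differing interior branches when both trees are fully bifurcating. The tuple encoding, the function names, and the five toy taxa are assumptions made for illustration and are not the software used in this study.

```python
def leaf_set(tree):
    """Return all leaf names under a node (a leaf is a string, a node a tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Collect the nontrivial taxon partitions induced by interior branches.

    Each partition is stored as the side that excludes a fixed reference
    leaf, so the result does not depend on where the tree is rooted.
    """
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if ref in below else below
        # Skip terminal branches and the root itself (trivial partitions).
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found

def topological_distance(tree_a, tree_b):
    """dT: number of partitions present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))

# Toy example with five hypothetical taxa.
t1 = (("A", "B"), "C", ("D", "E"))
t2 = (("A", "C"), "B", ("D", "E"))
t1_rerooted = ("A", ("B", ("C", ("D", "E"))))

d_12 = topological_distance(t1, t2)           # one interior branch differs
d_11 = topological_distance(t1, t1_rerooted)  # same unrooted topology
```

Here `t1` and `t2` differ at one interior branch, so dT = 2, whereas re-rooting `t1` leaves dT = 0. A null distribution like the one described above could be approximated by applying the same function to many randomly generated topologies.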

The exact topology of the environment tree that describes the relative similarities among the ecological environments of the nine strains is unknown, except that the tree should contain a monophyly of the five wild strains. We thus defined an environment tree set by all trees that satisfy the above condition. The dT between the genome tree and the potential environment tree was defined by the smallest topological distance between the genome tree and any tree in the environment tree set. We similarly defined the dT between the transcriptome tree and the potential environment tree. We found no difference between these two dT values (fig. 1G), indicating that the transcriptome tree is not closer than the genome tree to the potential environment tree. The same is true when only the 75% or 50% most strongly expressed genes are considered (fig. 1G).

While the above analyses found no evidence for the adaptive hypothesis of transcriptome evolution, it is important to confirm that our experimental design is able to detect adaptations if they exist. In this context, it is worth mentioning a yeast phenome study, which measured three growth characteristics in 200 different conditions for a number of strains and then clustered the strains by similarity in their growth characteristics (Warringer et al. 2011). Four of the five wild strains studied here (except DBVPG1788) form a monophyly in the growth traits-based tree (fig. 2 in Warringer et al. 2011), suggesting that phylogenetic clustering is able to detect at least some potential adaptations. As a further verification, we analyzed 219 morphological traits previously measured from fluorescent microscopic images of triple-stained yeast cells (Yvert et al. 2013; Ho et al. 2016). Using this dataset and controlling for mutational size, we recently discovered that morphological traits that are more important to organismal fitness tend to differ more between strains, strongly suggesting that the intra and interspecific morphological variations in this dataset have been shaped by adaptive evolution to a large extent (Ho et al. 2016). We thus subjected the yeast morphological data to the same phylogenetic analysis used for the transcriptome data. However, five of the nine strains with transcriptome data do not have morphological data. We chose five other strains that have morphological data as their proxies. Each of the proxies is ecologically equivalent (with one exception) and, based on the established genome trees (Liti et al. 2009; Maclean et al. 2017), phylogenetically close to the strain being replaced. The exception is the S. cerevisiae fruit juice strain DBVPG6040. Because this strain has no phylogenetically close and ecologically equivalent strain in the set of strains with morphological data, we chose a phylogenetically close wine strain (CLIB219) as its proxy. 
We then calculated standardized Euclidian distances between pairs of the nine strains using all 219 morphological traits and built an NJ tree (fig. 1E). Compared with the transcriptome tree, the morphology tree is more different in overall topology from the genome tree. For instance, morphological clustering of the strains is no longer strictly by species identity, because the six S. cerevisiae strains do not form a monophyletic clade. Furthermore, in contrast to the genome tree and transcriptome tree, the morphology tree unites the two wild S. cerevisiae strains in exclusion of all other strains. We found the dT between the morphology tree and genome tree to be 10, not significantly smaller than that between a random tree and the genome tree (P = 0.262; fig. 1F). This is not caused by the existence of two wine strains in the morphological data, because these two strains are not clustered in the tree. The dT between the morphology tree and potential environment tree is smaller than that between the genome tree and potential environment tree (fig. 1G). To examine if this difference is statistically significant, we generated 10,000 morphology trees and 10,000 genome trees by bootstrapping the 219 morphological traits and the 4325 genes, respectively. The mean dT to the potential environment tree is significantly smaller for the set of morphology bootstrap trees than the set of genome bootstrap trees (P = 0.0005, Z-test; fig. 1G). These observations confirm that our experimental design is able to detect potential signals of environmental adaptation.

Fig. 2.


Principal component analysis of the (A) genome sequences, (B) gene expression levels, and (C) morphological data of the nine yeast strains. The three species are indicated by different colors, while the ecological environments where the strains were isolated are shown by different symbols. Percentage variance explained by a principal component is indicated in the parentheses. In panel A, the inset shows an enlarged view of the boxed area.

We similarly used the bootstrap method with 10,000 replications to examine if the dT between the transcriptome tree and genome tree is significantly smaller than that between the morphology tree and genome tree. While this difference in dT is not statistically significant (P = 0.16; fig. 1F), the difference becomes significant when only the 75% or 50% most highly expressed genes are used in the transcriptome analysis (P = 6 × 10⁻⁴ and 2 × 10⁻⁵, respectively; fig. 1F). By the same approach, we found that the dT between any of the three transcriptome trees and the potential environment tree is significantly larger than that between the morphology tree and potential environment tree (P = 0.04, 0.05, and 0.05, respectively; fig. 1G). These results further support the hypothesis of neutral rather than adaptive evolution of yeast transcriptomes.

It is notable that, while the transcriptome tree resembles the genome tree in topology, the branch lengths of the two trees are not proportional to each other. This is unsurprising because we used the Euclidian distance to measure transcriptome differences, and the Euclidian distance is not known to be proportional to the underlying number of genetic changes or divergence time.

Principal Component Analysis Supports the Phylogenetic Results

In addition to the phylogenetic analysis, we performed a principal component analysis (PCA) of the genome sequence, transcriptome, and morphology data, respectively. For the genome sequence data, the nine strains are clearly separated according to species identity rather than ecological environment in the plane of the first two principal components (fig. 2A). Although to a lesser extent, the same can be said for the transcriptome data (fig. 2B; supplementary figs. S3A and B, Supplementary Material online). By contrast, there is no clear grouping of strains by species identity when the morphological data are analyzed (fig. 2C). For instance, the distances between some pairs of interspecific strains on the PCA plot are smaller than the distances between some pairs of intraspecific strains. Of special interest is that the distance between the two S. cerevisiae wild strains is much smaller than that between each of them and their respective sister strains or their proxies. The contrast between the morphology PCA plot and transcriptome PCA plot is not due to the much larger number of genes/traits in the transcriptome data, compared with that in the morphology data. This is because, even when only 219 randomly picked genes are used, the transcriptome PCA plot still displays a much stronger grouping of strains by species than does the morphology PCA plot (supplementary fig. S3C, Supplementary Material online). In summary, results from the PCA and phylogenetic analysis are consistent, both suggesting that the transcriptome variation among the nine strains is largely neutral.

Expression Variations of Functional Groups of Genes Are Consistent with Neutral Evolution

The above finding that the evolution of the transcriptome as a whole resembles genome sequence evolution suggests that the expression variations of most genes are likely neutral. However, this finding does not rule out the possibility that the expression variations of a minority of genes are caused by environmental adaptations. Specifically, these genes may be enriched in certain functional groups. To investigate this possibility, we downloaded the Gene Ontology (GO)-based gene functional annotations in GOslim (Cherry et al. 2012). For each molecular function, biological process, or cellular component GO category, we constructed an NJ tree using the expression levels of all genes belonging to the GO category as was done for all genes in fig. 1B. Of 533 GO categories examined, only one exhibits a monophyly of the set of all five wild strains. To examine if this observation is explainable by chance alone, we constructed three five-strain control sets by swapping one or both of the two wild S. cerevisiae strains with their respective nonwild sister strains shown in the genome tree. We found 3 (swapping DBVPG1788 with BC187), 1 (swapping YPS606 with Y9), and 1 (swapping both) GO categories for which the three five-strain control sets form a monophyly in the expression tree, respectively. Because the observed number (1) for the all-wild strain set is not larger than those (1–3) for the three control sets, we conclude that genes with adaptive expression variations, if they exist, are not significantly enriched in any GO category, which could happen if the number of such genes is small and/or these genes are more or less evenly distributed among GO categories. We further generated the distribution of dT between the expression tree of a GO category and the potential environment tree using all 533 GO categories (fig. 1G). 
It is clear that there is no enrichment of GOs with small dT in this distribution, compared with the corresponding distributions when one or both of the two wild S. cerevisiae strains are swapped with their respective genomic sister strains in expression trees (fig. 1G). Similar analyses were carried out for groups of genes belonging to the same biochemical pathways and groups of genes having the same deletion phenotypes, according to SGD annotations (Cherry et al. 2012). Again, expression trees of genes based on biochemical pathways or phenotypes are no closer to the potential environment tree than are the three negative controls aforementioned (supplementary fig. S2F, Supplementary Material online).

Expression Variations of Individual Genes Are Consistent with Neutral Evolution

The lack of significant GO enrichment of genes with potential adaptive expression variations prompted us to examine individual genes. For each gene, we calculated the standardized Euclidian distance between each pair of the nine strains, which is simply the absolute value of their standardized expression level difference for the gene (see Materials and Methods). We then used these distances to construct an expression tree by the NJ method. We were interested in expression trees in which the five wild strains form a monophyly, because such trees potentially result from adaptive expression evolution. Only 22 genes met this requirement. A careful examination of their expression variations, however, led to three unexpected observations. First, for each of these 22 genes, all five wild strains showed either higher or lower expressions than all four nonwild strains (fig. 3A). This is unexpected, because the five wild strains could also form a monophyly if they all have similar expression levels that are intermediate to those of nonwild strains, with some nonwild strains having higher expressions and others lower expressions. Second, if the expression variation of a gene is primarily due to environmental adaptations, given the first observation, each of the five wild strains should have the same probability to be the strain with the most dissimilar expression from those of the four nonwild strains. Surprisingly, for only one of the 22 genes is a wild S. cerevisiae strain’s expression most dissimilar from those of the nonwild strains, significantly below the expectation of 22 × (2/5) = 8.8 (P = 2 × 10⁻⁴, binomial test). Because all four nonwild strains belong to S. cerevisiae, the first two observations together strongly suggest that, even for these 22 genes, the vast majority still show a clear signal of expression clustering by species (fig. 3A), which is consistent with neutral evolution. Third, the coefficient of variation (CV) of gene expression among the six S. cerevisiae strains is smaller than the CV among the five wild strains for 15 of the 22 genes. Because the six S. cerevisiae strains are from five different environments whereas the five wild strains are all from one environment, the third observation is unexplainable by the simple environmental adaptation hypothesis but is consistent with the neutral hypothesis. Thus, the expression variations of even these 22 genes are unlikely to have been primarily caused by environmental adaptations.
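The binomial calculation in the second observation can be checked with a short Python sketch. It assumes a one-sided lower-tail test (the probability of observing one or fewer such genes out of 22 when each gene has a 2/5 chance); the sidedness is an assumption, as the text states only that a binomial test was used. The function and variable names are hypothetical.

```python
from math import comb

def binomial_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

n_genes = 22            # genes whose expression tree unites the five wild strains
p_cerevisiae = 2 / 5    # two of the five wild strains are S. cerevisiae
observed = 1            # genes where a wild S. cerevisiae strain is most dissimilar

expected = n_genes * p_cerevisiae
p_value = binomial_lower_tail(observed, n_genes, p_cerevisiae)
```

Under these assumptions, `expected` is 8.8 and `p_value` is approximately 2 × 10⁻⁴, in line with the values reported above.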

Fig. 3.


Little evidence for environmental adaptation from the expression levels of individual genes. (A) Twenty-two genes whose expression levels support a monophyly of the five wild strains. Expression levels of the nine strains for each gene have been scaled to a standard normal distribution for comparison. The three species are indicated by different colors, while the ecological environments where the strains were isolated are shown by different symbols. (B) Number of genes for which the expression tree supports the monophyly of each of the 126 possible five-strain sets. The red dot shows the five-strain set composed of the five wild strains (“Five wild strains”), the three blue symbols show the three control sets in which one (“YPS606←→Y9” and “DBVPG1788←→BC187”) or both (“Swap both”) of the two wild S. cerevisiae strains in the “Five wild strains” set are swapped with their sister nonwild strains, the green circles show all other five-strain sets that include the three non-S. cerevisiae strains (“2 nonwild S.c. + 3 non-S.c.”), and the grey dots show all other five-strain sets (“All others”). (C) Number of genes whose expression tree supports the monophyly of each of the 84 possible six-strain sets. The red dot shows the six-strain set composed of the six S. cerevisiae strains (“Six S.c. strains”), while the grey dots show all other six-strain sets (“All others”). (D) Same as panel B except that coding sequence data instead of gene expression data are used. (E) Same as panel C except that coding sequence data instead of gene expression data are used. (F) Same as panel B except that morphological data instead of gene expression data are used. (G) Same as panel C except that morphological data instead of gene expression data are used.

To investigate the possibility that the clustering of the five wild strains by the expression levels for these 22 genes is by chance, we enumerated all 126 ways that five strains can be chosen from the nine strains. For each set of five strains chosen, we calculated the number of genes whose expression trees show a monophyly of these five strains, and referred to this number as the number of monophyly genes (NMG) for the five-strain set. We ranked NMG for the 126 five-strain sets and found that 100 of the 126 sets have NMG ≥ 22 (fig. 3B). Furthermore, the all-wild set (red dot in fig. 3B) has the second smallest NMG among the 15 five-strain sets that are composed of any two S. cerevisiae strains and the three non-S. cerevisiae strains (green and blue symbols in fig. 3B), suggesting that nonwild S. cerevisiae strains cluster more often with (the wild strains of) the other two species than do the wild S. cerevisiae strains at the level of gene expression. In addition, we found NMG to be smaller for the all-wild set than each of the three control sets mentioned in the previous section (blue symbols in fig. 3B). Because each control set is composed of five strains that have exactly the same phylogenetic positions as the five wild strains in the genome tree but share lower environmental similarities than the five wild strains do, the observation of a smaller NMG for the all-wild set than for each of the control sets indicates that having a common environment does not increase NMG, supporting the neutral hypothesis of expression evolution. Note that the above finding cannot be caused by a potential lack of information in the data, because if we choose six strains from the nine used, the number of genes supporting the monophyly of the six S. cerevisiae strains (red dot in fig. 3C) is the highest among all 84 possible six-strain sets (grey dots in fig. 3C). Apparently, our expression dataset contains rich information, but the information points to a neutral rather than adaptive explanation of gene expression variation.
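The set enumeration behind the NMG comparison can be sketched as follows. This is only an illustration under stated assumptions: the strain labels are placeholders, each gene tree is represented (hypothetically) as a set of tip sets, one per internal branch, and the `nmg` helper is not the authors' code. A strain set is counted as monophyletic in an unrooted tree when some branch separates it from all remaining strains.

```python
from itertools import combinations

# Placeholder labels: two wild and four nonwild S. cerevisiae strains,
# plus the three non-S. cerevisiae strains (all of which are wild).
wild_sc = ["Sc_wild_1", "Sc_wild_2"]
nonwild_sc = ["Sc_nonwild_1", "Sc_nonwild_2", "Sc_nonwild_3", "Sc_nonwild_4"]
non_sc = ["Sp_1", "Sp_2", "Sm_1"]
strains = wild_sc + nonwild_sc + non_sc

five_sets = [frozenset(c) for c in combinations(strains, 5)]
six_sets = [frozenset(c) for c in combinations(strains, 6)]
# Five-strain sets made of any two S. cerevisiae strains plus the three others
two_sc_plus_non_sc = [s for s in five_sets if frozenset(non_sc) <= s]

def nmg(strain_set, gene_trees, all_strains):
    """Count gene trees in which a branch separates strain_set from the rest."""
    target = frozenset(strain_set)
    complement = frozenset(all_strains) - target
    return sum(1 for splits in gene_trees
               if target in splits or complement in splits)

all_wild = frozenset(wild_sc + non_sc)
toy_gene_trees = [
    {all_wild},                      # gene 1: branch isolates the five wild strains
    {frozenset(nonwild_sc)},         # gene 2: same split, stored from the other side
    {frozenset(["Sp_1", "Sp_2"])},   # gene 3: no such branch
]
print(len(five_sets), len(six_sets), len(two_sc_plus_non_sc))
print(nmg(all_wild, toy_gene_trees, strains))
```

The three printed counts correspond to the 126 five-strain sets, the 84 six-strain sets, and the 15 sets containing all three non-S. cerevisiae strains that are referred to in the text.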

For comparison, we repeated the above analyses using the yeast genome and morphological data, respectively. When using individual gene sequences to reconstruct the trees of the nine strains, we found NMG = 33 genes to support the monophyly of the all-wild set (red dot in fig. 3D). Twenty of the 126 possible five-strain sets and 14 of the 15 possible five-strain sets that are composed of the three non-S. cerevisiae strains and any two S. cerevisiae strains (green and blue symbols in fig. 3D) have NMG ≥ 33. NMG for the all-wild set is larger than that for one of the three control sets. As expected, a monophyly of the six S. cerevisiae strains (red dot in fig. 3E) is supported by more genes than any other six-strain set (grey dots in fig. 3E). Thus, results from the gene sequence data are overall similar to those from the gene expression data. If there is any difference, the expression data appear to support the monophyly of the all-wild set relative to other sets even less often than the sequence data.

On the contrary, for the morphological data, 15 of the 219 traits support the monophyly of the all-wild set (red dot in fig. 3F). This fraction (15/219 = 6.8%) is an order of magnitude greater than the corresponding fractions for genome (33/4325 = 0.76%) and gene expression (22/4325 = 0.51%) data ($P = 3 \times 10^{-9}$ and $3 \times 10^{-11}$, respectively, G-test of independence). Only two of the 15 five-strain sets containing all non-S. cerevisiae strains and any two S. cerevisiae strains (green and blue symbols in fig. 3F) and 8 of all 126 five-strain sets are supported by ≥15 traits. Furthermore, the number of traits supporting the all-wild set exceeds that supporting each of the three control sets (blue symbols in fig. 3F). By contrast, the monophyly of all six S. cerevisiae strains, supported by only 27 morphological traits, is no longer the best supported among all six-strain sets.
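As a rough check of the comparison of fractions above, a G-test of independence on a 2 × 2 table can be sketched with the standard library alone. The table layout (supporting versus nonsupporting traits or genes) and the absence of any continuity correction are assumptions made for this sketch; the text does not specify these details.

```python
from math import erfc, log, sqrt

def g_test_2x2(a, b, c, d):
    """G-test of independence for the table [[a, b], [c, d]] (1 degree of freedom)."""
    observed = [[a, b], [c, d]]
    total = a + b + c + d
    row_sums = [a + b, c + d]
    col_sums = [a + c, b + d]
    g = 0.0
    for r in range(2):
        for k in range(2):
            expected = row_sums[r] * col_sums[k] / total
            if observed[r][k] > 0:
                g += 2.0 * observed[r][k] * log(observed[r][k] / expected)
    # Upper tail of the chi-square distribution with one degree of freedom
    p_value = erfc(sqrt(g / 2.0))
    return g, p_value

# Morphology (15 of 219 traits) versus gene sequences (33 of 4325 genes)
g_genome, p_genome = g_test_2x2(15, 219 - 15, 33, 4325 - 33)
# Morphology (15 of 219 traits) versus gene expression (22 of 4325 genes)
g_expr, p_expr = g_test_2x2(15, 219 - 15, 22, 4325 - 22)
print(g_genome, p_genome)
print(g_expr, p_expr)
```

Under these assumptions the two P values fall in the same orders of magnitude as those reported in the text.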

Altogether, these phylogenetic analyses of individual gene sequences, expression levels, and morphological traits strongly suggest that the observed yeast expression variations within and between species are largely neutral.

Expression Variation among Wild Strains Exceeds That between Wild and Nonwild Strains

If gene expression variations among the yeast strains are mainly due to environmental adaptations, the expression level differences among the five wild strains should be relatively small, compared with those between wild and nonwild strains. On the contrary, if expression variations are mainly neutral, expression differences among strains should increase with their genomic distances. Consequently, under the neutral hypothesis and given the genome tree (fig. 1A), expression differences among wild strains are not expected to be smaller, compared with those between wild and nonwild strains. In other words, the neutral and adaptive hypotheses may also be tested by measuring expression differences of individual genes among various strains without making a tree. We used two measures of gene expression differences. First, for each gene, we calculated the mean pairwise difference in expression level among all wild strains (PDww), as well as that between all pairs of wild and nonwild strains (PDwo). The smaller the ratio between PDww and PDwo, the stronger the evidence for adaptation. Second, for each gene, we also compared the variance in expression level among all wild strains (Vw) and that among all strains (Vt). Again, the smaller the ratio between Vw and Vt, the stronger the evidence for adaptation. We plotted the frequency distributions of these two ratios using the actual wild and nonwild strains (fig. 4A and B). As three controls, we plotted the frequency distributions when we swap one or both wild S. cerevisiae strains with their respective nonwild sister strains. However, neither PDww/PDwo nor Vw/Vt is smaller for the actual wild strains than for the three controls (P > 0.5, one-tailed Mann–Whitney U test; fig. 4A and B), consistent with the neutral hypothesis.
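The two per-gene ratios described above can be sketched as follows. The function names, the toy expression values, the use of absolute differences, and the use of the sample variance are all assumptions for illustration; the text does not state these details.

```python
from itertools import combinations, product
from statistics import mean, variance

def pd_ratio(expr, wild, nonwild):
    """PDww / PDwo for one gene; expr maps strain name -> expression level."""
    pd_ww = mean([abs(expr[a] - expr[b]) for a, b in combinations(wild, 2)])
    pd_wo = mean([abs(expr[w] - expr[o]) for w, o in product(wild, nonwild)])
    return pd_ww / pd_wo

def var_ratio(expr, wild):
    """Vw / Vt for one gene (sample variances; an assumption)."""
    v_w = variance([expr[s] for s in wild])
    v_t = variance(list(expr.values()))
    return v_w / v_t

# Toy gene with three "wild" and two "nonwild" strains (hypothetical values)
toy_expr = {"w1": 10.0, "w2": 12.0, "w3": 11.0, "o1": 20.0, "o2": 22.0}
toy_wild = ["w1", "w2", "w3"]
toy_nonwild = ["o1", "o2"]
toy_pd = pd_ratio(toy_expr, toy_wild, toy_nonwild)
toy_var = var_ratio(toy_expr, toy_wild)
print(toy_pd, toy_var)
```

In this toy gene the wild strains are much more similar to one another than to the nonwild strains, so both ratios are well below 1, which is the pattern the adaptive hypothesis would predict. A control is obtained by recomputing the ratios after swapping strain labels between the wild and nonwild lists.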

Fig. 4.

Gene expression variances among wild strains and among all strains. (A) Frequency distribution of the logarithm of the ratio between the mean expression difference between wild strains and that between wild and nonwild strains (“Wild”, black bars). As controls, the same quantity is plotted when one (“YPS606←→Y9” and “DBVPG1788←→BC187”) or both (“Swap both”) of the wild S. cerevisiae strains are swapped with their sister nonwild strains in the calculation. The P value from Mann–Whitney U test, measuring the probability that the median value of the observed distribution (black) is equal to or greater than that of a control distribution, is indicated with the same color as the control distribution. (B) Frequency distribution of the logarithm of the ratio between the variance in expression level among the wild strains and that among all strains (“Wild”, black bars). As controls, the same quantity is plotted when one (“YPS606←→Y9” and “DBVPG1788←→BC187”) or both (“Swap both”) of the wild S. cerevisiae strains are swapped with their sister nonwild strains in the calculation. The P value from Mann–Whitney U test, measuring the probability that the median value of the observed distribution is equal to or greater than that of a control distribution, is indicated with the same color as the control distribution. (C) Same as panel A except that morphological data instead of gene expression data are used. (D) Same as panel B except that morphological data instead of gene expression data are used.

For comparison, we performed the same analysis with the morphological data. Interestingly, both PDww/PDwo and Vw/Vt are significantly skewed towards lower values when compared with the three controls (fig. 4C and D), consistent with the adaptive evolution hypothesis of morphological traits (Ho et al. 2016).

Discussion

In this study, we measured genome-wide gene expression levels in nine yeast strains belonging to three closely related species and isolated from five different ecological environments. We repeatedly found that the intra and interspecific variations in gene expression levels can be explained by the neutral accumulation of random mutations but are inconsistent with the simple environmental adaptation hypothesis. Our study has several caveats that are worth considering.

First, the quality of the gene expression data is critical to our conclusion. The sequencing depth in our RNA-seq experiment is high, with ∼70.4 million mapped reads per sample. For the gene with the median expression level in our data, there are on average 93 reads covering each nucleotide. Further, biological replicates show highly similar expressions (supplementary fig. S1, Supplementary Material online) and form a monophyly in transcriptome trees (supplementary fig. S3, Supplementary Material online). Our finding of neutral expression variations is not an artifact of a lack of statistical power in detecting adaptation, because 1) our finding is based on positive evidence for neutrality in addition to negative evidence for adaptation, and 2) adaptive signals were detectable in the morphological data even though the morphological measurements are less accurate than the gene expression measurements.

Second, the laboratory condition under which gene expressions are quantified is important to the test of the neutral and adaptive hypotheses. The oak exudate medium used mimics the natural habitat of wild yeast strains; our previous study showed that, compared with some nonwild strains, a wild strain grows faster in this medium than in the other commonly used media tested (Qian et al. 2012). Thus, using this medium should help reveal environmental adaptations of gene expressions in the wild strains if such adaptations exist. The absence of adaptive signals even in this medium implies the unlikelihood of detecting adaptations in other conditions that are less similar to the natural habitat of the wild strains. Nonetheless, the synthetic oak exudate medium is not identical to the natural habitat of the wild yeast strains, which could have obscured the potential adaptive signals in gene expression variations. However, natural environments fluctuate and gene expression evolution includes the evolution of expression responses to environmental fluctuations, of which the oak exudate medium may be considered one. If gene expression variations are primarily adaptive, one cannot explain why the among-strain variation in the expression response to the oak exudate medium is structured like the genome tree of the strains. By contrast, this pattern of variation is expected if it is due to the random accumulation of neutral mutations. Thus, although the medium used in our experiment is not identical to the natural habitat of the wild yeast strains, our findings regarding the evolutionary mechanism of expression variations are informative. Of course, this conclusion should be further verified under other relevant conditions in the future.

One might wonder why we included in our study four S. cerevisiae strains sampled from four different nonwild environments instead of, for example, two clinical and two distillery strains, which might be useful for studying adaptations to each of these two nonwild environments. The reason is that our expression profiling used only one medium that mimics the wild environment. Consequently, the expression data are not particularly powerful for detecting adaptations to nonwild environments when they do not even show signals of adaptation to the wild environment. In other words, the alternative sampling strategy would not be helpful. In the future, when expression data are collected from multiple media, such a sampling strategy would be helpful.

Third, although the ecological environments of the five wild strains are overall similar, these environments may still differ in temperature, humidity, day length, etc. Consequently, one could argue that the adaptive hypothesis does not necessarily predict similar gene expressions among the wild strains. While this argument may be valid, the adaptive hypothesis cannot explain the significant topological similarity between the transcriptome tree and the genome tree, because the differences among the environments of the nine strains are certainly not represented by the genome tree. The similarity between the transcriptome and genome trees, in contrast to the dissimilarity between the morphology and genome trees, strongly supports the neutral explanation of the expression variations among the nine yeast strains, particularly in the light of the recent finding of adaptive variations of the morphological traits examined (Ho et al. 2016).

Fourth, we assumed that environmental adaptation means that there is a single optimal expression level or a continuous range of equally optimal expression levels for a given gene in an environment. This assumption, however, may not be correct for all genes. Strains from the same ecological environment but with different genetic backgrounds could have different optimal gene expression levels, due to different genetic interactions that have accumulated since the strains diverged from one another. For example, let X and Y be two genes with virtually identical functions such that it is the total expression level of X and Y that is optimized by natural selection. Under this scenario, the optimal expression level of X will vary among wild strains depending on the expression level of Y. If this or other scenarios of genetic interactions apply to most genes in the genome, our analysis would be incapable of testing the adaptive hypothesis, because what is potentially selectively optimized is not the expression levels of individual genes but unknown mathematical functions of the expression levels of multiple genes. However, for the following reason, such genetic interactions cannot be widespread. We previously found that deleting certain genes in a lab strain of yeast can increase the yeast fitness under a given environment, which led us to predict that these genes should have been down-regulated in strains well adapted to that environment (Qian et al. 2012). Interestingly, this prediction is usually correct (Qian et al. 2012), suggesting that genes with fitness effects upon deletion are rarely subject to the aforementioned type of genetic interaction, because otherwise the prediction could not have been so good. This consideration suggests that the necessary assumption required for our rejection of the adaptive hypothesis is likely satisfied for most genes. Whether the same is true of mutations milder than gene deletion and in interspecific comparisons, however, remains an open question. In the case of morphological traits, the assumption apparently holds for most traits, because otherwise signals of adaptation predicted under this assumption should not have been detected.

While our study suggests the paucity of adaptive expression variations among the yeast strains studied, it does not exclude the possibility of a small fraction of genes whose expression variations are largely caused by adaptations. In fact, adaptive expressional differences among S. cerevisiae strains have been suggested for some genes (Bullard et al. 2010; Fraser et al. 2010; Qian et al. 2012). Furthermore, because we analyzed only one-to-one orthologous genes among the nine strains, adaptive expression evolution of non-one-to-one orthologous genes, including recently duplicated genes as well as orthologs of recently lost genes, remains untested, although a previous yeast study suggested that duplicate genes contribute to adaptation more often via protein function changes than expression changes (Qian and Zhang 2014). It should be pointed out, however, that identifying an adaptive signal in the expression difference of a gene between two strains does not necessarily mean that their expression difference is entirely or even largely explained by adaptation. The following hypothetical example illustrates this point. The optimal expression level for a gene in strain A is anywhere between 5 and 10 units, while its optimal expression level in strain B is anywhere between 25 and 100 units. If we observe that the expression level of the gene is seven units in strain A and 82 units in strain B, the fraction of their expression difference that is potentially adaptive is only (25 − 10)/(82 − 7) = 20%. Adaptive signals can be detected in some tests even when the fraction of expression variation explainable by adaptation is low (Qian et al. 2012), whereas some seemingly adaptive patterns, such as the similarity in the expression levels of the 22 genes among the wild strains in figure 3A, are better explained by neutral evolution upon a closer examination. If only a small fraction of expression differences between two populations are adaptive, the adaptive expression differences likely arise early in their environmental adaptations, because the fitness advantages of successive fixations of mutations in an adaptation process are expected to decline exponentially (Orr 1998). It is thus not surprising that three parallel populations of yeast under continuous aerobic growth in glucose-limited chemostats for 250–500 generations showed parallel expression changes for some genes (Ferea et al. 1999), although it is unclear whether the number of genes with parallel expression changes significantly exceeds the random expectation. Notably, even during early stages of adaptations, the amount of adaptive expression changes may still be limited, compared with neutral changes. For example, in a large-scale experimental evolution study that subjected eight populations of yeast to each of three conditions (glucose limitation, sulfate limitation, and phosphate limitation) for 100–400 generations of mitotic growth, populations selected under the same conditions do not form monophyletic clades in the transcriptome tree (Gresham et al. 2008). Regardless, these experimental evolution studies were not designed to distinguish between the neutral and adaptive hypotheses and hence do not provide critical evidence for or against each hypothesis.

Recently, Keren et al. measured the fitness effects on a laboratory yeast strain by individually altering the expression levels of ∼100 genes (Keren et al. 2016). Due to the limited sensitivity of the laboratory fitness quantification, a measured fitness effect <1% cannot be distinguished from no effect. We thus estimated the “neutral” expression range, for which the measured fitness effect is ≤1%. For 69% of the 78 genes that can be analyzed (Keren et al. 2016), the neutral expression range is wider than the actual expression range observed in the nine strains studied here, explaining why the observed expression variations within and between species appear neutral for most genes. The above comparison is of course approximate, because the media used in the laboratory fitness measurement differ from that used in our study, the genetic background stays unchanged in the fitness measurement but varies among our nine strains, and the sensitivity of the fitness assay is lower than that of natural selection. Nevertheless, the comparison shows that the direct fitness measures are not inconsistent with our conclusion.

Except for one early microarray study that suggested no purifying selection (Khaitovich et al. 2004), there is ample evidence for and general agreement on the action of purifying selection in gene expression evolution (Denver et al. 2005; Jordan et al. 2005; Rifkin et al. 2005; Liao and Zhang 2006), which may be detected by showing conservation in gene expression level between distantly related species (Jordan et al. 2005; Liao and Zhang 2006), reduced rate of expression change in evolution compared with that in mutation accumulation experiments (Denver et al. 2005; Rifkin et al. 2005), and significantly lower expression variations than neutral expectations (Rohlfs et al. 2014). That early suggestion was based on a lack of significant difference in the rate of expression evolution between intact genes and expressed pseudogenes (Khaitovich et al. 2004), which could have been due to low statistical power caused by the inclusion of too few expressed pseudogenes and/or the action of purifying selection on expressed pseudogenes (Khachane and Harrison 2009; Podlaha and Zhang 2010; Xu and Zhang 2016). While our study is not intended to detect purifying selection in gene expression evolution, our data are consistent with the action of purifying selection. For instance, for each gene, we measured the coefficient of variation in expression level among all nine strains (CVt) and that among the five wild strains (CVw), and found both to be negatively correlated with the importance of the gene measured by the fitness reduction caused by deleting the gene in the oak exudate medium (Qian et al. 2012) (for CVt: ρ = −0.095, P = 0.0001; for CVw: ρ = −0.081, P = 0.0009). This pattern is explainable by stronger purifying selection acting on the expression levels of more important genes.

After the removal of deleterious expression variations by purifying selection, the remaining expression variations that are observed can be neutral or adaptive. We found that they are largely neutral rather than adaptive. Given the relatively high expression variations observed (mean CVt = 0.39 and mean CVw = 0.36), this conclusion seems to be at odds with the view that gene expression levels are tightly regulated and consequently should show little neutral variation. For instance, an elegant experiment on the Escherichia coli lactose operon suggests that protein expression levels are finely tuned according to the cost and benefit of gene expression and protein production, which vary depending on the environment (Dekel and Alon 2005). Similarly, by considering the energy cost of protein synthesis, Wagner estimated that natural selection would prohibit a >2% increase in protein concentration above the optimal level for any gene that is more highly expressed than the median gene expression level in yeast (Wagner 2005). One possibility that could potentially resolve the apparent conflict between these findings and our results is the existence of posttranscriptional regulations that minimize the downstream consequences of variations in mRNA concentrations. Indeed, several studies have shown that protein concentrations are generally more conserved evolutionarily than mRNA concentrations (Schrimpf et al. 2009; Laurent et al. 2010) and that mRNA concentration differences between species are often offset by differences in translation (Khan et al. 2013; Artieri and Fraser 2014; McManus et al. 2014). The much smaller energy cost of mRNA synthesis than that of protein synthesis (Wagner 2005) also permits a larger range of neutral variation in mRNA concentration. These considerations lead us to hypothesize that the adaptive fraction of intra and interspecific variations in protein concentration is greater than the adaptive fraction of gene expression variations. With the rapid progress of quantitative proteomics, this hypothesis may be tested in the near future.

In terms of how directly a trait impacts the organism-level phenotype, the four types of traits discussed in this work can be ranked as organismal morphology, protein concentration, mRNA concentration, and genome sequence. Because natural selection acts at the organism level, it seems plausible that the more directly a trait affects the organism-level phenotype, the higher the probability that adaptation contributes to its natural variation. This hypothesis is supported by the present study and can be further tested when the aforementioned comparative proteomic data become available.

The role of stochasticity in genotypic evolution is well recognized, while that in phenotypic evolution is less appreciated and agreed upon. Our finding that natural variations in gene expression level, a molecular phenotype, are generally shaped by stochastic genetic drift rather than deterministic adaptation expands the role of stochasticity in evolution. It is likely that the role of stochasticity in evolution, compared with that of adaptation, is generally reduced as one moves from traits that impact the organismic phenotype less directly to those that impact it more directly.

It has been heatedly debated whether phenotypic adaptations seen at the organism level are mainly caused by protein sequence/function changes or gene expression changes, especially those brought about by alterations of cis-regulatory sequences (Hoekstra and Coyne 2007; Stern and Orgogozo 2008). We previously provided evidence supporting the hypothesis that evolution of morphological traits is more often caused by gene expression changes while that of physiological traits is more often owing to protein function changes (Liao et al. 2010). Regardless, our finding that gene expression variations are largely neutral should reduce our expectation that an organismic adaptation is caused by expression changes (see Materials and Methods).

Given the huge effective population size of yeast, our finding that yeast expression variations are largely neutral suggests that the same would apply to species with smaller effective population sizes, which include almost all multicellular organisms. Some gene expression studies from invertebrates (Rifkin et al. 2003; Israel et al. 2016) and vertebrates (Oleksiak et al. 2002) reported at most a tiny fraction of adaptive expression variations and are thus consistent with our prediction, but a stronger test of neutral expression variation is warranted. Note that the observation (Liao and Zhang 2006; Brawand et al. 2011) that the transcriptomes of multiple organs from several mammalian species are clustered by organ rather than species (e.g., human liver transcriptome is closer to mouse liver transcriptome than to human heart transcriptome) does not distinguish between the neutral and adaptive hypotheses, because this clustering is predicted by both hypotheses due to the fact that different organs originated prior to the emergence of mammals and that they have distinct functions. It will be of great interest and importance to test the prediction that intra and interspecific expression variations in multicellulars are mostly neutral.

Materials and Methods

Yeast Genome Sequences and Phylogenetic Analysis

The genome sequences of all S. cerevisiae strains used here (supplementary table S1, Supplementary Material online) were obtained from a recently completed yeast population genomic study that sequenced 85 S. cerevisiae strains from a diverse array of ecological and geographic origins (Maclean et al. 2017). S. cerevisiae genomic annotations were downloaded from SGD (Cherry et al. 2012). The two S. paradoxus and one S. mikatae genome sequences and their annotations were previously published (Scannell et al. 2011). We first identified reciprocal best hits (RBH) between S. cerevisiae and each of the other two species in a species-wise tBLASTx search (Camacho et al. 2009) among all annotated genes, using an E value cutoff of $10^{-4}$. To avoid the complication of gene expression changes after gene duplication, we needed to exclude paralogs generated after the separation of the three species and include only one-to-one orthologs among the species. To this end, we removed from the above RBH gene list any gene that is the best hit of a gene from either of the other two yeasts but not on the list, resulting in a set of 4,625 one-to-one orthologous genes. We further removed those genes that contain undetermined nucleotides in coding regions due to incomplete genomic sequencing. Our final list had 4,325 genes.
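The RBH step and the subsequent paralog filter can be sketched for a single pair of species as below. This is one reading of the procedure, not the authors' pipeline: the dictionaries of best hits are assumed to have been parsed from the tBLASTx output already, the gene names are invented, and the study applies the filter across three species rather than two.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    return {(a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}

def one_to_one_orthologs(best_a_to_b, best_b_to_a):
    """Drop RBH pairs whose member is also the best hit of a gene outside the RBH list."""
    rbh = reciprocal_best_hits(best_a_to_b, best_b_to_a)
    a_listed = {a for a, _ in rbh}
    b_listed = {b for _, b in rbh}
    # Genes that are the best hit of some unlisted gene (putative recent paralogs)
    b_hit_by_unlisted = {b for a, b in best_a_to_b.items() if a not in a_listed}
    a_hit_by_unlisted = {a for b, a in best_b_to_a.items() if b not in b_listed}
    return {(a, b) for a, b in rbh
            if a not in a_hit_by_unlisted and b not in b_hit_by_unlisted}

# Toy best-hit tables (hypothetical gene names); b2p is a paralog of b2
toy_a_to_b = {"a1": "b1", "a2": "b2", "a3": "b3"}
toy_b_to_a = {"b1": "a1", "b2": "a2", "b2p": "a2", "b3": "a3"}
toy_rbh = reciprocal_best_hits(toy_a_to_b, toy_b_to_a)
toy_orthologs = one_to_one_orthologs(toy_a_to_b, toy_b_to_a)
print(sorted(toy_rbh))
print(sorted(toy_orthologs))
```

In the toy input, the pair (a2, b2) is a reciprocal best hit but is removed because the unlisted gene b2p also has a2 as its best hit, which suggests a duplication in species B.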

We aligned the coding sequences of each of the 4,325 genes from the nine yeast strains by MACSE (Ranwez et al. 2011) and removed alignment gaps. The aligned sequences of each gene were then used by PHYLIP (Felsenstein 1989) to estimate F84 pairwise nucleotide distances and reconstruct a neighbor-joining (NJ) tree (Saitou and Nei 1987) of the nine strains. To reconstruct the genome tree, we first concatenated the coding sequence alignments of all genes and then estimated F84 distances and built an NJ tree using PHYLIP. Statistical support for each interior branch of the genome tree was assessed by bootstrapping the 4,325 genes 1,000 times. We used a distance method of tree-making for all types of data in this study because of the lack of other phylogenetic methods that can handle all of the different types of data analyzed here.

RNA Sequencing and Transcriptome Analysis

Each of the nine yeast strains was streaked to form single colonies from frozen glycerol stocks held at −80 °C onto YPD plates (1% yeast extract, 2% peptone, 2% glucose, 2% agar). After 48 h of growth at 30 °C, a single colony was picked and inoculated into 5 ml of the synthetic oak exudate medium (1% sucrose, 0.5% fructose, 0.5% glucose, 0.1% yeast extract, and 0.15% peptone) (Murphy et al. 2006). Strains were grown for 24 h at 30 °C before dilution into fresh synthetic oak exudate medium to an OD660 of 0.1. Cultures were grown at 30 °C until OD660 = 0.5 (mid-log phase), at which point cells were harvested by centrifugation. RNA-seq libraries were prepared following a previous study (Nagalakshmi et al. 2008). Briefly, total RNA was extracted from each population using RiboPure-Yeast Kit (Ambion) and treated with DNase I to remove any contaminant DNA. Extraction of mRNA was carried out using MicroPoly(A)Purist Kit (Ambion) and 200 ng of the resulting mRNA sample was fragmented (Fragmentation Buffer, NEB) before ethanol precipitation. First strand cDNA synthesis was performed using random hexamer priming (Superscript II, Invitrogen), followed by second strand cDNA synthesis (Invitrogen) as recommended by the manufacturer. End repair, A-tailing, and ligation of the Illumina adapters necessary for sequencing were then carried out using the NEBnext mRNA sample preparation kit (NEB). Libraries were then size-selected by agarose gel electrophoresis followed by gel extraction such that libraries consisted of fragments containing inserts of ∼250 bp in length. Polymerase-chain-reaction amplification was performed for 15 cycles using NEBnext mRNA sample preparation kit, before single-end sequencing on the Illumina GAII platform was performed at the University of Michigan Sequencing Core. Sequencing statistics are listed in supplementary table S1, Supplementary Material online.

All raw read sequences generated by RNA-seq were first processed by cutadapt (Martin 2011) to trim any remaining adaptor sequences. The trimmed reads were then aligned to the genome of the corresponding strains by tophat (Pollier et al. 2013) under the default parameter set except that a maximal intron size of 10 kb was allowed because the largest annotated intron in S. cerevisiae is 9349 bp (Cherry et al. 2012). Alignment results were fed to cufflinks (Pollier et al. 2013) for quantification of known transcripts in S. cerevisiae (Cherry et al. 2012), S. paradoxus (Scannell et al. 2011), and S. mikatae (Scannell et al. 2011). Unless otherwise noted, all gene expression levels used in our analyses are in the unit of RPKM (reads per kilobase of transcript per million mapped reads). All RNA-seq reads as well as estimated gene expression levels have been deposited in NCBI with a GEO ID of GSE81320.
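For readers unfamiliar with the unit, the textbook definition of RPKM can be written as a one-line function. This is only the standard definition with invented example numbers; cufflinks estimates transcript abundance with a more elaborate statistical model, so its output need not equal this simple ratio.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

# Hypothetical gene: 500 reads on a 2,000-bp transcript, 10 million mapped reads
example_rpkm = rpkm(500, 2000, 10_000_000)
print(example_rpkm)
```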

The expression levels of the 4325 genes used for constructing the genome tree were analyzed. These genes each have at least one RNA-seq read in at least one of the nine strains. Let $X_{ij}$ be the expression level in RPKM of gene i in strain j and $\bar{X}_i$ be the mean expression level of gene i in the nine strains. The Euclidean distance in expression level of gene i between strains j and k is defined by $d_{ijk} = \sqrt{(X_{ij} - X_{ik})^2} = |X_{ij} - X_{ik}|$. We then used this distance measure to build the NJ tree of the nine strains for gene i.

To analyze the transcriptome data as a whole, for each gene i, we converted the raw expression levels of the nine strains to standardized expression levels by $Y_{ij} = (X_{ij} - \bar{X}_i)/S_i$, where $\bar{X}_i$ is the mean and $S_i$ is the standard deviation of the expression level of gene i among the nine strains. We calculated the average transcriptomic Euclidean distance per gene between strains j and k using the standardized expression levels of n = 4325 genes by $d_{jk} = \sqrt{\sum_{i=1}^{n} (Y_{ij} - Y_{ik})^2 / n}$. We then built the NJ tree using these distances. The confidence of the transcriptome tree was assessed by bootstrapping the 4325 genes 10,000 times. We similarly built an expression tree for each GO, biochemical pathway, or phenotype using the per gene average standardized Euclidean distances calculated from the standardized expression levels of the genes that belong to the GO, participate in the biochemical pathway, or are associated with the phenotype, respectively.
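The standardization and the per-gene average distance described above can be sketched in plain Python. Two points are assumptions of this sketch rather than statements from the text: the sample standard deviation is used for each gene, and the square root is taken over the mean squared difference. The toy matrix has three genes and three strains with invented values.

```python
from math import sqrt
from statistics import mean, stdev

def standardize(levels):
    """Scale one gene's expression levels to mean 0 and (sample) SD 1."""
    m, s = mean(levels), stdev(levels)
    return [(x - m) / s for x in levels]

def transcriptome_distances(expr):
    """expr: list of genes, each a list of expression levels (one per strain).

    Returns a symmetric strain-by-strain matrix of per-gene average distances.
    """
    y = [standardize(gene) for gene in expr]
    n_genes, n_strains = len(y), len(y[0])
    d = [[0.0] * n_strains for _ in range(n_strains)]
    for j in range(n_strains):
        for k in range(j + 1, n_strains):
            dist = sqrt(sum((g[j] - g[k]) ** 2 for g in y) / n_genes)
            d[j][k] = d[k][j] = dist
    return d

toy_matrix = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [5.0, 5.0, 8.0]]
toy_dist = transcriptome_distances(toy_matrix)
print(toy_dist)
```

The resulting matrix is the kind of input that a neighbor-joining routine takes; tree building itself is left out of the sketch.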

Yeast Morphological Data and Analysis

The data of 219 morphological traits from nine strains were obtained from two studies (Yvert et al. 2013; Ho et al. 2016). The original data contained 220 traits (Ho et al. 2016), but one of them (trait ID A103_C) was not used because the data from strain YJM145 were missing. The phylogenetic analysis using these traits was conducted in exactly the same way as the analysis using the gene expression data. The NJ trees for the expression and morphological data were built using the APE package (Paradis et al. 2004).

Generation of Random Trees

We generated random trees (in terms of topology) of the nine strains by repeatedly clustering two randomly chosen strains at a time until all nine strains are clustered; after two strains are clustered, they together are considered as a strain in the next round of clustering.
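The random clustering procedure can be sketched as follows, with a tree represented as nested tuples. The representation, function names, and strain labels are hypothetical choices made for illustration.

```python
import random


def random_tree(tips, rng=None):
    """Join two randomly chosen clusters at a time until one cluster remains."""
    rng = rng or random.Random()
    clusters = list(tips)
    while len(clusters) > 1:
        a, b = rng.sample(range(len(clusters)), 2)
        merged = (clusters[a], clusters[b])
        # The merged pair is treated as a single "strain" in the next round.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters[0]


def leaves(tree):
    """Return the tip labels of a nested-tuple tree."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


# Hypothetical labels standing in for the nine strains.
strain_labels = [f"s{i}" for i in range(1, 10)]
tree = random_tree(strain_labels, random.Random(0))
```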

Topological Distance between Two Trees

Given an unrooted tree structure, each (internal or external) branch connects two sets of tips. In other words, each branch represents a bipartition of the tips. The topological distance between two unrooted trees of the same set of tips is twice the number of internal branches defining different bipartitions of the tips between the two trees (Penny and Hendy 1985).
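A minimal sketch of this bipartition-based distance is given below for trees written as nested tuples. Each internal branch is recorded as the set of tips on the side that does not contain a fixed reference tip, and the distance is the number of bipartitions found in only one of the two trees; for two fully resolved trees this equals twice the number of differing internal branches. The tree representation and all names are hypothetical.

```python
def tip_labels(tree):
    """Collect tip labels of a nested-tuple tree."""
    if isinstance(tree, tuple):
        labels = []
        for child in tree:
            labels.extend(tip_labels(child))
        return labels
    return [tree]


def bipartitions(tree):
    """Nontrivial bipartitions, each stored as the side lacking the reference tip."""
    all_tips = frozenset(tip_labels(tree))
    reference = min(all_tips)
    splits = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_tips - clade if reference in clade else clade
        # Skip external branches and the root, which carry no information.
        if 1 < len(side) < len(all_tips) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def topological_distance(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    if set(tip_labels(tree_a)) != set(tip_labels(tree_b)):
        raise ValueError("trees must share the same set of tips")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Hypothetical five-tip trees.
tree_1 = ((("a", "b"), "c"), ("d", "e"))
tree_2 = ((("a", "c"), "b"), ("d", "e"))
tree_3 = ((("a", "d"), "b"), ("c", "e"))
```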

Principal Component Analysis (PCA)

To perform PCA of the genome sequences of the nine strains, we used the concatenated multiple sequence alignment of the coding sequences of all 4325 genes. Each site of each sequence in the alignment was converted to a four-dimensional vector, whose four components are assigned 1 (or 0) based on the appearance (or not) of A, C, G, and T at this site, respectively. In other words, a sequence with L nucleotides was converted to a vector of length 4L. The alignment of the nine strains was thus converted to nine vectors with “aligned” components. For each gene or morphological trait, the expression levels or morphological trait values from the nine strains were first standardized to zero mean and unit variance before PCA. PCA was conducted using the “prcomp” function in the “stats” package in R (R Core Team 2013).
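The sequence encoding described above can be sketched as follows. How gaps and ambiguous bases were handled is not stated in the text; encoding them as four zeros is an assumption made here, and the function name and example sequence are hypothetical.

```python
NUCLEOTIDES = "ACGT"


def one_hot_sequence(sequence):
    """Encode an aligned sequence of length L as a 0/1 vector of length 4L.

    Each site contributes four components, for A, C, G, and T in that order.
    Assumption: gaps and ambiguous characters become four zeros.
    """
    vector = []
    for base in sequence.upper():
        vector.extend(1 if base == nucleotide else 0 for nucleotide in NUCLEOTIDES)
    return vector


# Hypothetical three-site aligned fragment with one gap.
encoded = one_hot_sequence("AG-")
```

Stacking one such vector per strain gives a nine-row matrix that can be passed to a PCA routine.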

Posterior Probabilities of Protein Function and Gene Expression Changes

The ratio of the posterior probability that an organismic phenotypic adaptation is caused by an expression change, P(E|A), to the posterior probability that it is caused by a protein function change, P(F|A), can be calculated as [P(A|E)/P(A|F)][P(E)/P(F)] according to Bayes' theorem, where P(E) and P(F) are the prior probabilities of expression changes and protein function changes, respectively, and P(A|E) and P(A|F) are the probabilities of having an organismic phenotypic adaptation conditional on an expression change and a protein function change, respectively. Our result that P(A|E) is smaller than previously thought reduces the expectation that an organismic adaptation is caused by a gene expression change. The biological implication is discussed in the Discussion.
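For completeness, the ratio follows from applying Bayes' theorem to each posterior separately; the marginal probability of adaptation, P(A), cancels:

```latex
\frac{P(E \mid A)}{P(F \mid A)}
  = \frac{P(A \mid E)\,P(E)/P(A)}{P(A \mid F)\,P(F)/P(A)}
  = \frac{P(A \mid E)}{P(A \mid F)} \cdot \frac{P(E)}{P(F)}
```

With the prior ratio P(E)/P(F) held fixed, a smaller P(A|E) lowers the posterior ratio proportionally.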

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

This article is based on the 2016 Nei Lecture by J.Z. We thank members of the Zhang lab and two anonymous reviewers for valuable comments. This work was supported by the research grant R01GM103232 from the U.S. National Institutes of Health to J.Z.

References

  1. Artieri CG, Fraser HB. 2014. Evolution at two levels of gene expression in yeast. Genome Res. 24:411–421.
  2. Brawand D, Soumillon M, Necsulea A, Julien P, Csardi G, Harrigan P, Weier M, Liechti A, Aximu-Petri A, Kircher M, et al. 2011. The evolution of gene expression levels in mammalian organs. Nature 478:343–348.
  3. Bullard JH, Mostovoy Y, Dudoit S, Brem RB. 2010. Polygenic and directional regulatory evolution across pathways in Saccharomyces. Proc Natl Acad Sci U S A. 107:5058–5063.
  4. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10:421.
  5. Carroll SB. 2008. Evo-devo and an expanding evolutionary synthesis: a genetic theory of morphological evolution. Cell 134:25–36.
  6. Cherry JM, Hong EL, Amundsen C, Balakrishnan R, Binkley G, Chan ET, Christie KR, Costanzo MC, Dwight SS, Engel SR, et al. 2012. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. 40:D700–D705.
  7. Darwin C. 1859. On the origin of species by means of natural selection. London: J. Murray.
  8. Dekel E, Alon U. 2005. Optimality and evolutionary tuning of the expression level of a protein. Nature 436:588–592.
  9. Denver DR, Morris K, Streelman JT, Kim SK, Lynch M, Thomas WK. 2005. The transcriptional consequences of mutation and natural selection in Caenorhabditis elegans. Nat Genet. 37:544–548.
  10. Dujon B. 2006. Yeasts illustrate the molecular mechanisms of eukaryotic genome evolution. Trends Genet. 22:375–387.
  11. Endler JA. 1986. Natural selection in the wild. Princeton, NJ: Princeton University Press.
  12. Felsenstein J. 1989. PHYLIP: Phylogeny Inference Package (Version 3.2). Cladistics 5:164–166.
  13. Ferea TL, Botstein D, Brown PO, Rosenzweig RF. 1999. Systematic changes in gene expression patterns following adaptive evolution in yeast. Proc Natl Acad Sci U S A. 96:9721–9726.
  14. Fraser HB, Moses AM, Schadt EE. 2010. Evidence for widespread adaptive evolution of gene expression in budding yeast. Proc Natl Acad Sci U S A. 107:2977–2982.
  15. Futuyma DJ. 2013. Evolution. Sunderland, MA: Sinauer Associates Inc.
  16. Gilad Y, Rifkin SA, Bertone P, Gerstein M, White KP. 2005. Multi-species microarrays reveal the effect of sequence divergence on gene expression profiles. Genome Res. 15:674–680.
  17. Gresham D, Desai MM, Tucker CM, Jenq HT, Pai DA, Ward A, DeSevo CG, Botstein D, Dunham MJ. 2008. The repertoire and dynamics of evolutionary adaptations to controlled nutrient-limited environments in yeast. PLoS Genet. 4:e1000303.
  18. Ho W-C, Ohya Y, Zhang J. 2016. Testing the neutral hypothesis of phenotypic evolution. BioRxiv. doi: 10.1101/089987.
  19. Hoekstra HE, Coyne JA. 2007. The locus of evolution: evo devo and the genetics of adaptation. Evolution 61:995–1016.
  20. Israel JW, Martik ML, Byrne M, Raff EC, Raff RA, McClay DR, Wray GA. 2016. Comparative developmental transcriptomics reveals rewiring of a highly conserved gene regulatory network during a major life history switch in the sea urchin genus Heliocidaris. PLoS Biol. 14:e1002391.
  21. Johnson LJ, Koufopanou V, Goddard MR, Hetherington R, Schafer SM, Burt A. 2004. Population genetics of the wild yeast Saccharomyces paradoxus. Genetics 166:43–52.
  22. Jordan IK, Marino-Ramirez L, Koonin EV. 2005. Evolutionary significance of gene expression divergence. Gene 345:119–126.
  23. Keren L, Hausser J, Lotan-Pompan M, Vainberg Slutskin I, Alisar H, Kaminski S, Weinberger A, Alon U, Milo R, Segal E. 2016. Massively parallel interrogation of the effects of gene expression levels on fitness. Cell 166:1282–1294: e1218.
  24. Khachane AN, Harrison PM. 2009. Assessing the genomic evidence for conserved transcribed pseudogenes under selection. BMC Genomics 10:435.
  25. Khaitovich P, Weiss G, Lachmann M, Hellmann I, Enard W, Muetzel B, Wirkner U, Ansorge W, Paabo S. 2004. A neutral model of transcriptome evolution. PLoS Biol. 2:E132.
  26. Khan Z, Ford MJ, Cusanovich DA, Mitrano A, Pritchard JK, Gilad Y. 2013. Primate transcript and protein expression levels evolve under compensatory selection pressures. Science 342:1100–1104.
  27. Kimura M. 1968. Evolutionary rate at the molecular level. Nature 217:624–626.
  28. Laurent JM, Vogel C, Kwon T, Craig SA, Boutz DR, Huse HK, Nozue K, Walia H, Whiteley M, Ronald PC, et al. 2010. Protein abundances are more conserved than mRNA abundances across diverse taxa. Proteomics 10:4209–4212.
  29. Liao BY, Weng MP, Zhang J. 2010. Contrasting genetic paths to morphological and physiological evolution. Proc Natl Acad Sci U S A. 107:7353–7358.
  30. Liao BY, Zhang J. 2006. Evolutionary conservation of expression profiles between human and mouse orthologous genes. Mol Biol Evol. 23:530–540.
  31. Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.
  32. Lynch M. 2007. The origins of genome architecture. Sunderland, MA: Sinauer Associates Inc.
  33. Maclean CJ, Metzger BPH, Yang J-R, Ho W-C, Moyers B, Zhang J. 2017. Deciphering the genic basis of yeast fitness variation by simultaneous forward and reverse genetics. Mol Biol Evol. doi: 10.1093/molbev/msx151.
  34. Martin M. 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17:10–12.
  35. Mayr E. 2001. What evolution is. New York: Basic Books.
  36. McManus CJ, May GE, Spealman P, Shteyman A. 2014. Ribosome profiling reveals post-transcriptional buffering of divergent gene expression in yeast. Genome Res. 24:422–430.
  37. Murphy HA, Kuehne HA, Francis CA, Sniegowski PD. 2006. Mate choice assays and mating propensity differences in natural yeast populations. Biol Lett. 2:553–556.
  38. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M. 2008. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320:1344–1349.
  39. Nei M. 1987. Molecular evolutionary genetics. New York: Columbia University Press.
  40. Nei M. 2013. Mutation-driven evolution. Oxford: Oxford University Press.
  41. Oleksiak MF, Churchill GA, Crawford DL. 2002. Variation in gene expression within and among natural populations. Nat Genet. 32:261–266.
  42. Orr HA. 1998. The population genetics of adaptation: the distribution of factors fixed during adaptive evolution. Evolution 52:935–949.
  43. Paradis E, Claude J, Strimmer K. 2004. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20:289–290.
  44. Penny D, Hendy MD. 1985. The use of tree comparison metrics. Syst Biol. 34:75–82.
  45. Podlaha O, Zhang J. 2010. Pseudogenes and their evolution. In: Encyclopedia of life sciences. Chichester, UK: John Wiley & Sons; p. 1–8.
  46. Pollier J, Rombauts S, Goossens A. 2013. Analysis of RNA-Seq data with TopHat and Cufflinks for genome-wide expression analysis of jasmonate-treated plants and plant cultures. Methods Mol Biol. 1011:305–315.
  47. Qian W, Ma D, Xiao C, Wang Z, Zhang J. 2012. The genomic landscape and evolutionary resolution of antagonistic pleiotropy in yeast. Cell Rep. 2:1399–1410.
  48. Qian W, Zhang J. 2014. Genomic evidence for adaptation by gene duplication. Genome Res. 24:1356–1362.
  49. R Core Team. 2013. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
  50. Ranwez V, Harispe S, Delsuc F, Douzery EJ. 2011. MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons. PLoS One 6:e22594.
  51. Rifkin SA, Houle D, Kim J, White KP. 2005. A mutation accumulation assay reveals a broad capacity for rapid evolution of gene expression. Nature 438:220–223.
  52. Rifkin SA, Kim J, White KP. 2003. Evolution of gene expression in the Drosophila melanogaster subgroup. Nat Genet. 33:138–144.
  53. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. 2015. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43:e47.
  54. Rohlfs RV, Harrigan P, Nielsen R. 2014. Modeling gene expression evolution with an extended Ornstein–Uhlenbeck process accounting for within-species variation. Mol Biol Evol. 31:201–211.
  55. Saitou N, Nei M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 4:406–425.
  56. Scannell DR, Zill OA, Rokas A, Payen C, Dunham MJ, Eisen MB, Rine J, Johnston M, Hittinger CT. 2011. The awesome power of yeast evolutionary genetics: new genome sequences and strain resources for the Saccharomyces sensu stricto Genus. G3 (Bethesda) 1:11–25.
  57. Schrimpf SP, Weiss M, Reiter L, Ahrens CH, Jovanovic M, Malmstrom J, Brunner E, Mohanty S, Lercher MJ, Hunziker PE, et al. 2009. Comparative functional analysis of the Caenorhabditis elegans and Drosophila melanogaster proteomes. PLoS Biol. 7:e48.
  58. Sniegowski PD, Dombrowski PG, Fingerman E. 2002. Saccharomyces cerevisiae and Saccharomyces paradoxus coexist in a natural woodland site in North America and display different levels of reproductive isolation from European conspecifics. FEMS Yeast Res. 1:299–306.
  59. Stern DL, Orgogozo V. 2008. The loci of evolution: how predictable is genetic evolution? Evolution 62:2155–2177.
  60. Wagner A. 2005. Energy constraints on the evolution of gene expression. Mol Biol Evol. 22:1365–1374.
  61. Wagner GP, Kin K, Lynch VJ. 2012. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 131:281–285.
  62. Wang QM, Liu WQ, Liti G, Wang SA, Bai FY. 2012. Surprisingly diverged populations of Saccharomyces cerevisiae in natural environments remote from human activity. Mol Ecol. 21:5404–5417.
  63. Warringer J, Zorgo E, Cubillos FA, Zia A, Gjuvsland A, Simpson JT, Forsmark A, Durbin R, Omholt SW, Louis EJ, et al. 2011. Trait variation in yeast is defined by population history. PLoS Genet. 7:e1002111.
  64. Xu J, Zhang J. 2016. Are human translated pseudogenes functional? Mol Biol Evol. 33:755–760.
  65. Yvert G, Ohnuki S, Nogami S, Imanaga Y, Fehrmann S, Schacherer J, Ohya Y. 2013. Single-cell phenomics reveals intra-species variation of phenotypic noise in yeast. BMC Syst Biol. 7:54.

Articles from Molecular Biology and Evolution are provided here courtesy of Oxford University Press
