Abstract
With the rise of high-throughput sequencing technologies, a holistic view of genetic variation within populations, through population genomics studies, appears feasible, although it remains an ongoing effort. Genetic variation arises from a diverse range of evolutionary forces, with mutation and recombination being key drivers in shaping genomes. Studying genetic variation within a population represents a crucial first step in understanding the relationship between genotype and phenotype and the evolutionary history of species. In this context, the budding yeast Saccharomyces cerevisiae has been at the forefront of population genomic studies. In addition, it has a complex history that involves adaptation to a wide range of wild and human-related ecological niches. Although more than 3,000 diverse isolates have been sequenced to date, no resource currently brings together sequencing data and associated metadata for all sequenced isolates. To perform a comprehensive analysis of the population structure of S. cerevisiae, we collected genome sequencing data from 3,034 natural isolates and processed the data uniformly. We determined ploidy levels, identified single nucleotide polymorphisms (SNPs), small insertion–deletions (InDels), copy number variations (CNVs), and aneuploidies across the population, creating a publicly accessible resource for the yeast research community. Interestingly, we showed that this population captures ∼93% of the species diversity. Using neighbor-joining and Bayesian methods, we redefined the populations, revealing clustering patterns primarily based on ecological origin. This work represents a valuable resource for the community, and efforts have been made to make it extensible and easy to integrate into future yeast population studies.
Keywords: population genomics, population structure, polymorphisms, genetic diversity
Introduction
Species naturally exhibit a certain degree of genetic diversity among individuals. This diversity contributes to a significant fraction of observed phenotypic variation and is crucial to both the survival and evolutionary potential of a species. In fact, it has long been established that genetic diversity provides the raw material on which natural selection can act (Fisher 1930). The advent of whole-genome sequencing at the population level has facilitated the analysis of intraspecific genomic diversity, giving rise to the field of population genomics. Short-read sequencing of large populations allows for the construction of dense maps of the genetic polymorphisms, including single nucleotide polymorphisms (SNPs), small insertion-deletions (InDels), and copy number variations (CNVs). Subsequently, analyses of these polymorphisms can provide insights into external forces that shape genomes, such as bottlenecks or selection occurring during domestication processes as well as environmental changes.
The budding yeast Saccharomyces cerevisiae, given its small genome, high genetic diversity, and complex domestication history, has emerged as a widely used model organism for population genomics. This species has colonized a wide range of wild and human-related habitats and, as a result, has been subject to multiple evolutionary constraints. Isolates have been found in primary and secondary forests across the world, with a notable prevalence in Asia; in the fermentation processes of several beverages, including wine, beer, sake, and tequila; and as opportunistic human pathogens (McCusker et al. 1994; Wang et al. 2012; Parapouli et al. 2020; Gallegos-Casillas et al. 2024). A multitude of population genomics analyses have permitted the description of wild and domesticated populations at the genetic level. Initial studies on a small number of individuals (between 36 and 63) revealed the existence of well-defined populations corresponding to different ecological origins, with the notable identification of wine and sake clades (Liti et al. 2009; Schacherer et al. 2009). The grouping of geographically diverse wine isolates in a single clade, coupled with the low diversity of this clade, points to a single domestication event associated with a bottleneck as the probable origin of all wine isolates. In addition to clades originating from a single ancestry, signals of high admixture were detected in some isolates. Subsequent analysis of 100 genomes with a particular focus on clinical isolates revealed that clinical isolates are overrepresented among admixed isolates, suggesting that outcrossing facilitated adaptation to this novel ecological niche (Strope et al. 2015). Using the same dataset, regions of high diversity were identified as introgressions from the sister species of S. cerevisiae, namely Saccharomyces paradoxus.
A high number of introgressed genes were reported in isolates that were subsequently assigned to the Alpechin lineage, derived from olive oil production wastewater (Peter et al. 2018; Pontes et al. 2019). To better characterize populations found in human-related niches, a later survey focused on 157 industrial isolates, revealing hallmarks of domestication (Gallone et al. 2016). Notably, beer isolates exhibit high ploidy levels, high heterozygosity, and impaired sexual reproduction. Further work on Asian isolates demonstrated a single domestication event for all Asian fermentation isolates (Duan et al. 2018). Moreover, wild isolates predominantly found in Asia display a greater level of diversity and a lower heterozygosity compared to domesticated isolates. Efforts were made to assemble and sequence a large collection of 1,011 diverse natural isolates, allowing the precise definition of clades within the species (Peter et al. 2018). This revealed a single out-of-China origin for S. cerevisiae, followed by a complex history of domestication and adaptation to a wide range of niches, shaped by introgressions and loss of heterozygosity events. Additional niche-specific populations were later sequenced, with a notable addition of wild isolates sampled in Taiwanese forests (Lee et al. 2022). Although more than 3,000 isolates have now been sequenced, the datasets are scattered and a comprehensive reference of all studies and data is not available. While populations larger than 1,011 were gathered to investigate the gene-based pangenome of the species (Li et al. 2019; Wang et al. 2024a, 2024b) or to focus on introgressions (Tellini et al. 2024), no further refinement of the population structure and the evolutionary history has been achieved.
To conduct a comprehensive analysis of the population structure of S. cerevisiae, we collected genome sequences for 3,034 natural isolates and processed the data in a standardized manner. We inferred ploidy and detected SNPs, InDels, CNVs, and aneuploidies in the population to build a publicly available resource for the yeast community. Neighbor-joining and Bayesian approaches were applied to redefine the clades, which showed clustering primarily based on ecological origin. By providing the raw gvcf files and standard protocols, we aim to make this resource easily expandable as new isolates are sequenced.
Materials and methods
Data collection
We gathered Illumina sequencing data for 3,039 Saccharomyces cerevisiae natural isolates from 29 publications, with a minimum sequencing depth of 20×. Reads were mapped to the R64 reference genome using bwa-mem2 v2.2.1 (Vasimuddin et al. 2019) with default parameters, and alignments were sorted with samtools sort v1.15.1 (Danecek et al. 2021).
SNPs and InDels calling
A multisample VCF containing both variant and non-variant positions was generated following the GATK Germline short variant discovery workflow (https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-), with version 4.2.3.0 (Poplin et al. 2018). The raw VCF was filtered with bcftools v1.15.1 (Danecek et al. 2021) for quality (DP ≥ 10, GQ ≥ 20), missing genotypes (missing genotypes per sample below 20% and missing genotypes per locus below 1%), and excess of heterozygosity (ExcHet > 0.99). Out of the 3,039 samples, 3,034 passed the missing genotype filters and were considered for further analyses. Variant calls were further separated into 2 files, one containing SNPs and one containing InDels, each along with non-variant positions. Complex loci spanning both SNPs and InDels were filtered out. All further VCF filtering was performed with bcftools, unless otherwise mentioned. Callable sites mentioned in the results section refer to the non-variant positions in addition to SNP loci (loci where InDels or both InDels and SNPs are present were not considered for the SNP analysis).
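The genotype-level filters described above can be illustrated with a minimal Python sketch. It is not the bcftools pipeline used in this work: the in-memory data layout, the function names, and the order of operations (samples filtered before loci) are assumptions made for illustration, and the ExcHet filter is not covered.

```python
# Thresholds quoted in the text; the data layout and names are illustrative.
MIN_DP = 10
MIN_GQ = 20
MAX_MISSING_PER_SAMPLE = 0.20
MAX_MISSING_PER_LOCUS = 0.01


def mask_low_quality(calls):
    """Turn genotypes that fail the DP/GQ thresholds into None (missing).

    calls: list of loci, each a list of (genotype, DP, GQ) tuples, one per sample.
    """
    return [
        [gt if gt is not None and dp >= MIN_DP and gq >= MIN_GQ else None
         for gt, dp, gq in locus]
        for locus in calls
    ]


def missing_fraction(values):
    """Fraction of None entries; an empty list counts as fully missing."""
    if not values:
        return 1.0
    return sum(v is None for v in values) / len(values)


def filter_missing(matrix):
    """Return (kept_sample_indices, kept_locus_indices).

    Assumption: samples are filtered first, then loci are evaluated on the
    remaining samples; the text does not state the order.
    """
    n_samples = len(matrix[0])
    kept_samples = [
        s for s in range(n_samples)
        if missing_fraction([locus[s] for locus in matrix]) < MAX_MISSING_PER_SAMPLE
    ]
    kept_loci = [
        i for i, locus in enumerate(matrix)
        if missing_fraction([locus[s] for s in kept_samples]) < MAX_MISSING_PER_LOCUS
    ]
    return kept_samples, kept_loci
```

With a per-locus threshold of 1%, a locus in a toy matrix of a few samples is dropped as soon as one genotype is missing; in a population of about 3,000 isolates the same threshold tolerates roughly 30 missing genotypes per locus.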
Zygosity and ploidy estimation
For each sample, the number of heterozygous SNPs was estimated by first extracting each sample from the multisample SNP VCF (bcftools view -s), then keeping only heterozygous positions (bcftools view -i 'GT="het"'). Heterozygosity was computed as the ratio of heterozygous SNPs over the total number of callable sites in the SNP VCF (9,368,983 positions). Samples containing more than 500 heterozygous SNPs (1,756 samples) were considered heterozygous, and the rest were considered homozygous (1,278 samples).
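The heterozygosity computation and the 500-SNP threshold described above can be sketched as follows. The genotype string parsing and the function names are assumptions for illustration; only the number of callable sites and the threshold come from the text.

```python
CALLABLE_SITES = 9_368_983   # callable sites in the SNP VCF (from the text)
HET_SNP_THRESHOLD = 500      # samples with more heterozygous SNPs are "heterozygous"


def is_heterozygous(genotype):
    """True if a VCF-style genotype such as '0/1' or '0/0/1' carries >1 allele."""
    alleles = genotype.replace("|", "/").split("/")
    return "." not in alleles and len(set(alleles)) > 1


def summarize_sample(genotypes, callable_sites=CALLABLE_SITES,
                     threshold=HET_SNP_THRESHOLD):
    """Count heterozygous SNPs and classify one sample."""
    n_het = sum(is_heterozygous(gt) for gt in genotypes)
    return {
        "n_het": n_het,
        "heterozygosity": n_het / callable_sites,
        "zygosity": "heterozygous" if n_het > threshold else "homozygous",
    }
```

Note that the threshold is strict: a sample with exactly 500 heterozygous SNPs is classified as homozygous under this reading of "more than 500".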
For samples coming from Peter et al. 2018, Lee et al. 2022, and Marr et al. 2023 (1,207 samples in total), ploidy information was retrieved from the original papers, as it was estimated experimentally using flow cytometry. For the remaining samples, ploidy was estimated from sequencing data using nQuire (Weiß et al. 2018). Samples with known ploidy were used to benchmark three ways of estimating the ploidy with nQuire: (i) maximizing the log-likelihood with nQuire lrdmodel, (ii) taking the lowest sum of squared residuals, or (iii) taking the highest regression R² from the nQuire histotest command. The last approach had the best success rate (73.8, 90.5, and 93.6% for the three approaches, respectively) and was used to estimate the ploidy of the remaining samples. As ploidy cannot be reliably estimated from sequencing data for homozygous samples, the ploidy of these samples was set to unknown (648 samples).
CNV calling and aneuploidy detection
CNVs along the genome were estimated for each sample with a depth-based method, CNVnator v0.4.1 (Abyzov et al. 2011), using a 1-kb sliding window size. The CNV of each coding sequence (CDS) was computed as the median normalized sequencing depth of the CDS. In practice, for each position of the reference genome, we assigned a normalized sequencing depth of either 1 when the region was not detected as a CNV or the value of the Normalized_RD column of the CNVnator output. The median normalized sequencing depth was then calculated for each CDS. A chromosome was considered as aneuploid when more than half of its length was in CNV.
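The per-CDS value and the aneuploidy rule described above can be sketched in a few lines of Python. This is an illustration, not the code used for the analysis: the tuple layout of the CNV calls, the 1-based inclusive coordinates, and the assumption that calls do not overlap are all hypothetical.

```python
from statistics import median


def per_base_depth(chrom_length, cnv_calls):
    """Normalized depth per position: 1.0 outside CNVs, Normalized_RD inside.

    cnv_calls: list of (start, end, normalized_rd), 1-based inclusive and
    assumed non-overlapping (illustrative layout, not the CNVnator format).
    """
    depth = [1.0] * chrom_length
    for start, end, normalized_rd in cnv_calls:
        for pos in range(start - 1, end):
            depth[pos] = normalized_rd
    return depth


def cds_copy_number(depth, cds_start, cds_end):
    """Median normalized depth over a CDS (1-based inclusive coordinates)."""
    return median(depth[cds_start - 1:cds_end])


def is_aneuploid(chrom_length, cnv_calls):
    """A chromosome is aneuploid when more than half of its length is in CNV."""
    in_cnv = sum(end - start + 1 for start, end, _ in cnv_calls)
    return in_cnv > chrom_length / 2
```

A CDS that is half inside a duplicated region gets an intermediate value (for example 1.25 when the region has a normalized depth of 1.5), which is why the resource reports normalized depths rather than integer copy numbers.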
Clades definition
Clades were defined with a hybrid method using fastStructure (Raj et al. 2014) and manual refining based on the neighbor-joining tree. fastStructure was run on biallelic SNPs with a minor allele frequency above 5%, pruned for linkage disequilibrium with plink 1.9 (Chang et al. 2015; plink --indep-pairwise 50 1 0.5), corresponding to a total of 25,194 SNPs. fastStructure was run with K from 2 to 50. The chooseK command was used to identify the best number of components in the population (Model components used to explain structure in data: K = 38). The neighbor-joining tree was constructed with 1,918,693 SNPs using the R packages SNPRelate (Zheng et al. 2017) and ape (Paradis and Schliep 2019). Thirty-eight groups were constructed based on the K = 38 ancestry components, taking samples with more than 60% of their ancestry in a single component. Groups were further manually refined to remove outliers on the neighbor-joining tree. Groups with unclear segregation on the tree (within wine isolates) were merged and manually clustered based on the phylogeny. Despite being in the same ancestry component, Mexican agave and French Guiana samples were split into 2 groups because of their ecological origin difference and their clear separation on the tree. In total, 39 clades were obtained. Four superclades were then constructed by grouping clades according to concordant ecological origin and phylogeny. A total of 9, 4, 4, and 5 clades constitute the Wine, Beer, Asian Fermentation, and Wild superclades, respectively.
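The first, automated step of the clade definition (grouping samples by their dominant ancestry component) can be sketched as below. The input layout and function name are hypothetical, and the sketch assumes that "more than 60% of ancestry" refers to a single component; the manual refinement on the tree is not reproduced.

```python
def assign_groups(ancestry, threshold=0.60):
    """Assign each sample to its dominant ancestry component.

    ancestry: dict mapping sample name -> list of K ancestry proportions
    (illustrative layout, e.g. parsed from fastStructure mean Q values).
    Returns a dict mapping sample name -> component index, or None when no
    component exceeds the threshold (the sample is left unassigned).
    """
    groups = {}
    for sample, proportions in ancestry.items():
        best = max(range(len(proportions)), key=proportions.__getitem__)
        groups[sample] = best if proportions[best] > threshold else None
    return groups
```

Samples returned as None in this sketch correspond to isolates whose ancestry is spread over several components.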
Results
Species-wide diversity
To create a comprehensive set of sequenced isolates, we filtered public sequence databases for Illumina whole-genome sequences of S. cerevisiae strains with a minimum sequencing depth of 20×. All genetically modified isolates were discarded. In the end, we collected whole-genome sequencing data for a total of 3,039 isolates of S. cerevisiae generated through 29 population genomics surveys (Dunn et al. 2012; Zheng et al. 2012; Almeida et al. 2015; Hose et al. 2015; Marsit et al. 2015; Song et al. 2015; Strope et al. 2015; Barbosa et al. 2016; Borneman et al. 2016; Franco-Duarte et al. 2016; Gallone et al. 2016; Gayevskiy et al. 2016; Zhu et al. 2016; Coi et al. 2017; Kita et al. 2017; Maclean et al. 2017; Duan et al. 2018; Peter et al. 2018; Preiss et al. 2018; Fay et al. 2019; Basile et al. 2021; Han et al. 2021; Higgins et al. 2021; Ruiz et al. 2021; Lee et al. 2022; Marr et al. 2023; Morard et al. 2023; Avelar-Rivas et al. 2024; Ward et al. 2024). The isolates were sampled in 94 countries across six continents (Fig. 1a) and have a large variety of ecological origins. The collection includes isolates involved in many human-related processes, including winemaking, beer fermentation, spirits production, and baking, as well as clinical and wild isolates (Supplementary Table 1). The sequencing reads of all 3,039 isolates were mapped to the reference genome to infer the genetic variants in the population. A total of 1,918,693 SNPs and 58,947 InDels were detected across 3,034 isolates; the remaining five were discarded because of a high fraction of missing genotypes. SNPs account for 20.5% of callable sites on the reference genome. The population nucleotide diversity (median π = 3.6 × 10⁻³) is slightly higher than the previous estimate based on 1,011 isolates (Peter et al. 2018). We find a mean number of 51,924 ± 22,907 SNPs (0.6% of callable sites) between 2 randomly selected isolates (Fig. 1b), and a maximum number of 168,614 SNPs (1.8%) between a wild Taiwanese isolate (CEI) and a Mexican isolate from agave distillation (XTRA_FIU). The level of heterozygosity within the population is highly variable, with 1,278 (42%) isolates being homozygous and others having up to 65,785 heterozygous SNPs. Ploidy information based on flow-cytometry data was retrieved from the original publications for 1,207 isolates. For the rest of the population, the ploidy of heterozygous isolates was estimated from allelic frequencies. The ploidy of homozygous isolates without flow-cytometry data (648 isolates) could not be retrieved solely from sequencing data, leading to a total of 2,386 isolates with ploidy information. Although most of the isolates are diploid (74% of isolates with known or estimated ploidy), a large variation in ploidy level is observed within the population. Polyploids, having more than 2 copies of the genome, account for 20% of the isolates, and 6% are haploid. As expected, the rate of heterozygosity increases with the number of genome copies (Fig. 1c, ρ = 0.5767823, P-value < 2.2 × 10⁻¹⁶). Seven haploid isolates present a considerable number of heterozygous sites (from 666 to more than 4,500 heterozygous SNPs), because of the presence of aneuploidies or large segmental duplications in their genomes. Aneuploidies were detected based on chromosomal sequencing depth for each isolate and are frequent in the population, with 927 isolates (31%) having at least one aneuploid chromosome. As with the heterozygosity rate, we observe an increase in the frequency of aneuploidies in isolates with a high ploidy (Fig. 1d). While 28% of the diploid isolates are aneuploid, this proportion rises to 52% for polyploid isolates. The frequency of aneuploidies is also correlated to the length of the chromosome, as 74% of the observed events affect the 4 shortest chromosomes [chromosomes 1, 3, 6, and 9 (Supplementary Fig. 1)].
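As a quick arithmetic check, the percentages of callable sites quoted above follow from the 9,368,983 callable positions reported in the Methods. The helper name below is illustrative.

```python
CALLABLE_SITES = 9_368_983  # callable sites in the SNP VCF (Methods)


def percent_of_callable(n_sites, callable_sites=CALLABLE_SITES):
    """Express a number of sites as a percentage of callable sites."""
    return 100 * n_sites / callable_sites


for label, n_sites in [
    ("all SNPs", 1_918_693),
    ("mean pairwise difference", 51_924),
    ("maximum pairwise difference", 168_614),
]:
    print(f"{label}: {percent_of_callable(n_sites):.1f}%")
# all SNPs: 20.5%
# mean pairwise difference: 0.6%
# maximum pairwise difference: 1.8%
```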
Fig. 1.
Origin and diversity of 3,034 natural isolates. a) Map of the geographical origin of the isolates. The geographic coordinates were inferred using the R package tidygeocoder (Cambon et al. 2021). b) Distribution of the number of SNPs between 2 randomly selected isolates. The vertical line represents the mean number of 51,925 SNPs. The percentage of genetic diversity is the ratio of the number of sites to the number of callable sites. c) Percentage of heterozygosity for each ploidy level. Heterozygosity is the ratio of the number of heterozygous SNPs to the number of callable sites. d) Number of euploid and aneuploid isolates per ploidy level. e) SNPs rarefaction curves, using all SNPs and non-singleton SNPs. The curve for all SNPs was fitted with a power law curve (of the general form $y = a x^{b}$), and the curve with non-singleton SNPs with a Michaelis–Menten equation (of the general form $y = \frac{V_{\max}\, x}{K_m + x}$). The horizontal line represents the estimated total number of non-singleton SNPs.
Estimation of the species diversity captured by the 3,034 population
To estimate the fraction of species diversity captured by the population, we built rarefaction curves of the genetic variations in the population using SNPs and InDels as markers, taking all variants and non-singleton variants (i.e. variants present in at least 2 isolates). The increase in the number of variants as a function of the number of isolates follows a power law, which makes it impossible to extrapolate a total, as the predicted number of variants tends toward infinity (Fig. 1e, Supplementary Fig. 2–3). However, since the number of non-singleton variants follows a Michaelis–Menten equation, it is possible to estimate the total number of non-singleton variants in the species as the limit of the equation with an infinite number of isolates. Accordingly, our population covers 93.2% of non-singleton SNPs and 92.9% of non-singleton InDels in the species. In comparison, the 1,011 S. cerevisiae collection (Peter et al. 2018) covers 60.8% of the species diversity (SNPs-based, 60.9% using InDels). It should be noted that this estimation corresponds to the upper bound of the species diversity fraction covered by our dataset. The addition of a new divergent population would considerably increase the number of non-singleton variants, thereby reducing the diversity captured by the 3,034 isolates.
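The idea of reading the species-wide total from the asymptote of a Michaelis–Menten curve can be sketched as follows. This is not the fitting procedure used for Fig. 1e: to stay dependency-free, the sketch swaps in a Lineweaver–Burk linearization (ordinary least squares on 1/y against 1/x), which weights points differently from a nonlinear fit. Function and variable names are illustrative.

```python
def fit_michaelis_menten(n_isolates, n_variants):
    """Estimate (v_max, k_m) of y = v_max * x / (k_m + x).

    Uses the double-reciprocal form 1/y = 1/v_max + (k_m / v_max) * (1/x)
    and ordinary least squares; v_max is the asymptote, i.e. the estimated
    total number of non-singleton variants.
    """
    xs = [1.0 / n for n in n_isolates]
    ys = [1.0 / v for v in n_variants]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    v_max = 1.0 / intercept
    k_m = slope * v_max
    return v_max, k_m


def captured_fraction(observed_variants, v_max):
    """Fraction of the estimated total already observed in the sample."""
    return observed_variants / v_max
```

On noise-free points generated from a Michaelis–Menten curve the linearization recovers the parameters; on real rarefaction data a nonlinear least-squares fit would usually be preferred, and the resulting fraction remains an upper bound for the reason given above.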
A broader view of S. cerevisiae population structure
We sought to redefine the structure of the S. cerevisiae population by constructing a neighbor-joining tree of the 3,034 isolates based on 1,918,693 SNPs. In addition, population structure was inferred based on 25,194 common and pruned SNPs, using an ideal number of K = 38 ancestry components to explain the population structure (Supplementary Fig. 4). Both approaches were combined and manually refined to define 39 clades within the population (Fig. 2, Supplementary Fig. 5, Supplementary Tables 1-2). The clades were named based on the geographical or ecological origin of the majority of the constituent isolates. Additionally, 4 superclades were defined by grouping clades with similar ecological origins and consistent positions on the phylogenetic tree. Thus, the Wine (1,247 isolates), Beer (310), Asian Fermentation (305), and Wild (343) superclades were defined (Fig. 2). The number of isolates per clade ranges from nine (32. US clinical 3) to 273 (8. AU Wine 4), with a median of 46 isolates per clade. A total of 321 isolates (10.6%) were not assigned to any clade or superclade due to their admixture of multiple ancestry components. These isolates represent a variety of ecological origins, including 64 from clinical sources, 73 from natural environments, and 115 from diverse fermentation processes. We observe a sampling bias toward domesticated isolates, with 2,177 isolates, compared to wild (390 isolates) and other (146 isolates) ecological origins. In comparison to the 1,011 S. cerevisiae dataset (Peter et al. 2018), the 3,034 population contains novel genetic clades as well as a globally increased sample size. The collection mainly gained Australian wine isolates (Ward et al. 2024), which form specific clades diverging from the previous wine clades, and Asian fermentation isolates (Duan et al. 2018; Han et al. 2021) that add novel diversity around the sake clade. These additional clades explain the increase in nucleotide diversity of the global population. 
There is also a substantial increase in the sample size of Chinese and Taiwanese wild isolates, although the diversity brought by these clades was already represented in the 1,011 population (Lee et al. 2022). A notable increase in the sample size of the Wine superclade now permits the observation of the Alpechin clade branching within the wine isolates (Fig. 2, Supplementary Fig. 5). This suggests that the Alpechin lineage originated from domesticated wine isolates that colonized this niche and hybridized with S. paradoxus isolates.
Fig. 2.
Neighbor-joining tree built using 1,918,693 SNPs. Isolates are colored according to the 39 clades and 4 superclades that were identified. Subtrees of the 4 superclades are magnified (S1. Wine, S2. Beer, S3. Asian Fermentation, and S4. Wild).
Differences in diversity between clades
A variable level of diversity is observed across clades. Notably, the diversity observed at the ploidy level is highly clade-dependent. Polyploid isolates are almost exclusively found in the Beer and Asian Fermentation superclades, suggesting that polyploidization may be a hallmark of these domestication processes (Fig. 3a). An enrichment for polyploid isolates is significant not only in Beer isolates (odds ratio = 0.13, P-value < 2.2 × 10⁻¹⁶), as previously described, but also in Baijiu isolates, where 35% are polyploids (odds ratio = 0.34, P-value = 8.7 × 10⁻⁷). However, despite domestication, the Wine superclade exhibits a similar proportion of polyploids compared to Wild isolates (odds ratio = 1.6, P-value = 0.15). In addition to the reported variation in ploidy, a variation in heterozygosity level is also observed. The heterozygosity of Beer isolates is significantly higher than that of the rest of the population (W = 794969, P-value < 2.2 × 10⁻¹⁶), with this difference driven by a positive correlation between high ploidy and high heterozygosity rate (Fig. 3b, Fig. 1c). To a lesser extent, a similar increase in heterozygosity is observed in Asian Fermentation isolates, driven by Baijiu polyploid isolates (W = 245440, P-value < 2.2 × 10⁻¹⁶). Nevertheless, the prevalence of polyploid isolates does not fully account for the elevated heterozygosity observed in domesticated clades. Despite a comparable proportion of polyploid isolates, the heterozygosity rate in the Wine superclade is slightly higher than that of Wild isolates (W = 224532, P-value = 0.022). This indicates that heterozygosity may be a second hallmark of domestication. Finally, a variation in the pairwise nucleotide diversity is also evident across clades. Despite the increased sample size for Wine isolates, the Wine superclade still exhibits the lowest diversity, providing evidence for the strong bottleneck that occurred during the domestication of Wine isolates (Fig. 3c).
While wild isolates have a greater diversity globally than domesticated clades, isolates of the Chinese Wild clade exhibit particularly high diversity compared to the other clades. This clade, previously defined as the Taiwanese clade, has been reported as having descended from the ancestral wild S. cerevisiae population (Peter et al. 2018).
Fig. 3.
Diversity per clade: a) ploidy level per clade, b) fraction of heterozygous positions per clade, and c) pairwise nucleotide diversity per clade, computed over 10-kb windows sliding by 1-kb steps.
Creation of a shareable resource
In addition to providing an overview of the natural population of S. cerevisiae sequenced so far, this work aims to build an accessible resource that catalogs sequenced isolates as well as natural genetic variants in the species. The metadata for the whole population was gathered, including geographic and ecological origin (Supplementary Table 1). Additionally, zygosity, ploidy (when available), and aneuploidies were systematically referenced for each isolate (Supplementary Table 1). To catalog small genetic variants, we used a standardized and commonly used variant calling pipeline (see Methods) and provide catalogs of high-quality SNPs and InDels (Loegler et al. 2024a). Furthermore, CNV data were estimated for each gene of the reference genome in each isolate. As there is no standard way to detect and encode CNVs, we decided not to provide a number of gene copies, but rather give a normalized sequencing depth for each gene (see Methods). By providing minimally processed data, we hope to prevent any potential bias that may arise from choosing an arbitrary depth threshold to state the presence of a gene. Users remain free to transform these data according to their own needs. This catalog of small variants and CNVs will provide the yeast community with information about the occurrence and frequency of natural variants in the species.
Moreover, the inclusion of all or part of 3,034 isolates in future genomic studies may be highly valuable to locate newly sequenced isolates within the species' phylogeny. However, processed variant matrices cannot be properly combined with novel sequencing data, as the sequencing data for each isolate must be handled in the exact same way. Collecting and processing raw sequencing reads from a previously sequenced large population can be extremely time-consuming, as sequencing files are large (more than 5 terabytes for the 3,034 isolates) and scattered across databases. To facilitate the proper integration of the 3,034 population in future genomic analyses, we provide the gvcf file of each isolate, which can be combined with novel sequencing data to build a multisample vcf (Loegler et al. 2024b). A detailed protocol will make this operation straightforward to implement (https://haploteam.github.io/ProtocolSacePopulation).
Discussion
Using the sequenced genomes publicly available for more than 3,000 isolates of S. cerevisiae, we established a large catalog of the small variants and CNVs of the species and redefined the clades present in this population. Most findings from analyses of 1,011 isolates have been confirmed, including the high ploidy and heterozygosity of Beer isolates, the high diversity of wild isolates, and the bottleneck associated with domestication of Wine isolates (Peter et al. 2018). Nevertheless, our dataset contains three times as many isolates as were used in this previous structure analysis. We notably highlight 3 domesticated superclades, which include Wine, Beer, and Asian fermentation isolates, and 1 wild superclade. While the wild superclade originates mainly from Asia, as previously known, the presence of isolates from North and South America within the same genetic cluster suggests a broader geographic distribution of this clade. Further effort in sampling wild isolates across the world is required to confirm this observation. The differences in ploidy, heterozygosity, and nucleotide diversity of the domesticated superclades suggest unique domestication processes, and considering all domesticated isolates as a single group for comparison with wild isolates is not appropriate. While many studies argue for the possibility of multiple domestication events for S. cerevisiae (Fay and Benavides 2005; Schacherer et al. 2009; Gallone et al. 2016; Peter et al. 2018), further research is necessary to gain a precise understanding of the demographic history of the species. To address this question, our dataset provides the right material to perform complex demographic modeling (Schraiber and Akey 2015).
Considering the multiplicity of S. cerevisiae sequencing datasets and the difficulties in collecting them, we aim to establish this work as a valuable resource for the yeast community by giving access to the catalog of genetic variants and metadata for this population. As the number of sequenced isolates keeps increasing, we also provide the raw gvcf files that can be used to build the multisample variant matrices, along with a standardized protocol to process the reads (https://haploteam.github.io/ProtocolSacePopulation). This will facilitate the inclusion of all or part of the 3,034 population in future population genomic studies.
A major limitation of our study lies in the type of genetic variation detected. The use of short-read sequencing data does not allow for the reliable detection of structural variants (SVs) in the population, which is essential for the generation of a comprehensive catalog of genetic variants at the population level. Moreover, the variation in the gene content can only be partially assessed, as contiguous genome assemblies are needed for the proper construction of a gene-based pangenome. In light of this, long-read sequencing has been conducted in S. cerevisiae on smaller populations, offering a preliminary description of the SV landscape (Istace et al. 2017; Lee et al. 2022; O’Donnell et al. 2023). However, given the limited sample size, the SVs detected did not reach saturation, leaving a fraction of the genetic diversity uncaptured. By extrapolating the results obtained from telomere-to-telomere genome assemblies of 142 isolates (O’Donnell et al. 2023), it can be assumed that a population of at least 500 individuals would be needed to catalog SVs in an exhaustive manner. The combination of high-quality genome assemblies for such a population and the dataset we have assembled will provide a comprehensive characterization of the genetic variation present within the species.
Acknowledgments
We thank Omar Abou Saada as well as all the members of the HaploTeam (https://www.haploteam.org/) for the fruitful discussions and their valuable advice.
Contributor Information
Victor Loegler, Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, 67000, France.
Anne Friedrich, Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, 67000, France.
Joseph Schacherer, Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, 67000, France; Institut Universitaire de France (IUF), Paris, 75005, France.
Data availability
The code used for all analyses and figures and the catalogs of SNPs, InDels, and CNVs are available in a first Zenodo repository: https://doi.org/10.5281/zenodo.12580561 (Loegler et al. 2024a). The gvcf files are available in a second repository: https://doi.org/10.5281/zenodo.12571280 (Loegler et al. 2024b).
Supplemental material available at G3 online.
Funding
This work was supported by a National Institutes of Health grant R01 (GM147040-01) and a European Research Council Consolidator grant no. 772505 to JS.
Literature cited
- Abyzov A, Urban AE, Snyder M, Gerstein M. 2011. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 21(6):974–984. doi: 10.1101/gr.114876.110.
- Almeida P, Barbosa R, Zalar P, Imanishi Y, Shimizu K, Turchetti B, Legras J-L, Serra M, Dequin S, Couloux A, et al. 2015. A population genomics insight into the Mediterranean origins of wine yeast domestication. Mol Ecol. 24(21):5412–5427. doi: 10.1111/mec.13341.
- Avelar-Rivas JA, Sedeño I, García-Ortega LF, Aragon JAU, Mancera E, DeLuna A, Morales L. Recurrent introgression and geographical stratification shape Saccharomyces cerevisiae in the Neotropics. bioRxiv 615306. 28 September 2024, preprint: not peer reviewed, https://www.biorxiv.org/content/10.1101/2024.09.27.615306v1. [accessed 2024 Oct 1].
- Barbosa R, Almeida P, Safar SVB, Santos RO, Morais PB, Nielly-Thibault L, Leducq J-B, Landry CR, Gonçalves P, Rosa CA, et al. 2016. Evidence of natural hybridization in Brazilian wild lineages of Saccharomyces cerevisiae. Genome Biol Evol. 8(2):317–329. doi: 10.1093/gbe/evv263.
- Basile A, De Pascale F, Bianca F, Rossi A, Frizzarin M, De Bernardini N, Bosaro M, Baldisseri A, Antoniali P, Lopreiato R, et al. 2021. Large-scale sequencing and comparative analysis of oenological Saccharomyces cerevisiae strains supported by nanopore refinement of key genomes. Food Microbiol. 97:103753. doi: 10.1016/j.fm.2021.103753.
- Borneman AR, Forgan AH, Kolouchova R, Fraser JA, Schmidt SA. 2016. Whole genome comparison reveals high levels of inbreeding and strain redundancy across the spectrum of commercial wine strains of Saccharomyces cerevisiae. G3 (Bethesda). 6(4):957–971. doi: 10.1534/g3.115.025692.
- Cambon J, Hernangómez D, Belanger C, Possenriede D. 2021. Tidygeocoder: an R package for geocoding. J Open Source Softw. 6(65):3544. doi: 10.21105/joss.03544.
- Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. 2015. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 4(1):s13742-015-0047-8. doi: 10.1186/s13742-015-0047-8.
- Coi AL, Bigey F, Mallet S, Marsit S, Zara G, Gladieux P, Galeote V, Budroni M, Dequin S, Legras JL. 2017. Genomic signatures of adaptation to wine biological ageing conditions in biofilm-forming flor yeasts. Mol Ecol. 26(7):2150–2166. doi: 10.1111/mec.14053.
- Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, et al. 2021. Twelve years of SAMtools and BCFtools. GigaScience. 10(2):giab008. doi: 10.1093/gigascience/giab008.
- Duan S-F, Han P-J, Wang Q-M, Liu W-Q, Shi J-Y, Li K, Zhang X-L, Bai F-Y. 2018. The origin and adaptive evolution of domesticated populations of yeast from far east Asia. Nat Commun. 9(1):2690. doi: 10.1038/s41467-018-05106-7.
- Dunn B, Richter C, Kvitek DJ, Pugh T, Sherlock G. 2012. Analysis of the Saccharomyces cerevisiae pan-genome reveals a pool of copy number variants distributed in diverse yeast strains from differing industrial environments. Genome Res. 22(5):908–924. doi: 10.1101/gr.130310.111.
- Fay JC, Benavides JA. 2005. Evidence for domesticated and wild populations of Saccharomyces cerevisiae. PLoS Genet. 1(1):66–71. doi: 10.1371/journal.pgen.0010005.
- Fay JC, Liu P, Ong GT, Dunham MJ, Cromie GA, Jeffery EW, Ludlow CL, Dudley AM. 2019. A polyploid admixed origin of beer yeasts derived from European and Asian wine populations. PLoS Biol. 17(3):e3000147. doi: 10.1371/journal.pbio.3000147.
- Fisher RA. 1930. The Genetical Theory of Natural Selection: a Complete Variorum Edition. Bennett JH, editor. Oxford: Oxford University Press.
- Franco-Duarte R, Umek L, Mendes I, Castro CC, Fonseca N, Martins R, Silva-Ferreira AC, Sampaio P, Pais C, Schuller D. 2016. New integrative computational approaches unveil the Saccharomyces cerevisiae pheno-metabolomic fermentative profile and allow strain selection for winemaking. Food Chem. 211:509–520. doi: 10.1016/j.foodchem.2016.05.080.
- Gallegos-Casillas P, García-Ortega LF, Espinosa-Cantú A, Avelar-Rivas JA, Torres-Lagunes CG, Cano-Ricardez A, García-Acero ÁM, Ruiz-Castro S, Flores-Barraza M, Castillo A, et al. 2024. Yeast diversity in open agave fermentations across Mexico. Yeast. 41(1–2):35–51. doi: 10.1002/yea.3913.
- Gallone B, Steensels J, Prahl T, Soriaga L, Saels V, Herrera-Malaver B, Merlevede A, Roncoroni M, Voordeckers K, Miraglia L, et al. 2016. Domestication and divergence of Saccharomyces cerevisiae beer yeasts. Cell. 166(6):1397–1410.e16. doi: 10.1016/j.cell.2016.08.020.
- Gayevskiy V, Lee S, Goddard MR. 2016. European derived Saccharomyces cerevisiae colonisation of New Zealand vineyards aided by humans. FEMS Yeast Res. 16(7):fow091. doi: 10.1093/femsyr/fow091.
- Han D-Y, Han P-J, Rumbold K, Koricha AD, Duan S-F, Song L, Shi J-Y, Li K, Wang Q-M, Bai F-Y. 2021. Adaptive gene content and allele distribution variations in the wild and domesticated populations of Saccharomyces cerevisiae. Front Microbiol. 12:631250. doi: 10.3389/fmicb.2021.631250. https://www.frontiersin.org/journals/microbiology/articles/10.3389/fmicb.2021.631250/full. [accessed 2024 Jun 24].
- Higgins P, Grace CA, Lee SA, Goddard MR. 2021. Whole-genome sequencing from the New Zealand Saccharomyces cerevisiae population reveals the genomic impacts of novel microbial range expansion. G3 (Bethesda). 11(1):jkaa027. doi: 10.1093/g3journal/jkaa027.
- Hose J, Yong CM, Sardi M, Wang Z, Newton MA, Gasch AP. 2015. Dosage compensation can buffer copy-number variation in wild yeast. eLife. 4:e05462. doi: 10.7554/eLife.05462.
- Istace B, Friedrich A, d’Agata L, Faye S, Payen E, Beluche O, Caradec C, Davidas S, Cruaud C, Liti G, et al. 2017. De novo assembly and population genomic survey of natural yeast isolates with the Oxford nanopore MinION sequencer. GigaScience. 6(2):1–13. doi: 10.1093/gigascience/giw018. https://academic.oup.com/gigascience/article/6/2/giw018/2865217. [accessed 2020 Nov 10].
- Kita R, Venkataram S, Zhou Y, Fraser HB. 2017. High-resolution mapping of cis-regulatory variation in budding yeast. Proc Natl Acad Sci. 114(50):E10736–E10744. doi: 10.1073/pnas.1717421114.
- Lee TJ, Liu Y-C, Liu W-A, Lin Y-F, Lee H-H, Ke H-M, Huang J-P, Lu M-YJ, Hsieh C-L, Chung K-F, et al. 2022. Extensive sampling of Saccharomyces cerevisiae in Taiwan reveals ecology and evolution of predomesticated lineages. Genome Res. 32(5):864–877. doi: 10.1101/gr.276286.121.
- Li G, Ji B, Nielsen J. 2019. The pan-genome of Saccharomyces cerevisiae. FEMS Yeast Res. 19(7):foz064. doi: 10.1093/femsyr/foz064.
- Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature. 458(7236):337–341. doi: 10.1038/nature07743.
- Loegler V, Friedrich A, Schacherer J. 2024a. Genetic variants catalog of 3,034 natural isolates of Saccharomyces cerevisiae. Zenodo. doi: 10.5281/zenodo.12580561.
- Loegler V, Friedrich A, Schacherer J. 2024b. GVCFs of 3,039 natural isolates of Saccharomyces cerevisiae. Zenodo. doi: 10.5281/zenodo.12571280.
- Maclean CJ, Metzger BPH, Yang J-R, Ho W-C, Moyers B, Zhang J. 2017. Deciphering the genic basis of yeast fitness variation by simultaneous forward and reverse genetics. Mol Biol Evol. 34(10):2486–2502. doi: 10.1093/molbev/msx151.
- Marr RA, Moore J, Formby S, Martiniuk JT, Hamilton J, Ralli S, Konwar K, Rajasundaram N, Hahn A, Measday V. 2023. Whole genome sequencing of Canadian Saccharomyces cerevisiae strains isolated from spontaneous wine fermentations reveals a new Pacific west coast wine clade. G3 (Bethesda). 13(8):jkad130. doi: 10.1093/g3journal/jkad130.
- Marsit S, Mena A, Bigey F, Sauvage F-X, Couloux A, Guy J, Legras J-L, Barrio E, Dequin S, Galeote V. 2015. Evolutionary advantage conferred by an eukaryote-to-eukaryote gene transfer event in wine yeasts. Mol Biol Evol. 32(7):1695–1707. doi: 10.1093/molbev/msv057.
- McCusker JH, Clemons KV, Stevens DA, Davis RW. 1994. Genetic characterization of pathogenic Saccharomyces cerevisiae isolates. Genetics. 136(4):1261–1269. doi: 10.1093/genetics/136.4.1261.
- Morard M, Pérez-Través L, Perpiñá C, Lairón-Peris M, Collado MC, Pérez-Torrado R, Querol A. 2023. Comparative genomics of infective Saccharomyces cerevisiae strains reveals their food origin. Sci Rep. 13(1):10435. doi: 10.1038/s41598-023-36857-z.
- O’Donnell S, Yue J-X, Saada OA, Agier N, Caradec C, Cokelaer T, De Chiara M, Delmas S, Dutreux F, Fournier T, et al. 2023. Telomere-to-telomere assemblies of 142 strains characterize the genome structural landscape in Saccharomyces cerevisiae. Nat Genet. 55(8):1390–1399. doi: 10.1038/s41588-023-01459-y.
- Paradis E, Schliep K. 2019. Ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 35(3):526–528. doi: 10.1093/bioinformatics/bty633.
- Parapouli M, Vasileiadis A, Afendra A-S, Hatziloukas E. 2020. Saccharomyces cerevisiae and its industrial applications. AIMS Microbiol. 6(1):1–31. doi: 10.3934/microbiol.2020001.
- Peter J, De Chiara M, Friedrich A, Yue J-X, Pflieger D, Bergström A, Sigwalt A, Barre B, Freel K, Llored A, et al. 2018. Genome evolution across 1,011 Saccharomyces cerevisiae isolates. Nature. 556(7701):339–344. doi: 10.1038/s41586-018-0030-5.
- Pontes A, Čadež N, Gonçalves P, Sampaio JP. 2019. A quasi-domesticate relic hybrid population of Saccharomyces cerevisiae × S. paradoxus adapted to olive brine. Front Genet. 10:449. doi: 10.3389/fgene.2019.00449. https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2019.00449/full. [accessed 2024 Jun 24].
- Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, Auwera GAV der, Kling DE, Gauthier LD, Levy-Moonshine A, et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv 201178. doi: 10.1101/201178, 24 July 2018, preprint: not peer reviewed.
- Preiss R, Tyrawa C, Krogerus K, Garshol LM, van der Merwe G. 2018. Traditional Norwegian Kveik are a genetically distinct group of domesticated Saccharomyces cerevisiae brewing yeasts. Front Microbiol. 9:2137. doi: 10.3389/fmicb.2018.02137. https://www.frontiersin.org/journals/microbiology/articles/10.3389/fmicb.2018.02137/full. [accessed 2024 Jun 24].
- Raj A, Stephens M, Pritchard JK. 2014. fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics. 197(2):573–589. doi: 10.1534/genetics.114.164350.
- Ruiz J, de Celis M, Martín-Santamaría M, Benito-Vázquez I, Pontes A, Lanza VF, Sampaio JP, Santos A, Belda I. 2021. Global distribution of IRC7 alleles in Saccharomyces cerevisiae populations: a genomic and phenotypic survey within the wine clade. Environ Microbiol. 23(6):3182–3195. doi: 10.1111/1462-2920.15540.
- Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature. 458(7236):342–345. doi: 10.1038/nature07670.
- Schraiber JG, Akey JM. 2015. Methods and models for unravelling human evolutionary history. Nat Rev Genet. 16(12):727–740. doi: 10.1038/nrg4005.
- Song G, Dickins BJA, Demeter J, Engel S, Dunn B, Cherry JM. 2015. AGAPE (automated genome analysis PipelinE) for pan-genome analysis of Saccharomyces cerevisiae. PLoS One. 10(3):e0120671. doi: 10.1371/journal.pone.0120671.
- Strope PK, Skelly DA, Kozmin SG, Mahadevan G, Stone EA, Magwene PM, Dietrich FS, McCusker JH. 2015. The 100-genomes strains, an S. cerevisiae resource that illuminates its natural phenotypic and genotypic variation and emergence as an opportunistic pathogen. Genome Res. 25(5):762–774. doi: 10.1101/gr.185538.114.
- Tellini N, De Chiara M, Mozzachiodi S, Tattini L, Vischioni C, Naumova ES, Warringer J, Bergström A, Liti G. 2024. Ancient and recent origins of shared polymorphisms in yeast. Nat Ecol Evol. 8(4):761–776. doi: 10.1038/s41559-024-02352-5.
- Vasimuddin M, Misra S, Li H, Aluru S. 2019. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). p. 314–324. doi: 10.1109/IPDPS.2019.00041.
- Wang M, Li X, Liu X, Hou X, He Y, Yu J-H, Hu S, Yin H, Xie B-B. 2024b. Annotation of 2,507 Saccharomyces cerevisiae genomes. Microbiol Spectr. 12(4):e0358223. doi: 10.1128/spectrum.03582-23.
- Wang Q-M, Liu W-Q, Liti G, Wang S-A, Bai F-Y. 2012. Surprisingly diverged populations of Saccharomyces cerevisiae in natural environments remote from human activity. Mol Ecol. 21(22):5404–5417. doi: 10.1111/j.1365-294X.2012.05732.x.
- Wang H, Nielsen J, Zhou Y, Lu H. 2024a. Yeast adapts to diverse ecological niches driven by genomics and metabolic reprogramming. bioRxiv 598782. doi: 10.1101/2024.06.14.598782, 17 June 2024, preprint: not peer reviewed, https://www.biorxiv.org/content/10.1101/2024.06.14.598782v1 [accessed 2024 Jun 21].
- Ward CM, Onetto CA, Heuvel SVD, Cuijvers KM, Hale LJ, Borneman AR. 2024. Recombination, admixture and genome instability shape the genomic landscape of Saccharomyces cerevisiae derived from spontaneous grape ferments. PLoS Genet. 20(3):e1011223. doi: 10.1371/journal.pgen.1011223.
- Weiß CL, Pais M, Cano LM, Kamoun S, Burbano HA. 2018. Nquire: a statistical framework for ploidy estimation using next generation sequencing. BMC Bioinformatics. 19(1):122. doi: 10.1186/s12859-018-2128-z.
- Zheng X, Gogarten SM, Lawrence M, Stilp A, Conomos MP, Weir BS, Laurie C, Levine D. 2017. SeqArray—a storage-efficient high-performance data format for WGS variant calls. Bioinformatics. 33(15):2251–2257. doi: 10.1093/bioinformatics/btx145.
- Zheng D-Q, Wang P-M, Chen J, Zhang K, Liu T-Z, Wu X-C, Li Y-D, Zhao Y-H. 2012. Genome sequencing and genetic breeding of a bioethanol Saccharomyces cerevisiae strain YJS329. BMC Genomics. 13(1):479. doi: 10.1186/1471-2164-13-479.
- Zhu YO, Sherlock G, Petrov DA. 2016. Whole genome analysis of 132 clinical Saccharomyces cerevisiae strains reveals extensive ploidy variation. G3 (Bethesda). 6(8):2421–2434. doi: 10.1534/g3.116.029397.