Abstract
Many biological experiments involve studying the differences caused by genetic modifications, including genotypes composed of modifications at more than 1 locus. However, as the genotypes increase in number and complexity, it becomes a major challenge to independently generate and track the necessary number of biological replicate samples. A major development in genetic studies of large numbers of genotypes has been the use of barcode tracking. Inspired by such high-throughput studies, we developed a barcode-based method to track large numbers of independent replicates of a small number of combinatorial genotypes in a pooled format, enabling robust detection of subtle phenotypic differences. To construct a plasmid library of combinatorial genotypes, we utilized a nested serial cloning process to combine gene variants of interest that have associated DNA barcodes. The final plasmids each contain variants of multiple genes of interest, and a combined barcode that specifies the genotype of all the genes while also encoding a random sequence for tracking individual replicates. Sequencing of the pool of barcodes by next-generation sequencing allows the whole population to be studied in a single flask, enabling a high degree of replication even for complex genotypes. Using this approach, we tested the functionality of combinations of yeast, human, and null orthologs of the nucleotide excision repair factor I (NEF-1) complex and found that yeast cells expressing all 3 yeast NEF-1 subunits had superior growth in DNA-damaging conditions. We also assessed the sensitivity of our method by simulating downsampling of barcodes across different degrees of phenotypic differentiation. Our results demonstrate the utility of NICR (nested identification combined with replication) barcodes for high-throughput combinatorial genetic screens and provide a scalable framework for exploring complex genotype–phenotype relationships.
Keywords: DNA barcodes, yeast, Saccharomyces cerevisiae, NER factor I complex, replicates, fungi
Introduction
Reverse genetics experiments, in which specific genetic variants are engineered to study their phenotypic effects, are a foundational tool in biological research. They are critical for understanding gene function at multiple levels: The role of a gene can be determined through its deletion, the molecular mechanism through which it acts can be probed through targeted mutations, and its evolutionary trajectory can be explored by comparing homologous genes. Studying combinations of genetic variants further expands these insights by revealing interactions between genes or mutations. Such studies can determine whether genes have redundant functions (Nowak et al. 1997), determine the order of steps in a pathway (Avery and Wasserman 1992), uncover the molecular basis of physical interactions between or within proteins (Horovitz 1996), and elucidate how genetic variation affects phenotypic divergence and speciation (Brideau et al. 2006).
A challenge in reverse genetics experiments is the need for biological replicates. Studying multiple isogenic individuals per genotype allows researchers to distinguish the effects of genetic variation from environmental variance, experimental noise, and background genetic mutations, enabling more robust statistical conclusions. However, as experiments scale to include both combinatorial genotypes and biological replicates, the required number of samples grows rapidly. Experiments comparing larger numbers of genotypes require even more replicates per genotype to maintain statistical power, yet individually collecting data on more than a few dozen samples is often impractical. This trade-off forces researchers either to limit replicates, reducing statistical power, or to study only a subset of possible genetic combinations, potentially missing important genetic interactions.
Pooled experiments offer a solution to this challenge of scalability. In pooled experiments, a large population of genetically distinct individuals is created, each differing in the sequence of the genes of interest and a phenotypically inert DNA barcode to distinguish biological replicates. The pool of cells is then subjected to an assay that enriches genotypes that have a phenotype of interest, such as growth in a challenging environment or cell-sorting of a fluorescent reporter. Then, next-generation sequencing (NGS) is used to quantitatively identify the enriched genetic variants and their associated barcodes. This framework has been widely used to study the phenotypic effect of variant libraries across diverse biological contexts (Shoemaker et al. 1996; Winzeler et al. 1999; Wong et al. 2016; Weile and Roth 2018; Poelwijk et al. 2019; Liu et al. 2020; Bendel et al. 2024). Here, we apply this strategy to a high number of replicates of combinatorial genotypes rather than a high number of genotypes. In our approach, genotypic variants are successively combined together on a plasmid while generating combinatorial DNA barcodes, which we term nested identification combined with replication (NICR) barcodes. We applied the NICR barcode method to generate combinations of yeast or human homologs of 3 proteins in the nucleotide excision repair factor I (NEF-1) complex and tracked thousands of independent replicates to examine the functionality of the combined NEF-1 complexes in yeast.
Materials and methods
Cloning strategy
Strains, plasmids, and oligonucleotides used in this study are listed in Supplementary Tables 1–3.
We generated 27 plasmid pools, each carrying 3 gene cargos, in a 3-step cloning process. Each step incorporates an additional gene to create the 3-gene complex and expands the NICR barcode to account for this addition.
First, we PCR-amplified the 3 yeast NEF-1 genes (RAD1, RAD10, and RAD14), 3 human NEF-1 genes (ERCC4, ERCC1, and XPA), and 3 noncoding mouse DNA sequences (RAD1-neg, RAD10-neg, and RAD14-neg) using Herculase II Fusion DNA Polymerase (Agilent) to obtain our gene cargos. The yeast genes were amplified from yeast genomic DNA (strains MSY116, MSY119, and MSY121, respectively), and the amplicons were designed to include 228–375 base pairs (bp) of upstream promoter sequence and 299–763 bp of downstream terminator sequence, based on RNA-seq results identifying the start and end of the transcripts (Albert et al. 2014), and utilizing the Saccharomyces Genome Database for DNA sequences (Engel et al. 2025), Genome Release R64-3-1 (Engel et al. 2022). The human genes were amplified from yeast strains that had individual human genes integrated at the homologous yeast loci (strains MSY121, MSY116, and MSY119, respectively); the integrated ERCC4 and ERCC1 cDNA sequences were derived from plasmids gifted from Phil Hieter, and the XPA cDNA sequence was ordered from the hORFeome1.1 collection (Horizon Discovery). We used mouse noncoding DNA for the negative control sequences to minimize homology to the yeast genome. We chose the sequences by scanning intronic regions of the mouse genome for neutral, noncoding DNA whose lengths were comparable to the corresponding NEF-1 gene. We used 3,752 bp of a Tyr intron for RAD1-neg, 1,586 bp of a Dct intron for RAD10-neg, and 1,549 bp of the Tyr promoter for RAD14-neg. C57BL/6J mouse genomic DNA was a gift from Bill Pavan. Each forward primer included a 4-nucleotide hardcoded barcode that was specific to the cargo and a 10-nucleotide stretch of random nucleotides to generate the replicate barcode. Primer sequences are listed in Supplementary Table 3.
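The barcode design described above can be illustrated with a short Python sketch that builds a 14-nucleotide ICR barcode from a 4-nucleotide hardcoded tag and 10 random nucleotides, and estimates how often two independent clones would share the same random component. This is a minimal sketch under stated assumptions: the tag sequences are hypothetical placeholders (the actual primer sequences are in Supplementary Table 3), the tag is assumed to precede the random stretch, all random sequences are assumed to be equally likely, and the function names are illustrative only.

```python
import random

# Hypothetical hardcoded tags; the real 4-nt sequences are in Supplementary Table 3.
HARDCODED_TAGS = {"RAD1": "ACGT", "ERCC4": "TGCA", "RAD1-neg": "GATC"}
RANDOM_LENGTH = 10  # length of the random replicate component


def make_icr_barcode(cargo, rng):
    """Return a 14-nt barcode: 4-nt cargo tag followed by 10 random nucleotides.

    The tag-first order is an assumption made for illustration.
    """
    random_part = "".join(rng.choice("ACGT") for _ in range(RANDOM_LENGTH))
    return HARDCODED_TAGS[cargo] + random_part


def expected_colliding_pairs(n_clones, random_length=RANDOM_LENGTH):
    """Expected number of clone pairs sharing the same random component,
    assuming all 4**random_length sequences are equally likely."""
    n_sequences = 4 ** random_length
    n_pairs = n_clones * (n_clones - 1) / 2
    return n_pairs / n_sequences


rng = random.Random(0)
barcode = make_icr_barcode("RAD1", rng)
print(barcode, len(barcode))
print(4 ** RANDOM_LENGTH)             # 1048576 possible random components
print(expected_colliding_pairs(500))  # about 0.12 colliding pairs among 500 clones
```

Under these assumptions, a pool of roughly 500 clones is expected to contain well under one pair of clones with identical random components, which is why a 10-nucleotide random stretch is ample for pools of this size.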
The 9 ICR plasmid pools were constructed by cloning these amplicons into pYTK096 (Lee et al. 2015) digested with BsaI-HFv2 and MluI. BsaI generated a linear vector fragment, and MluI cut between the 2 BsaI sites to further reduce undigested plasmid background. The PCR primers used to amplify the cargo insert DNA molecules introduced 20 bp of homology to the ends of the BsaI-cut vector. We assembled our 9 inserts into digested pYTK096 using Gibson Assembly (NEB) and plated individually on LB + Kanamycin plates. Colonies from all 9 single-gene ICR libraries were scraped and plasmid DNA was isolated using a Plasmid Plus Maxi kit (QIAGEN).
Next, we assembled our 2-NICR plasmids by combining the first and second genes of NEF-1 in our second round of cloning. Set 1 (RAD1, ERCC4, and RAD1-neg) plasmid pools were digested with AgeI-HF and BstEII-HF (NEB) sequentially and treated with Quick CIP phosphatase (NEB) to generate vector fragments. Set 2 (RAD10, ERCC1, and RAD10-neg) plasmid pools were also digested with AgeI-HF and BstEII-HF to generate insert DNA pools. We separately assembled all 9 possible combinations of fragments using T4 ligase (NEB) and plated on LB + Kanamycin plates. All colonies were scraped and plasmid DNA was isolated using the Plasmid Plus Maxi kit.
We assembled our complete 3-NICR plasmids in our third round of cloning. The 2-NICR (RAD1/RAD10, RAD1/ERCC1, RAD1/RAD10-neg, ERCC4/RAD10, ERCC4/ERCC1, ERCC4/RAD10-neg, RAD1-neg/RAD10, RAD1-neg/ERCC1, RAD1-neg/RAD10-neg) plasmid pools were digested with NheI-HF and SphI-HF (NEB) sequentially and treated with Quick CIP phosphatase to generate the vector fragments. Plasmid pools were also digested with BsiWI-HF (NEB) to remove background vectors from previous rounds of cloning (i.e. ICR plasmids that were not digested or assembled correctly when generating 2-NICR plasmids), as in the “kill cut” strategy described by Poelwijk et al. (2019). Set 3 (RAD14, XPA, and RAD14-neg) plasmids were digested with NheI-HF and SphI-HF to generate the insert DNA pools. We separately assembled all 27 possible combinations of fragments using T4 ligase and plated on LB + Kanamycin plates. All colonies were scraped and plasmid DNA was isolated using the Plasmid Plus Maxi kit.
PacBio long-read sequencing
DNA was isolated from each of our initial 9 ICR plasmid libraries for long-read sequencing to determine whether the hardcoded barcode component was paired with the expected gene cargo. ICR plasmid pool DNA was digested with NotI-HF (NEB) to excise linear fragments carrying the barcodes and gene cargos, from which 2.3 µg of DNA was cleaned twice with Ampure PB beads (Pacific Biosciences). A bead:DNA ratio of 0.45 was used to remove fragments shorter than 3 kilobases. A library was generated using SMRTbell prep kit 3.0 (Pacific Biosciences) through the adapter-barcoded workflow, and sequenced on a Sequel IIe instrument using an 8 M SMRT cell with version 2.0 chemistry (Pacific Biosciences). We used a 2-h pre-extension and a 15-h movie. From the resultant circular consensus sequencing reads, we used custom Python scripts to extract the 14-nucleotide ICR barcode and the associated cargo. For each unique identified barcode, the sequence of the associated cargo was classified as correct (expected error-free gene cargo matching the hardcoded barcode), misassociated (unexpected gene cargo), containing a primer dimer (short primer dimer as cargo), or containing SNPs or indels (expected gene cargo but without perfect matching). Barcodes that were associated with both correct and incorrect cargos were discarded. We calculated the total number of barcodes for each of the 9 genotypes and the percentage of barcodes corresponding to each class. We generated a list of correct ICR barcodes for use in downstream analyses.
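The per-barcode classification logic can be sketched as follows. This is not the custom script used in the study; it is a simplified illustration that assumes each consensus read has already been reduced to a (barcode, cargo class) pair. The class labels, the function name, and the rule of taking the most common class when a barcode has several incorrect classes are assumptions made here. The 10-read threshold is the one reported in Results.

```python
from collections import Counter, defaultdict

MIN_READS = 10  # per-barcode read threshold reported in Results


def classify_barcodes(reads, min_reads=MIN_READS):
    """Assign one cargo class to each ICR barcode.

    reads: iterable of (barcode, cargo_class) pairs, one per consensus read,
    where cargo_class is 'correct', 'misassociated', 'primer_dimer',
    or 'snp_indel' (hypothetical labels).
    """
    per_barcode = defaultdict(Counter)
    for barcode, cargo_class in reads:
        per_barcode[barcode][cargo_class] += 1

    classified = {}
    for barcode, counts in per_barcode.items():
        if sum(counts.values()) < min_reads:
            continue  # too few reads to call this barcode
        if "correct" in counts and len(counts) > 1:
            continue  # seen with both correct and incorrect cargos: discard
        # Assumption: if several incorrect classes occur, keep the most common.
        classified[barcode] = counts.most_common(1)[0][0]
    return classified


# Toy input: four barcodes sharing a hypothetical 'ACGT' tag.
reads = (
    [("ACGT" + "A" * 10, "correct")] * 12
    + [("ACGT" + "C" * 10, "correct")] * 8
    + [("ACGT" + "C" * 10, "primer_dimer")] * 5
    + [("ACGT" + "G" * 10, "primer_dimer")] * 11
    + [("ACGT" + "T" * 10, "correct")] * 3
)
classified = classify_barcodes(reads)
class_totals = Counter(classified.values())
correct_barcodes = {bc for bc, cls in classified.items() if cls == "correct"}
print(class_totals)      # one 'correct' and one 'primer_dimer' barcode remain
print(correct_barcodes)  # list of correct ICR barcodes for downstream use
```

In the toy input, the barcode seen with both correct and primer-dimer cargos is discarded and the barcode with only 3 reads falls below the threshold, leaving one correct and one primer-dimer barcode.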
Illumina sequencing
We used Illumina short read sequencing to characterize our initial 3-NICR library and to measure barcode abundances during the MMS experiment. From either a plasmid prep or yeast genomic DNA, we first amplified the 3-NICR barcode using KAPA HiFi HotStart (Roche) and custom primers. We purified the PCR product using the MinElute PCR Purification Kit (QIAGEN). We next amplified 10 ng of amplicon DNA with Nextera adapter Index primers (Illumina) and KAPA HiFi HotStart. We ran the final PCR product on a 2% SizeSelect Egel (Invitrogen) and collected the correctly sized band. Finished libraries were quantified by Qubit 4 fluorometer (Invitrogen). Libraries were diluted to 4 nM and pooled before sequencing on a NovaSeq 6000 instrument (Illumina).
Illumina reads were paired using PEAR (Zhang et al. 2014) and then, using custom Python scripts, we extracted the entire 3-NICR barcode from each sequence read. From the sequencing output of the plasmid pool, we filtered for 3-NICR barcodes that contained 3 previously identified ICR barcodes to generate a list of correct barcodes for downstream analyses.
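The filtering step can be expressed compactly. The sketch below assumes that each 3-NICR barcode has already been split into its three 14-nucleotide ICR components, because the exact layout of the combined barcode is not specified here. The sequences and the function name are hypothetical.

```python
def filter_nicr_barcodes(nicr_barcodes, correct_icr):
    """Keep 3-NICR barcodes whose three ICR components were all validated.

    nicr_barcodes: iterable of (set1_icr, set2_icr, set3_icr) tuples,
    assumed to have been split out of the sequencing read already.
    correct_icr: set of ICR barcodes classified as correct by long-read sequencing.
    """
    return [
        components
        for components in nicr_barcodes
        if all(icr in correct_icr for icr in components)
    ]


# Hypothetical validated ICR barcodes (4-nt tag + 10 random nt each).
correct_icr = {"ACGTAAAAAAAAAA", "CCGTAAAAAAAAAC", "GGGTAAAAAAAAAG"}

observed = [
    ("ACGTAAAAAAAAAA", "CCGTAAAAAAAAAC", "GGGTAAAAAAAAAG"),  # all three validated
    ("ACGTAAAAAAAAAA", "CCGTAAAAAAAAAC", "GGGTTTTTTTTTTT"),  # third component unknown
]
kept = filter_nicr_barcodes(observed, correct_icr)
print(len(kept), "of", len(observed), "3-NICR barcodes retained")
```

A 3-NICR barcode is retained only if every component appears in the validated list, so one unrecognized ICR barcode is enough to exclude the combined barcode.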
Yeast strain construction
The rad1Δ rad10Δ strain (MSY101) was a gift from Phil Hieter. We constructed rad1Δ rad10Δ rad14Δ by replacing yeast RAD14 in the rad1Δ rad10Δ strain via homology-directed repair with the NatMX cassette. We amplified NatMX from MSp2 via PCR using primers oMM15 and oMM16 and transformed it into the rad1Δ rad10Δ strain using a standard lithium acetate protocol (Becker and Lundblad 2001). RAD14 deletion was confirmed via PCR. We designated the rad1Δ rad10Δ rad14Δ strain MSY125. We next transformed our triple knockout strain with plasmids carrying Gal-Cas9 and a gRNA to target the ura3Δ locus to increase the transformation efficiency of our recombinant NEF-1 cargos, creating strain MSY169.
Yeast transformation
We digested our 27 plasmid pools with NotI-HF, MluI-HF, BsiWI-HF, and SalI-HF and directly transformed this reaction into the rad1Δ rad10Δ rad14Δ strain (MSY169) using the standard lithium acetate transformation protocol. Digestion by NotI-HF generated linear DNA fragments consisting of the 3-NICR barcode, 3 gene cargos, and the URA3 promoter, gene, and terminator, with homology arms targeting integration at the ura3Δ locus. Any contaminating background plasmids that did not receive 1 or more of the 3 components contained “kill cut” restriction sites (MluI-HF, BsiWI-HF, or SalI-HF) between the left and right homology arms (Poelwijk et al. 2019), such that cleavage would block genomic integration of these fragments; all 3 restriction sites were absent from the fully assembled 3-cargo plasmids. A single colony of the rad1Δ rad10Δ rad14Δ strain was grown to an optical density of 0.6 in 500 mL of media, which was used for all 27 transformations, such that any background mutations would either be fixed across genotypes or occur randomly without association to any particular plasmid genotype. Approximately 180 million cells were used for each transformation with 3.5 µg of one of the digested 3-NICR plasmid pools. We selected transformants on CSM -His -Leu -Ura (Sunrise Science Products) + 2% galactose plates, then washed resultant colonies off into 27 individual yeast transformant pools, which were frozen down as glycerol stocks.
Testing MMS sensitivity
We made individual overnight cultures of the 27 yeast pools in 5 mL of CSM -His -Leu -Ura media. The next day, we pooled together an equal number of cells from each overnight culture and used this combined pool to inoculate 300-mL cultures that had 0, 0.005%, 0.01%, 0.015%, or 0.02% MMS to an optical density of 0.025. The cultures were grown at 30°C for 48 h. Every 12 h, we froze down approximately 60 million cells and diluted back to an optical density of 0.025 if they had reached an optical density greater than 0.2. Cells were also frozen immediately after forming the pool. We extracted DNA from the frozen cells using the yeast genomic DNA protocol with the DNeasy Blood and Tissue Kit (QIAGEN), from which we prepared Illumina libraries of the 3-NICR barcodes as above.
Data analysis
Using custom Python and R scripts, we used our list of filtered correct 3-NICR barcodes to track replicates and genotypes across the MMS experiment. For each barcode, we normalized reads to account for differences in sequencing depth across time points and MMS concentrations. First, we calculated the percentage of reads at a given time point and MMS concentration for each correct 3-NICR barcode. To adjust for differences in baseline growth, we then normalized each barcode's frequency to its frequency at the same time point but without MMS. A linear model was fit to the normalized barcode counts across the genotypes at a given time and MMS concentration, and genotypes significantly different from each other were detected using estimated marginal means, as implemented in the R package emmeans (v.1.10.7). Adjusted P-values were calculated using the Benjamini–Hochberg false discovery rate correction method across all time points and MMS concentrations. Significant differences with effect size greater than 0.2 are listed in Supplementary Table 4, with effect size calculated as the model estimate divided by the standard deviation.
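The two normalization steps can be sketched in Python as follows. This illustrates the arithmetic only and is not the script used in the study: the read counts are invented, the function names are hypothetical, and skipping barcodes with no reads in the no-MMS sample is an assumption made here. Fractions are used in place of percentages, which does not change the final ratio.

```python
def read_fractions(counts):
    """Convert raw read counts per barcode into fractions of the sample total."""
    total = sum(counts.values())
    return {barcode: n / total for barcode, n in counts.items()}


def normalize_to_no_mms(mms_counts, control_counts):
    """Divide each barcode's frequency in an MMS sample by its frequency in the
    no-MMS sample from the same time point.

    Assumption: barcodes absent from the no-MMS sample are skipped.
    """
    mms_frac = read_fractions(mms_counts)
    ctrl_frac = read_fractions(control_counts)
    return {
        barcode: mms_frac[barcode] / ctrl_frac[barcode]
        for barcode in mms_frac
        if ctrl_frac.get(barcode, 0) > 0
    }


# Invented counts for three barcodes at a single time point.
mms_counts = {"bc1": 300, "bc2": 100, "bc3": 100}
control_counts = {"bc1": 200, "bc2": 200, "bc3": 100}

normalized = normalize_to_no_mms(mms_counts, control_counts)
for barcode, value in sorted(normalized.items()):
    print(barcode, round(value, 3))  # bc1 ~1.5, bc2 ~0.5, bc3 ~1.0
```

A normalized value above 1 means the barcode makes up a larger share of the MMS culture than of the no-MMS culture at that time point, and a value below 1 means a smaller share.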
For the barcode downsampling analysis, we used a custom R script to simulate downsampling 3-NICR barcodes from the 0.005%, 0.01%, and 0.015% MMS concentrations. For each tested set size (ranging from 4 to 150 3-NICR barcodes per genotype), we randomly sampled unique 3-NICR barcodes from the 0 MMS, 48-h time point and tracked their relative abundance at the relevant MMS concentration at the same time point. A linear model was fitted to the normalized barcodes for each downsampling to detect significant differences between YYY and other genotypes and to calculate the effect size. The analysis was repeated 500 times for each tested number of 3-NICR barcodes. Across all simulations, the frequency with which we were able to detect a statistically significant difference in growth between YYY and NNN with an effect size > 0.2 was determined. Ninety-five percent confidence intervals were calculated using a binomial test through the Clopper–Pearson method.
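A simplified version of the downsampling procedure and the Clopper–Pearson interval is sketched below in Python, using only the standard library. Several parts are stand-ins and should be read as assumptions: the normalized abundances are synthetic, the detection criterion is a standardized mean difference greater than 0.2 rather than the linear model and adjusted P-value used in the actual analysis, and all function names are hypothetical. The exact binomial interval is found by bisection on the binomial tail probabilities.

```python
import math
import random
import statistics


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided binomial confidence interval, solved by bisection."""

    def bisect(is_below_root):
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if is_below_root(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: p at which P(X >= k | p) equals alpha / 2.
    lower = 0.0 if k == 0 else bisect(lambda p: 1 - binom_cdf(k - 1, n, p) < alpha / 2)
    # Upper bound: p at which P(X <= k | p) equals alpha / 2.
    upper = 1.0 if k == n else bisect(lambda p: binom_cdf(k, n, p) > alpha / 2)
    return lower, upper


def effect_size_detects(yyy, nnn, threshold=0.2):
    """Stand-in criterion: standardized mean difference above the threshold."""
    pooled_sd = math.sqrt((statistics.variance(yyy) + statistics.variance(nnn)) / 2)
    return (statistics.mean(yyy) - statistics.mean(nnn)) / pooled_sd > threshold


def detection_frequency(yyy_values, nnn_values, n_barcodes, n_sim=500, seed=0):
    """Repeatedly sample n_barcodes per genotype and count detections."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        yyy = rng.sample(yyy_values, n_barcodes)
        nnn = rng.sample(nnn_values, n_barcodes)
        hits += effect_size_detects(yyy, nnn)
    return hits, n_sim


# Synthetic normalized abundances for the two genotypes (not real data).
data_rng = random.Random(1)
yyy_pool = [data_rng.gauss(1.5, 0.5) for _ in range(300)]
nnn_pool = [data_rng.gauss(1.0, 0.5) for _ in range(300)]

hits, n_sim = detection_frequency(yyy_pool, nnn_pool, n_barcodes=4)
lower, upper = clopper_pearson(hits, n_sim)
print(f"detected in {hits}/{n_sim} simulations, 95% CI {lower:.3f} to {upper:.3f}")
```

Repeating the call to `detection_frequency` over a range of `n_barcodes` values would trace out a power curve of the kind described above, with the exact interval giving the uncertainty of each estimated detection frequency.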
Results
Generation of highly replicated combinatorial genotypes
We developed a strategy to create a plasmid library of combinatorial genotypes such that each plasmid carries a barcode specifying its genotype and replicate information. For the 3-subunit NEF-1 complex (Guzder et al. 1996), we sought to make all possible combinations of yeast, human, or null subunits, for 27 possible genotypes, with each genotype linked to many unique barcodes.
We started by associating our 9 genetic cargos—yeast, human, or null alleles for each of the 3 NEF-1 subunits—to DNA barcodes that we refer to as ICR barcodes (identification combined with replication barcode; Fig. 1a). The ICR barcodes contain a 4-nucleotide hardcoded sequence such that the cargo could be determined directly from the barcode and a random 10-nucleotide sequence for marking independent replicates. The ICR barcode is introduced via a primer used to PCR-amplify the cargo for cloning onto a plasmid. We aimed for 500 transformed Escherichia coli colonies per cargo, from which we generated 9 single-gene plasmid pools (Fig. 1a).
Fig. 1.
Strategy for generating highly replicated barcoded genotypes. a) We generated 9 single-gene (ICR) plasmid libraries, each carrying a yeast, human, or null gene of the NEF-1 complex. Each plasmid library has a 4-nucleotide genotype-specific sequence and a random 10-nucleotide sequence for tracking replicates. The yeast and human orthologs are under the control of the promoter and terminator of the yeast homolog. We combined the plasmid libraries into 27 three-gene NICR libraries. b) First, the Set 2 plasmid library barcodes and cargos were cloned between the barcodes and cargos of the Set 1 plasmid libraries to generate 2-NICR libraries. c) We then repeated with the Set 3 barcodes and cargos to produce 3-NICR libraries. Gene diagrams are not to scale. d) Priming sequences flank the NICR barcode for NGS of all fixed and replicate barcodes.
We then sequentially combined the single-gene libraries to create plasmids carrying all 27 possible combinations of human, yeast, or null genes for the 3 components of the NEF-1 complex. The 9 single-gene libraries were categorized into 3 sets: Set 1 included libraries containing RAD1, its human ortholog ERCC4, and RAD1-neg; Set 2 consisted of the RAD10, ERCC1, and RAD10-neg libraries; and Set 3 comprised the RAD14, XPA, and RAD14-neg libraries. The sets were designed so that the barcode + gene fragment of Sets 2 and 3 could be excised by restriction digestion and inserted between the barcode and gene of Sets 1 and 2, respectively. Thus, in each insertion step, the barcodes are placed adjacent to one another, creating larger combined barcodes that specify all the genes encoded on the plasmid. Set 1 and Set 2 were first combined to generate the 2-NICR plasmid library (Fig. 1b), which was then combined with the Set 3 library (Fig. 1c). The final 3-NICR plasmids contain a 54-nucleotide barcode that reports the identities of the 3 associated genes through the hardcoded components and denotes independent replicates through the random components. The final barcode is flanked by priming sites for amplification and NGS genotyping (Fig. 1d). We aimed for 500 colonies for each of the 27 combinatorial plasmid libraries.
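The combinatorial design can be enumerated in a few lines of Python. The sketch below uses the three-letter code from the figures (Y for yeast, H for human, N for null, in the order Set 1, Set 2, Set 3) and the gene names given above. The variable names are illustrative only.

```python
from itertools import product

# One dictionary per set, mapping the genotype letter to the cargo name.
SETS = [
    {"Y": "RAD1", "H": "ERCC4", "N": "RAD1-neg"},    # Set 1
    {"Y": "RAD10", "H": "ERCC1", "N": "RAD10-neg"},  # Set 2
    {"Y": "RAD14", "H": "XPA", "N": "RAD14-neg"},    # Set 3
]

# All 3 x 3 x 3 combinations, keyed by their three-letter code.
genotypes = {
    "".join(code): tuple(cargo_set[letter] for cargo_set, letter in zip(SETS, code))
    for code in product("YHN", repeat=3)
}

print(len(genotypes))    # 27
print(genotypes["YYY"])  # ('RAD1', 'RAD10', 'RAD14')
print(genotypes["YHN"])  # ('RAD1', 'ERCC1', 'RAD14-neg')
```

Each code therefore names one of the 27 plasmid pools, with YYY denoting the fully yeast complex and NNN the fully null control.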
Highly replicated libraries contain correct genotypes and associated barcodes
We used high-fidelity long read circular consensus sequencing to perform a quality check of the contents of our initial ICR libraries by sequencing the barcodes and their adjacent genes. This allowed us to measure the rate of various cloning errors and identify barcodes associated with incorrect constructs. By creating a list of barcodes to ignore in downstream analyses, we addressed a difficulty inherent to pooled experiments, where physically removing incorrect plasmids is not feasible. Assuming the proportion of incorrect plasmids is low, the burden of carrying these plasmids through the experiment and sequencing them is minimal.
From approximately 1 million reads, we identified 317 to 2,031 unique barcodes for each single-gene library (Fig. 2a), using a threshold of 10 reads per barcode. We analyzed the gene sequences associated with each barcode and found that the majority (68–99%) were perfectly correct (Fig. 2a). The most common error observed was primer dimers being cloned in place of the amplified gene cargos. The maximum observed proportion of primer dimers was 31% for RAD1-neg, followed by 11% for RAD1 and 10% for RAD10; in the remaining 6 plasmid pools, primer dimers accounted for less than 5% of cloned inserts.
Fig. 2.
Characterization of ICR and NICR plasmid pools. a) The single-gene ICR plasmids were sequenced by long-read sequencing. Gene cargos were determined to be correct or incorrect. Incorrect cargos included those that contained the wrong insert (primer dimers or the wrong gene) or that had mutant versions of the right insert (SNPs, indels). b) The frequency of hardcoded barcodes with the correct associated gene and with each other possible but wrong cargo. c) The percent of barcodes for each 3-NICR genotype for which each 3-NICR barcode contained ICR barcodes that are associated with a correct or incorrect cargo. Each 3-letter code corresponds to the combination of gene cargos, in the order of Set 1, Set 2, Set 3 (Set 1: RAD1 (Y), ERCC4 (H), or RAD1-neg (N); Set 2: RAD10 (Y), ERCC1 (H), or RAD10-neg (N); Set 3: RAD14 (Y), XPA (H), or RAD14-neg (N); H: human cargo, Y: yeast cargo, N: null cargo).
Spontaneous variants can arise during PCR amplification of cargos for cloning, including single-nucleotide variants or insertion/deletion (indel) variants. We found such mutations in approximately 0.5% of cargos, mostly in the form of small deletions at the very start or end of the cargo sequence (Supplementary Table 5), likely resulting from errors during synthesis of the primer used to amplify the homologs. Since the primer binding sites were chosen to be far upstream and downstream of the coding sequences, these mutations are unlikely to affect gene function, and we opted not to remove the associated barcodes from downstream analyses. Outside the primer regions, we found very few instances of SNPs and indels, affecting less than 0.05% of cargos.
Barcodes were occasionally associated with incorrect genes, affecting less than 0.01% of barcodes (Fig. 2b). The most we observed was 4 cases of the ERCC1 hardcoded barcode sequence linked to RAD10 gene cargo. Although all 9 single-gene libraries were cloned separately, this incorrect association could have resulted from barcode primer cross-contamination, template DNA cross-contamination, or template-switching during long-read sequencing.
Next, we investigated the proportion of barcodes associated with cargos with errors in the final 3-gene plasmid pools using short read sequencing of the 3-NICR barcodes. Primer dimers were the largest population of erroneous cargos in the ICR pools. To minimize their presence, we used gel-electrophoresis-based size selection during NICR library cloning. We assessed the effectiveness of primer dimer removal (Supplementary Fig. 1) and found that the abundance of barcodes corresponding to primer dimers was substantially reduced (Fig. 2c). Among all constructs, RAD1-neg had the largest proportion of primer dimer-associated 3-NICR barcodes, likely because it was always part of the “vector” during the nested cloning (Fig. 1, b and c), causing its correct cargos and primer dimer cargos to migrate similarly on a gel. Still, the representation of primer dimer barcodes for RAD1-neg was lower in the 3-NICR pools than predicted from their representation in the ICR pool. Overall, the 3-NICR plasmid pools mostly contained barcodes associated with correct cargos (Fig. 2c); for 96% of 3-NICR barcodes, all 3 component ICR barcodes corresponded to correct cargos.
The long-read sequencing showed high agreement between the expected and actual cargo for the ICR cargos. The proportion of correct ICR barcodes was increased in the 3-NICR final library. While we generally recommend long-read sequencing to confirm the identity of the cargo associated with each ICR barcode, it could be skipped, depending on the requirements of the application, provided that the hardcoded barcode component is included.
Testing chimeric NEF-1 complexes in yeast
Next, we tested whether the highly replicated plasmid pools could be used to distinguish phenotypes of chimeric NEF-1 complexes in a pooled experiment. The nucleotide excision repair pathway is required to repair bulky lesions in DNA (Schärer 2013). Following bulky lesion recognition, Rad14 recruits Rad1 and Rad10 to excise the damaged DNA strand, which is then replaced through DNA synthesis. Nucleotide excision repair is required for yeast survival of the genotoxic drug methyl methanesulfonate (MMS) (Prakash and Prakash 1977). Past work has shown that the function of Rad1 and Rad10 in MMS survival can be partially replaced by their human homologs, ERCC4 (also known as XPF) and ERCC1, either individually or together (Hamza et al. 2020), while the ability to replace Rad14 with XPA has not been previously tested. To test the functionality of all combinations of yeast or human homologs of the 3 NEF-1 subunits, we integrated the 27 3-NICR plasmid pools into the ura3Δ locus in rad1Δ rad10Δ rad14Δ yeast, pooled all transformants together, and grew them in media containing 0, 0.005%, 0.01%, 0.015%, or 0.02% MMS. To understand the effect of MMS on each genotype's growth, we collected cells every 12 h for 48 h and then sequenced the 3-NICR barcode by NGS to quantify abundance of each genotype over time.
In 0.005%, 0.01%, and 0.015% MMS, cells with the yeast homologs of all 3 NEF-1 genes had significantly superior growth to cells with any other genotype (Fig. 3; Supplementary Table 4). The superior growth was apparent at 24, 36, and 48 h in 0.005% and 0.01% MMS, while in 0.015% MMS it took 48 h to appear, likely because in 0.015% MMS growth was generally poor (Supplementary Fig. 2). Growth was even poorer in 0.02% MMS, and we did not detect differences in growth at any time point. When looking at the individual replicates, strains containing all yeast homologs have visibly superior growth in 0.005% and 0.01% MMS by 48 h (Supplementary Fig. 3).
Fig. 3.
3-NICR yeast library growth in MMS. 3-NICR libraries were grown in media containing 0.005%, 0.01%, 0.015%, and 0.02% MMS. Samples were collected every 12 h for 48 h, and the abundance of each 3-NICR barcode was measured by NGS as a proxy for cell growth. Each 3-NICR barcode abundance was normalized to the total read count in that sample and then normalized to its abundance in media without MMS from the same time point. Each line represents the average normalized abundance across all replicates for a single genotype. Genotype identities are specified as in the 3-letter code used in Fig. 2c. Error bars signify the standard error of the mean.
Surprisingly, all of the genotypes composed of a mix of yeast and human subunits had poor growth, comparable to the genotypes lacking NEF-1 subunits, whereas a previous study found that the human homologs of RAD1 and RAD10 could partially confer MMS resistance. The difference in results could be due to the construction of the experimental strain; for example, encoding the genes from the URA3 locus or in an array could affect their expression levels. Additionally, while we used a strain derived from the one tested in the original study, the deletions of RAD1 and RAD10, together with our deletion of RAD14, result in a mutator phenotype (Kunz et al. 1990) and could have led to additional mutations that affected the phenotype. Alternatively, differences in the experimental setup, such as the source of MMS used, or the co-culture of genotypes in our study versus separate growth in the original study, could contribute to the observed discrepancies.
Replication barcodes allow detection of small effects
Since we used random DNA barcodes generated during cloning to track biological replicates, the number of replicates scaled with the number of colonies resulting from a plasmid cloning reaction, which allowed us to generate hundreds of replicate samples. Next, we sought to determine the minimum number of replicates that would have been needed to detect the observed phenotypic differences. We randomly selected subsets of 3-NICR barcodes from the 0.005%, 0.01%, and 0.015% MMS data and determined the frequency with which we were able to detect a growth difference between strains with yeast NEF-1 compared to fully null NEF-1 (Fig. 4). In 0.01% MMS, in which yeast NEF-1 conferred the largest phenotypic benefit, we had over 90% power to detect this difference with 4 barcodes per genotype, while in 0.005% MMS and 0.015% MMS, approximately 25 and 140 barcodes were required, respectively. These numbers of barcodes are all accessible for a pooled experiment, whereas for a non-pooled experiment, the number of replicates would likely limit detection to the strongest phenotype. These results demonstrate the power of NICR barcoding to fully capture genotype–phenotype relationships with high statistical confidence by leveraging the scalability and efficiency of pooled experiments.
Fig. 4.
Downsampling analysis of 3-NICR barcodes. For each downsample (ranging from 4–150 3-NICR barcodes per genotype), unique 3-NICR barcodes were randomly sampled and their abundance in MMS media relative to media without MMS at the final time point was compared. We repeated the analysis 500 times for each MMS concentration. We determined the frequency that the yeast NEF-1 genotype had a statistically significant improvement in growth compared to fully null NEF-1. Statistically significant improved growth was defined as the adjusted P < 0.05 and effect size > 0.2. Error bars signify the 95% confidence interval.
Discussion
Replication is essential for statistical confidence in reverse genetics experiments. However, designing an experiment with replication presents 2 major challenges, both of which are heightened in experiments studying genotype combinations. The first challenge is that replication can result in a large number of total samples. This often leads researchers to either choose a subset of genotypes for comparison or use a limited number of replicates. Even with these adjustments, the number of samples can be burdensome. For instance, if we had individually generated the 8 human-yeast chimeric NEF-1 genotypes plus a negative control, each with 4 replicates, we would still have 36 samples to carry through 5 concentrations of MMS. This would be challenging experimentally while providing limited statistical power. Generating and studying samples in pools alleviates this difficulty. In our approach, replicates are generated as distinct colonies during a cloning reaction. Instead of selecting individual colonies, all the colonies are collected together, allowing a single cloning reaction to generate dozens to hundreds of replicates. Then, the phenotypic assay is done simultaneously on the entire pool of replicates, which means that increasing the number of replicates increases the scale of the assay rather than the number of assays. Finally, DNA barcodes can easily be designed to distinguish millions of samples (Johnson et al. 2023), and a single NGS run can generate billions of reads for phenotypic determination.
The second challenge is to generate replicates in a way that avoids the replicates sharing background mutations that are absent from the other genotypes. If a single mutant cell line is generated and then split to produce replicates, any background mutation present in the original cell line would be shared among all replicates. The background mutations could cause phenotypic effects that would be mistakenly attributed to the targeted genetic modification rather than to these unintended mutations. Such background effects could arise from point mutations, aneuploidies, copy number variation, or even nonchromosomal heritable effects. Avoiding shared background is particularly difficult when generating combinatorial genotypes: If a genotype is generated in multiple sequential steps, such as knocking out several genes, then, to avoid shared background mutations, all replicates of all genotypes would need to be independently generated from the first step. This imposes a substantial experimental burden even with a modest number of replicates, and becomes even more demanding when aiming for high replication to enhance statistical power. Moreover, producing replicates independently from the first stage of genotype construction increases the likelihood that each replicate will acquire unique background mutations, leading to increased phenotypic variation among the replicates and decreased power to detect effects resulting from the constructed genotype.
A solution to this paradox is to have all the lines across all genotypes and replicates share a common genetic background as late as possible during genotype construction. In the approach described here, we transformed a single receiver yeast strain with a library of different genotypes delivered by integrating plasmids. Thus, most background mutations in the receiver strain will be constant among all the genotypes, while even de novo background mutations present heterogeneously in the receiver strain culture will be randomly distributed among the genotypes. We also minimized the possibility of local background mutations carried on the NICR plasmids by sequence-verifying cargos by long-read sequencing.
Expanding this approach to a larger number of genes could be eased through simple modifications of the cloning strategy. For our cloning method, we sought restriction enzymes that did not cut within any of the cargo genes and were not duplicated across cloning steps. Additionally, we included a “kill-cut” to remove intermediate background plasmids (Poelwijk et al. 2019). As a result, we needed 9 total restriction enzymes to insert the 3 cargo pools. While such enzymes were readily identified for this set of cargo genes, they would become increasingly hard to find as the number of cargos or cloning steps increases. This constraint could be relaxed by designing libraries using restriction enzymes with compatible sticky ends (Wong et al. 2016) or type IIS restriction enzymes, both of which enable the restriction sites to be removed from the cloning product, such that the restriction enzymes can be reused in later rounds of cloning. Even so, the selected restriction sites must still be absent or removed from the gene cargos. Another potential limitation is that we chose to clone each combination of genotypes individually and then combined all 27 pools into the final pool. Pooling all cargos at each successive cloning step could facilitate a much larger number of gene combinations. While adjusting the cloning method can minimize these constraints and allow more cargos to be included, library size would ultimately be limited by plasmid size and stability, transformation efficiency, or the length or quantity of reads from barcode sequencing.
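The first screening criterion above, that an enzyme must not cut within any cargo gene, can be illustrated with a short sketch. The recognition sequences are the standard ones for these enzymes, but the cargo sequences and function names are hypothetical. The sketch ignores degenerate recognition sites, the plasmid backbone, and the requirement that enzymes not be duplicated across cloning steps.

```python
# Standard recognition sequences (5'->3'); BsaI is a type IIS enzyme
# with a non-palindromic site, so both strands must be checked.
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
    "NotI": "GCGGCCGC",
    "BsaI": "GGTCTC",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_site(site, seq):
    """True if the recognition site occurs on either strand of seq."""
    seq = seq.upper()
    return site in seq or reverse_complement(site) in seq


def usable_enzymes(enzymes, cargos):
    """Names of enzymes whose site is absent from every cargo sequence."""
    return sorted(
        name for name, site in enzymes.items()
        if not any(has_site(site, seq) for seq in cargos.values())
    )


# Hypothetical toy cargos: geneA carries an EcoRI site, and geneB carries
# a BsaI site on the reverse strand (GAGACC).
TOY_CARGOS = {
    "geneA": "ATGGAATTCAAATGA",
    "geneB": "ATGGAGACCTTTTAA",
}
```

Running `usable_enzymes(ENZYMES, TOY_CARGOS)` on these toy inputs leaves BamHI, HindIII, and NotI as candidates. With type IIS or compatible-end designs, the same check would apply to a smaller, reusable set of enzymes.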
The NICR method could be expanded to combine large numbers of variants per gene. For example, this same pooled approach could be used to combine deep mutational scanning libraries to study the effects of combinatorial genetic variants in high throughput. Sampling from this space would reveal epistasis and functional constraints across combinations of mutations. However, the number of possible unique genotypes increases exponentially with the number of variable positions, and accommodating an increase in genotypes would likely limit the number of replicates for each genotype.
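The trade-off between genotype number and replication can be made concrete with a small calculation. The sketch below assumes a full factorial design and an even split of barcoded transformants among genotypes; the function names and barcode totals are hypothetical and are not values from this study.

```python
import math


def genotype_count(variants_per_gene):
    """Number of combinatorial genotypes in a full factorial design."""
    return math.prod(variants_per_gene)


def mean_replicates(total_barcodes, variants_per_gene):
    """Average barcoded replicates per genotype, assuming an even split."""
    return total_barcodes / genotype_count(variants_per_gene)


# Three variants (yeast, human, null) at each of 3 genes gives 27 genotypes.
three_gene_design = genotype_count([3, 3, 3])

# A hypothetical library with 20 variants at each of 4 positions.
large_design = genotype_count([20, 20, 20, 20])
```

Under these assumptions, a fixed pool of 2,700 barcodes would average 100 replicates per genotype in the 27-genotype design, but far fewer than 1 per genotype in the 160,000-genotype design.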
Pooled phenotyping requires the abundance of cells in the final pool to reflect their phenotype. This is straightforward when the phenotype has an effect on fitness, as fitter strains will grow faster than less fit strains and thereby have higher representation in the pool. In addition, assays have been developed for measuring many other phenotypes through a cellular abundance readout. Many protein-based phenotypes, including protein expression, enzymatic activity, or protein–protein interaction, can be quantitatively assayed in a pool using approaches such as fluorescence-activated cell sorting to convert non-fitness phenotypes to differences in cellular abundance (Adams et al. 2016; Matreyek et al. 2018; Amorosi et al. 2021). Single-cell sequencing has enabled genome-wide phenotypes to be measured for distinct genotypes in a pool, such as transcriptome or chromatin accessibility perturbations (Adamson et al. 2016; Rubin et al. 2019). Thus, a large diversity of phenotypes can be studied with NICR barcodes. Furthermore, the NICR barcode approach can be applied across a variety of organismal systems beyond yeast; it can be applied to any system in which DNA can be introduced and individual phenotypes measured in a pool. We envision that it would be particularly useful for cell culture experiments, as heterogeneity in cell lines is common (Zhu et al. 2023). The NICR approach can even be utilized beyond cell-based assays, as strategies have been developed to carry out pooled genetic experiments in multicellular organisms such as Caenorhabditis elegans and zebrafish (Parvez et al. 2021; Stevenson et al. 2023).
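As an illustration of an abundance-based readout, the sketch below converts barcode read counts from a treated and a control pool into per-barcode log2 enrichment scores. It is a generic calculation under stated assumptions, not the analysis used in this study: the pseudocount, the normalization to total reads, and the function and barcode names are all hypothetical.

```python
import math


def log2_enrichment(treated_counts, control_counts, pseudocount=0.5):
    """Per-barcode log2 ratio of read frequency in treated vs control pools.

    Counts are dicts mapping barcode -> read count. A pseudocount is added
    to every barcode so that barcodes with zero reads do not produce
    undefined logarithms.
    """
    barcodes = sorted(set(treated_counts) | set(control_counts))
    treated = {b: treated_counts.get(b, 0) + pseudocount for b in barcodes}
    control = {b: control_counts.get(b, 0) + pseudocount for b in barcodes}
    treated_total = sum(treated.values())
    control_total = sum(control.values())
    return {
        b: math.log2((treated[b] / treated_total) / (control[b] / control_total))
        for b in barcodes
    }


# Hypothetical read counts for two barcodes.
example_scores = log2_enrichment(
    {"bcA": 300, "bcB": 100},
    {"bcA": 200, "bcB": 200},
    pseudocount=0,
)
```

In this toy example, `bcA` rises from half to three quarters of the reads and receives a positive score, while `bcB` falls from half to one quarter and receives a score of -1. Scores from all barcodes that share a genotype could then be compared between genotypes.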
Acknowledgments
We thank all members of the Sadhu lab for helpful discussions. We thank Phil Hieter for plasmids. PacBio and Illumina sequencing was performed at the NIH Intramural Sequencing Center. This work utilized the computational resources provided by the NIH HPC Biowulf Cluster (http://hpc.nih.gov).
Contributor Information
Molly Monge, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Simone M Giovanetti, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Apoorva Ravishankar, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Meru J Sadhu, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Data availability
Strains and plasmids are available upon request, and are described in Supplementary Tables 1 and 2, respectively. Supplementary Tables 1–4 are provided in Supplementary File 1, and all Supplementary Figures and Supplementary Table 5 are provided in Supplementary File 2. Code has been deposited on GitHub with a permanent repository on Zenodo available at DOI: 10.5281/zenodo.15642815. Sequencing read data used in this study is available at the Sequence Read Archive with BioProject accession PRJNA1273127.
Supplemental material available at G3 online.
Funding
This research was supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health (1ZIAHG200401).
Literature cited
- Adams RM, Mora T, Walczak AM, Kinney JB. 2016. Measuring the sequence-affinity landscape of antibodies with massively parallel titration curves. eLife. 5:e23156. doi: 10.7554/elife.23156.
- Adamson B, Norman TM, Jost M, Cho MY, Nuñez JK, Chen Y, Villalta JE, Gilbert LA, Horlbeck MA, Hein MY, et al. 2016. A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell. 167(7):1867–1882.e21. doi: 10.1016/j.cell.2016.11.048.
- Albert FW, Treusch S, Shockley AH, Bloom JS, Kruglyak L. 2014. Genetics of single-cell protein abundance variation in large yeast populations. Nature. 506(7489):494–497. doi: 10.1038/nature12904.
- Amorosi CJ, Chiasson MA, McDonald MG, Wong LH, Sitko KA, Boyle G, Kowalski JP, Rettie AE, Fowler DM, Dunham MJ. 2021. Massively parallel characterization of CYP2C9 variant enzyme activity and abundance. Am J Hum Genetics. 108(9):1735–1751. doi: 10.1016/j.ajhg.2021.07.001.
- Avery L, Wasserman S. 1992. Ordering gene function: the interpretation of epistasis in regulatory hierarchies. Trends Genet. 8(9):312–316. doi: 10.1016/0168-9525(92)90263-4.
- Becker DM, Lundblad V. 2001. Introduction of DNA into yeast cells. Curr Protoc Mol Biol. Chapter. 13(1):Unit13.7. doi: 10.1002/0471142727.mb1307s27.
- Bendel AM, Faure AJ, Klein D, Shimada K, Lyautey R, Schiffelholz N, Kempf G, Cavadini S, Lehner B, Diss G. 2024. The genetic architecture of protein interaction affinity and specificity. Nat Commun. 15(1):8868. doi: 10.1038/s41467-024-53195-4.
- Brideau NJ, Flores HA, Wang J, Maheshwari S, Wang X, Barbash DA. 2006. Two Dobzhansky–Muller genes interact to cause hybrid lethality in Drosophila. Science. 314(5803):1292–1295. doi: 10.1126/science.1133953.
- Engel SR, Aleksander S, Nash RS, Wong ED, Weng S, Miyasato SR, Sherlock G, Cherry JM. 2025. Saccharomyces Genome Database: advances in genome annotation, expanded biochemical pathways, and other key enhancements. Genetics. 229(3):iyae185. doi: 10.1093/genetics/iyae185.
- Engel SR, Wong ED, Nash RS, Aleksander S, Alexander M, Douglass E, Karra K, Miyasato SR, Simison M, Skrzypek MS, et al. 2022. New data and collaborations at the Saccharomyces Genome Database: updated reference genome, alleles, and the Alliance of Genome Resources. Genetics. 220(4):iyab224. doi: 10.1093/genetics/iyab224.
- Guzder SN, Sung P, Prakash L, Prakash S. 1996. Nucleotide excision repair in yeast is mediated by sequential assembly of repair factors and not by a pre-assembled repairosome. J Biol Chem. 271(15):8903–8910. doi: 10.1074/jbc.271.15.8903.
- Hamza A, Driessen MRM, Tammpere E, O’Neil NJ, Hieter P. 2020. Cross-species complementation of nonessential yeast genes establishes platforms for testing inhibitors of human proteins. Genetics. 214(3):735–747. doi: 10.1534/genetics.119.302971.
- Horovitz A. 1996. Double-mutant cycles: a powerful tool for analyzing protein structure and function. Fold Des. 1(6):R121–R126. doi: 10.1016/s1359-0278(96)00056-9.
- Johnson MS, Venkataram S, Kryazhimskiy S. 2023. Best practices in designing, sequencing, and identifying random DNA barcodes. J Mol Evol. 91(3):263–280. doi: 10.1007/s00239-022-10083-z.
- Kunz BA, Kohalmi L, Kang XL, Magnusson KA. 1990. Specificity of the mutator effect caused by disruption of the RAD1 excision repair gene of Saccharomyces cerevisiae. J Bacteriol. 172(6):3009–3014. doi: 10.1128/jb.172.6.3009-3014.1990.
- Lee ME, DeLoache WC, Cervantes B, Dueber JE. 2015. A highly characterized yeast toolkit for modular, multipart assembly. ACS Synth Biol. 4(9):975–986. doi: 10.1021/sb500366v.
- Liu Z, Miller D, Li F, Liu X, Levy SF. 2020. A large accessory protein interactome is rewired across environments. eLife. 9:e62365. doi: 10.7554/elife.62365.
- Matreyek KA, Starita LM, Stephany JJ, Martin B, Chiasson MA, Gray VE, Kircher M, Khechaduri A, Dines JN, Hause RJ, et al. 2018. Multiplex assessment of protein variant abundance by massively parallel sequencing. Nat Genet. 50(6):874–882. doi: 10.1038/s41588-018-0122-z.
- Nowak MA, Boerlijst MC, Cooke J, Smith JM. 1997. Evolution of genetic redundancy. Nature. 388(6638):167–171. doi: 10.1038/40618.
- Parvez S, Herdman C, Beerens M, Chakraborti K, Harmer ZP, Yeh J-RJ, MacRae CA, Yost HJ, Peterson RT. 2021. MIC-Drop: a platform for large-scale in vivo CRISPR screens. Science. 373(6559):1146–1151. doi: 10.1126/science.abi8870.
- Poelwijk FJ, Socolich M, Ranganathan R. 2019. Learning the pattern of epistasis linking genotype and phenotype in a protein. Nat Commun. 10(1):4213. doi: 10.1038/s41467-019-12130-8.
- Prakash L, Prakash S. 1977. Isolation and characterization of MMS-sensitive mutants of Saccharomyces cerevisiae. Genetics. 86(1):33–55. doi: 10.1093/genetics/86.1.33.
- Rubin AJ, Parker KR, Satpathy AT, Qi Y, Wu B, Ong AJ, Mumbach MR, Ji AL, Kim DS, Cho SW, et al. 2019. Coupled single-cell CRISPR screening and epigenomic profiling reveals causal gene regulatory networks. Cell. 176(1–2):361–376.e17. doi: 10.1016/j.cell.2018.11.022.
- Schärer OD. 2013. Nucleotide excision repair in eukaryotes. Cold Spring Harb Perspect Biol. 5(10):a012609. doi: 10.1101/cshperspect.a012609.
- Shoemaker DD, Lashkari DA, Morris D, Mittmann M, Davis RW. 1996. Quantitative phenotypic analysis of yeast deletion mutants using a highly parallel molecular bar-coding strategy. Nat Genet. 14(4):450–456. doi: 10.1038/ng1296-450.
- Stevenson ZC, Moerdyk-Schauwecker MJ, Banse SA, Patel DS, Lu H, Phillips PC. 2023. High-throughput library transgenesis in Caenorhabditis elegans via Transgenic Arrays Resulting in Diversity of Integrated Sequences (TARDIS). eLife. 12:RP84831. doi: 10.7554/elife.84831.
- Weile J, Roth FP. 2018. Multiplexed assays of variant effects contribute to a growing genotype–phenotype atlas. Hum Genet. 137(9):665–678. doi: 10.1007/s00439-018-1916-x.
- Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K, Andre B, Bangham R, Benito R, Boeke JD, Bussey H, et al. 1999. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science. 285(5429):901–906. doi: 10.1126/science.285.5429.901.
- Wong ASL, Choi GCG, Cui CH, Pregernig G, Milani P, Adam M, Perli SD, Kazer SW, Gaillard A, Hermann M, et al. 2016. Multiplexed barcoded CRISPR–Cas9 screening enabled by CombiGEM. Proc Natl Acad Sci USA. 113(9):2544–2549. doi: 10.1073/pnas.1517883113.
- Zhang J, Kobert K, Flouri T, Stamatakis A. 2014. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics. 30(5):614–620. doi: 10.1093/bioinformatics/btt593.
- Zhu Q, Zhao X, Zhang Y, Li Y, Liu S, Han J, Sun Z, Wang C, Deng D, Wang S, et al. 2023. Single cell multi-omics reveal intra-cell-line heterogeneity across human cancer cell lines. Nat Commun. 14(1):8170. doi: 10.1038/s41467-023-43991-9.