Abstract
Deep sequencing offers an unprecedented view of an organism's genome. We describe the spectrum of mutations induced by three commonly used mutagens: ethyl methanesulfonate (EMS), N-ethyl-N-nitrosourea (ENU), and ultraviolet trimethylpsoralen (UV/TMP) in the nematode Caenorhabditis elegans. Our analysis confirms the strong GC to AT transition bias of EMS. We found that ENU mainly produces A to T and T to A transversions, but also all possible transitions. We found no bias for any specific transition or transversion in the spectrum of UV/TMP-induced mutations. In 10 mutagenized strains we identified 2723 variants, of which 508 are expected to alter or disrupt gene function, including 21 nonsense mutations and 10 mutations predicted to affect mRNA splicing. This translates to an average of 50 informative mutations per strain. We also present evidence of genetic drift among laboratory wild-type strains derived from the Bristol N2 strain. We make several suggestions for best practice using massively parallel short read sequencing to ensure mutation detection.
MUTAGENESIS and the screening for mutants have long been a key tool of the practicing geneticist. The early work of T. H. Morgan and his colleagues relied on recovery of spontaneous mutations, which was limiting for the study of inheritance due to their infrequent occurrence (Morgan et al. 1922; also see Sturtevant 1965). The discovery by H. J. Muller and others that X rays cause mutations ushered in the era of inducing mutations (Muller 1927). There is a long history of studies on mutagen specificity, both in prokaryotes and in eukaryotes, and today many mutagens are utilized in a variety of model organisms. In this article we use whole-genome deep sequencing in the model organism Caenorhabditis elegans to explore the types and frequencies of mutations induced by various mutagens and to document the feasibility of global identification of mutations.
The mutagenic properties of ethyl methanesulfonate (EMS) were first demonstrated using the T4 viral system (Loveless 1959). Soon after, Lewis and Bacher (1968) demonstrated how to administer EMS to Drosophila melanogaster to generate mutations, and later Sydney Brenner did the same for the nematode C. elegans (Brenner 1974). The now classic article by Coulondre and Miller (1977) demonstrated the types of nucleotide substitutions generated by EMS and confirmed earlier observations (Bautz and Freese 1960) concerning the strong bias for GC to AT transitions. Today, EMS is still the most powerful and popular mutagen used by researchers studying D. melanogaster and C. elegans. Purely on the basis of genetic inference, when used at a concentration of 50 mM, EMS is calculated to induce ∼20 function-affecting variant alleles in C. elegans strains derived using this mutagen (Greenwald and Horvitz 1982; Anderson 1995).
The chemical N-ethyl-N-nitrosourea (ENU) has been used as a mutagen since the 1970s but came to prominence when it was demonstrated to be the most effective chemical mutagen in mice (Russell et al. 1979). Today it is still the chemical mutagen of choice for this organism (Anderson 2000; Acevedo-Arozena et al. 2008). ENU has also been used for C. elegans mutagenesis (De Stasio et al. 1997). Although it appears to have different biases with regard to gene targets and base changes relative to EMS, the background mutational load after ENU mutagenesis has not been fully characterized (De Stasio and Dorman 2001).
The chemical 4,5′,8-trimethylpsoralen is a crosslinking agent that is activated by near ultraviolet light. Studies in Escherichia coli have shown that it causes both single-base changes and deletions (Piette et al. 1985; Sladek et al. 1989). C. elegans researchers became interested in the potential of ultraviolet trimethylpsoralen (UV/TMP) to generate deletions in worms after the first deletions in this organism were isolated using this mutagen (Yandell et al. 1994). UV/TMP is now a major reagent in the arsenal of the C. elegans knockout consortium laboratories (Barstead and Moerman 2006). As a tool for generating deletions in eukaryotes it is quite useful but, outside of studies on prokaryotes, little else is known about the spectrum of mutagenic effects caused by UV/TMP.
Massively parallel short read sequencing technologies offer unprecedented opportunities to study the complete genetic complement of an individual organism (Hillier et al. 2008). For genetic model systems the impact of this technology extends to the identification and correlation of induced mutations with selected phenotypes (Sarin et al. 2008). Several of the technological and bioinformatic issues that arise with next generation sequencing have already been addressed for the nematode C. elegans (Hillier et al. 2008; Sarin et al. 2008; Shen et al. 2008; Rose et al. 2010). Still, it is not clear how deeply one must sequence to confidently identify a relevant variant allele in a target mutant strain. Also of importance are questions concerning mutagen choice and dosage as they relate to the rate of induction of new mutations and background mutational load. We have undertaken the following study on mutagenesis and mutation detection to establish the parameters necessary to exploit next generation sequencing technologies for C. elegans genetics. For the first time we offer a whole-genome direct measure of mutation spectrum and background load for EMS, ENU, and UV/TMP. Readers interested in whole-genome sequencing of EMS mutagenized strains in C. elegans should also see the accompanying article in this issue by Sarin et al. (2010). In our study we also measured the single-nucleotide variation among currently used wild-type strains. In addition, we measured sequence read depth of all sequence and coding sequence and from this we make a recommendation of average genome coverage to ensure the correct identification of the causative mutation. We also examined the issue of false positive and false negative calls and make recommendations to eliminate most false positives without losing bona fide mutations.
MATERIALS AND METHODS
Strains used in this study:
Homozygous unc-22 strains derived from mutagenesis were VC1923, VC1924, RB5000, and RB5002 (EMS); VC2362 and VC2366 (UV/TMP); and VC2451, VC2452, and RB5001 (ENU). DM1017 [unc-52(e3003e998) II; sup-38(ra5) IV] was generated by EMS mutagenesis of CB998 (Gilchrist and Moerman 1992). VC2010 is the Moerman Gene Knockout Lab subculture of N2 obtained from the Caenorhabditis Genetics Center in 2002 and is the parent strain for all unc-22 strains described here. A second N2 subculture, from the Schedl lab at Washington University (St. Louis, MO), designated WU N2, was also used in this work, but we used existing whole-genome sequence data (Hillier et al. 2008) instead of generating new data. The N2 reference DNA sequence used in this study was largely derived from cosmids constructed from a wild-type strain, although a portion of the reference was derived from yeast artificial chromosomes (YACs), which were produced from the strain CB1301, an endonuclease minus mutant derived from a wild-type strain by EMS mutagenesis (R. Waterston, personal communication).
Nematode culture and mutagenesis:
Nematodes generally were grown as previously described (Brenner 1974). Populations for mutagenesis were prepared by washing VC2010 (N2) animals off starved plates with 10-ml aliquots of M9 buffer containing 0.01% Triton X-100, pelleting in a benchtop centrifuge in 15-ml centrifuge tubes, and replating on 150-mm NGM agar plates seeded with E. coli strain OP50 or χ1666. After 2 days at 20°, gravid adults were harvested by washing with sterile distilled water and the population was synchronized by alkaline hypochlorite treatment (Sulston and Hodgkin 1988), and the egg pellet was replated on fresh 150-mm seeded plates. When the populations were predominantly at the L4 stage, they were collected for mutagenesis by washing in M9/Triton X-100.
Mutagenesis with EMS was performed at 50 mM (for VC and DM strains) or 60 mM (for RB strains) according to standard protocols (Sulston and Hodgkin 1988). Mutagenesis by UV/TMP was performed at 2 μg/ml TMP according to our laboratory standard protocol. Briefly, 100 mg TMP (Sigma, St. Louis; T6137) was dissolved by vigorous shaking in 40 ml acetone (Gengyo-Ando and Mitani 2000) in a sterile 50-ml polypropylene centrifuge tube to make a solution of 2.5 μg/μl. A nematode population washed from a growth plate was transferred to a sterile 15-ml polypropylene centrifuge tube and the volume adjusted to 4 ml. TMP solution was added to the worm suspension to the desired concentration. The tubes were incubated in darkness on a benchtop shaker at 150 rpm for 1 hr. After incubation the worms were washed free of mutagen and bacteria in five changes of M9/Triton X-100 and diluted to 12.5 ml in M9/Triton. Twelve 1-ml aliquots were placed in individual wells of a sterile untreated 12-well tissue culture plate (Biopacific, E222804601F), and the plate was irradiated with 360 nm UV at 340 μW/cm² for 90 sec in a custom-built irradiation cabinet. Mutagenesis by ENU was performed at 0.5 mM according to De Stasio and Dorman (2001) for the VC strains and at 1.0 mM for the RB strains. It should be noted that batches of mutagen do vary in potency. In our experience TMP is particularly susceptible to batch variability, which makes it very difficult to arrive at a standard recommended concentration.
For each mutagen, 180 P0 animals were plated at three per 60-mm seeded agar plate. The F1 progeny were screened for heterozygous unc-22 twitchers in 1% nicotine (Moerman and Baillie 1979); from each such animal, a homozygous F2 unc-22 twitcher was selected for further processing. Each homozygous line was propagated clonally from F2 through F7 to drive other carried mutations to homozygosity, and a homozygous F7 animal was selected from each to establish a stock. Between two and four such F7 lines were selected for each mutagen. The VC and DM lines were subjected to comparative genomic hybridization (CGH) analysis using our whole-genome chip designs (Roche NimbleGen design name “2006-11-16_CE2_WG_CGH_E” for the EMS mutants and “081002_CE_UBC_RZ_CGH” for the TMP and ENU mutants) and to whole-genome sequencing on the Illumina platform at the Michael Smith Genome Sciences Centre (Vancouver, BC, Canada). The RB lines were not examined with CGH but were sequenced using the Illumina platform at the Oklahoma Medical Research Foundation (OMRF) (Oklahoma City, OK).
Whole-genome shotgun sequencing:
Worm strains for sequencing were grown on 150-mm petri plates containing rich NGM medium (standard recipe but 8× peptone) with agarose [Invitrogen (Carlsbad, CA) UltraPure, catalog no. 16500-500] substituted for agar, seeded with E. coli χ1666. Populations were grown to starvation, harvested by washing with sterile M9/Triton X-100, and pelleted by centrifugation in sterile 15-ml centrifuge tubes, and the supernatant was removed by aspiration. The buffer was removed using five changes of sterile distilled water, centrifugation, and aspiration; after the final wash, water was removed, leaving a concentrated worm pellet in a minimum volume, typically between 300 and 500 μl. Worm concentrates were frozen at −80°.
Two methods were used for DNA preparation. DNA for strains VC2010 and VC1924 was prepared by standard phenol/chloroform extraction with RNAse treatment, precipitation, and resuspension in TE (pH 8.0). DNA for all other strains was prepared using the PureGene Genomic DNA Tissue Kit [QIAGEN (Valencia, CA) no. 158622], following a supplementary QIAGEN protocol for nematodes. DNA concentrations were determined using a Thermo Fisher Nanodrop spectrophotometer, and quality was checked by electrophoresis on 1% agarose gels.
Prepared DNAs were submitted to the Michael Smith Genome Sciences Centre (VC and DM strains) or the OMRF sequencing center (RB strains) in amounts ranging from 5 to 10 μg. The WU N2 strain, which was derived from the Bristol N2 strain, was sequenced by Hillier et al. (2008) with unpaired reads of 32 nucleotides in length while the VC2010 and VC1924 strains were sequenced with 36-mer unpaired reads. The other strains were sequenced with a more recent protocol using paired reads, with a read length of 37 for the RB strains (RB5000, RB5001, and RB5002) and 50 for all the remaining strains.
Sequence alignment, variant determination, and search for deletions:
The sequence reads were aligned to the WormBase (www.wormbase.org) reference genome version WS190, using the Maq software suite version 0.7.1 (Li et al. 2008) with default parameters. The variant calling procedure also followed a standard Maq pipeline with default filters and parameters except that the maximum number of mismatches was set to three and the minimum mapping quality was set to 10 for a read to be considered in the consensus calling. Identical read pairs were removed before variant determination and only the unambiguous homozygous variants were kept. Variants with ambiguous consensus base (i.e., base other than A, C, T, or G) were eliminated. Furthermore, the minimum Phred-like consensus quality was set to 30 instead of the default 20, which provided a better compromise between sensitivity and specificity for the current application and data sets. The variant candidates were then labeled according to their type and location (within intron or exon, synonymous or nonsynonymous, etc.) with a custom Perl script, using gene information available in WS190.
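The filtering rules described above can be illustrated with a minimal sketch. The record layout (`Call`) and the function name `passes_filters` are hypothetical and do not reproduce Maq's output format or the Perl script used in this study. The sketch also assumes that heterozygous or ambiguous consensus calls appear as IUPAC ambiguity codes, that "minimum quality of 30" means a quality of 30 or more, and that a call needs at least three reads (the Maq default cited in the Results).

```python
from typing import NamedTuple

class Call(NamedTuple):
    """Hypothetical consensus call; NOT Maq's actual output layout."""
    chrom: str
    pos: int
    ref: str      # reference base
    cons: str     # consensus base (IUPAC code)
    quality: int  # Phred-like consensus quality
    depth: int    # number of reads covering the position

UNAMBIGUOUS = {"A", "C", "G", "T"}
MIN_QUALITY = 30  # raised from the default of 20, as described in the text
MIN_DEPTH = 3     # assumption: minimum read count required for a call

def passes_filters(call: Call) -> bool:
    """Keep only unambiguous homozygous substitutions of sufficient quality."""
    if call.cons not in UNAMBIGUOUS:   # drops ambiguous/heterozygous calls
        return False
    if call.cons == call.ref:          # not a variant
        return False
    return call.quality >= MIN_QUALITY and call.depth >= MIN_DEPTH

# Illustrative records only; these are not data from the study.
calls = [
    Call("I", 100, "G", "A", 57, 12),   # kept
    Call("I", 200, "G", "R", 80, 15),   # ambiguous base, dropped
    Call("II", 300, "C", "T", 25, 9),   # quality below 30, dropped
    Call("II", 400, "A", "T", 45, 2),   # fewer than three reads, dropped
]
kept = [c for c in calls if passes_filters(c)]
```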
Searches for homozygous deletions >10 bases followed a two-step approach. First the regions with no coverage in the mutant strain under consideration but with some coverage in the N2 strain VC2010 were flagged for further analysis with the second step. The second step compared the distribution of apparent distance between the aligned read pairs in the regions of interest to the average distribution for the whole chromosome, using a Kolmogorov–Smirnov (KS) test. Those two steps were implemented with Perl and R scripts and the most promising deletion candidates were further evaluated by visual inspection. The most promising candidates were the regions with a small P-value in the KS test from the second step, larger size, and good agreement between the two size measurements, i.e., (1) the size of the region with lack of coverage from the first step and (2) the increase in apparent insert size from the second step.
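The two-step search can be sketched as follows. This is an illustration under stated assumptions, not the Perl and R scripts used in the study: coverage is represented as per-base lists, the function names are hypothetical, and only the KS statistic (the maximum distance between the two empirical distributions) is computed. A P-value would in practice be obtained from a statistics package.

```python
from bisect import bisect_right

def zero_coverage_runs(mutant_cov, parent_cov, min_len=11):
    """Step 1: half-open (start, end) runs with no mutant coverage,
    some parental coverage, and a length of more than 10 bases."""
    runs, start = [], None
    # A trailing sentinel of 1 closes a run that reaches the end of the list.
    for i, depth in enumerate(list(mutant_cov) + [1]):
        if depth == 0 and start is None:
            start = i
        elif depth != 0 and start is not None:
            if i - start >= min_len and any(parent_cov[start:i]):
                runs.append((start, i))
            start = None
    return runs

def ks_statistic(sample_a, sample_b):
    """Step 2: two-sample Kolmogorov-Smirnov statistic, e.g. for apparent
    insert sizes near a candidate versus the whole chromosome."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in set(a) | set(b):  # the maximum is reached at an observed value
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d

# Toy data: a 15-base gap is flagged, a 5-base gap is ignored.
mutant = [5] * 20 + [0] * 15 + [4] * 20 + [0] * 5 + [3] * 5
parent = [6] * len(mutant)
candidates = zero_coverage_runs(mutant, parent)
```

Candidates with a large KS statistic, and with a gap length that agrees with the increase in apparent insert size, would then be the ones passed on to visual inspection.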
Comparative genomic hybridization:
The VC and DM mutant strains were processed with whole-genome comparative genomic hybridization following the procedure described in detail in Maydan et al. (2007), using the N2 wild-type VC2010 strain for reference DNA. Roche NimbleGen manufactured the microarrays but the experiments and subsequent data processing were performed in the laboratory of D. G. Moerman.
RESULTS
Variation in wild-type strains:
We began our study by analyzing our laboratory wild-type strain, VC2010, and comparing it to the sequences of two other wild-type strains, the C. elegans reference sequence and the recently reported sequence of the WU N2 strain (Hillier et al. 2008). All three sequenced strains are subcultures of the original N2 strain established by Brenner (1974), but were maintained separately for an unknown number of generations before sequencing. Thus, an uncertain amount of drift may have occurred. For this project we used the massively parallel short sequencing technology from Illumina (either a GA1 or a GA2 instrument). In addition, we also reanalyzed sequencing reads from the WU N2 strain from Washington University (Hillier et al. 2008). Sequencing reads were aligned using the Maq software suite (Li et al. 2008) to the wild-type Bristol N2 reference genome WS190 assembled from Sanger sequencing reads (C. elegans Sequencing Consortium 1998). Sequence processing details can be found in materials and methods. Variant candidates (single-nucleotide transitions and transversions) were called with the Maq suite primarily using default parameters and filters.
After obtaining an average sequence depth of coverage of 17-fold (Figure 1), we found 871 differences between the derived sequence of VC2010 and the reference sequence. Similar to the results reported by Hillier et al. (2008), in our reanalysis of their data (32-fold coverage), we detected 844 substitutions compared to the wild-type reference genome. Of these, 634 of the differences are shared between the two strains (Figure 2). These could represent accumulated mutations depending on the relationship of these two strains to the strains used for the reference sequence; more likely they represent errors in the reference sequence. Assuming that they all correspond to sequencing errors and that no others exist, we can estimate the sequencing error rate of the reference genome at a little less than 1 in 100,000 bases, which is in good agreement with previous estimates (C. elegans Sequencing Consortium 1998). Of the variants unique to each strain, almost all are in noncoding sequences or in silent bases of codons (synonymous changes). Some of these variants may represent sequencing errors, whereas others may represent differences accumulated due to drift during maintenance of the two strains. We return to this question after consideration of the mutagen-treated strains.
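The error-rate estimate can be checked with a short calculation. The genome size of roughly 100 Mb is an assumption introduced here for illustration; the count of 634 shared differences is taken from the text.

```python
def reference_error_rate(shared_variants: int, genome_size_bp: int) -> float:
    """Errors per base if every shared difference is a reference error."""
    return shared_variants / genome_size_bp

SHARED_VARIANTS = 634          # differences common to VC2010 and WU N2
GENOME_SIZE_BP = 100_000_000   # assumption: approximate C. elegans genome size

rate = reference_error_rate(SHARED_VARIANTS, GENOME_SIZE_BP)
bases_per_error = 1 / rate
print(f"about 1 error per {bases_per_error:,.0f} bases")
```

Under these assumptions the estimate is on the order of one error per 150,000 bases, consistent with "a little less than 1 in 100,000".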
Figure 1.—
Average genome coverage for the various strains studied in this work. The average coverage is calculated by dividing the sum of the aligned bases by the genome size. The boxes are color coded according to the mutagens used to create the strains and the color key is provided in the inset.
Figure 2.—
Number of variants called in two N2 wild-type strains (VC2010, Vancouver; WU N2, St. Louis) when the sequence of each is compared to the Cambridge wild-type reference genome WS190 and to each other. The blue bars correspond to variants common to VC2010 and WU N2 but different from the reference genome. The yellow plus red bars represent variants unique to either VC2010 or WU N2. The red bars are associated with nonsynonymous variants. The yellow bars are associated with synonymous variants.
Variant detection in EMS-, ENU-, and TMP/UV-treated strains:
We next sequenced at varying depths of coverage (Figure 1) 10 strains derived from VC2010 after treatment with three commonly used mutagens using standard protocols. Five strains were produced with EMS, two with UV/TMP, and three with ENU (see materials and methods). As can be seen in Figure 1, in the current study the average genomic coverage for each strain ranged from 12× to 33× with a median value of 19×. Variant candidates (single-nucleotide transitions and transversions) were called with the Maq suite primarily using default parameters and filters. Since the default setting of this software requires a minimum of three reads to detect a variant, we evaluated the fraction of the genome covered to at least this depth (Figure 3). Even for the strains with the lowest overall coverage, >90% of the genomic bases were covered to a depth of at least three reads. Because coverage can be influenced by GC content and exons are GC rich compared to the rest of the genome, and also because we are primarily interested in variants in exons, we determined the fraction of exon bases with at least three reads and the fraction of exons with all bases covered by at least three reads (Figures 3 and 4). The exon coverage closely paralleled the overall genome coverage, but Figure 4 does illustrate that an overall coverage >20× is necessary to detect all the variants in more than 80% of the exons.
Figure 3.—
Percentage of bases covered at a depth of at least 3× for each strain. The red bars correspond to the coverage for all the bases in the genome and the yellow bars to the bases within exons. The blue bars represent the percentage of exons for which all the bases are covered to a depth of at least 3×.
Figure 4.—
Percentage of exons totally covered to a depth of at least 3× as a function of average genome-wide coverage. Each data point corresponds to a different strain studied in this work.
The genomes of each of these mutagenized strains have a large number of mutations compared to the reference. Many of these differences are shared with the VC2010 parent strain or with other mutagenized strains (Figure 5). Because mutations are rare in each of these strains, the variants shared between strains could correspond to additional errors in the reference genome (see above), errors produced by some sort of bias in the current sequencing protocol, or genetic drift between VC2010 and the strain(s) used to create the reference genome (see below). Before analyzing the variants unique to each strain in more detail, we wished to evaluate the sensitivity and specificity of the variant calls.
Figure 5.—
Number of variants identified in the wild-type N2 strain VC2010 and the 10 mutated strains studied in this work. The boxes are color coded according to the mutagens used to create the strains and the color key is provided at the top left. The hatched bars correspond to the variants present in >1 of those 11 strains while the solid bars correspond to the unique variants.
Sensitivity and specificity analysis:
We used two approaches to evaluate sensitivity. If we assume that the variants present in both wild-type N2 strains (VC2010 and WU N2) represent true differences from the reference genome, we can use these as a “true” positive set to estimate the ability to detect these in each of the mutant strains. Sensitivity ranged from 79 to 96% for individual mutant strains, with a median of 90% and a strong correlation to the average genome coverage (correlation coefficient r = 0.865; see Table 1). When focusing such analysis only on variants present in exons, the median sensitivity increases to ∼95%.
TABLE 1.
Variant detection sensitivity as a function of the threshold in Phred-like quality score
Strain | Coverage | Sensitivity (%), quality >20 | Sensitivity (%), quality >30 | Sensitivity (%), quality >40 |
---|---|---|---|---|
VC2366 | 11.9× | 78.6 | 78.5 | 64.5 |
VC2362 | 12.7× | 83.8 | 84.2 | 76.7 |
VC1923 | 12.8× | 81.7 | 81.9 | 73.7 |
RB5000 | 14.0× | 88.7 | 89.3 | 87.6 |
DM1017 | 16.0× | 87.5 | 87.5 | 82.9 |
VC2451 | 20.3× | 90.8 | 91.2 | 87.4 |
VC2452 | 23.0× | 90.4 | 91.2 | 87.6 |
VC1924 | 23.4× | 95.1 | 95.7 | 93.2 |
RB5002 | 29.1× | 94.7 | 95.1 | 95.1 |
RB5001 | 30.1× | 93.4 | 93.9 | 94.2 |
The variant detection sensitivity was evaluated for individual strains using three different thresholds in Phred-like quality score from Maq. Variants common to VC2010 and WU N2 were used for the evaluation, and the list comprised 655 variants at quality >20, 634 variants at quality >30, and 533 variants at quality >40.
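The relationship between coverage and sensitivity in Table 1 can be examined with a short sketch that computes a Pearson correlation coefficient. The text does not state which quality column was used for the reported coefficient; the quality >30 column is assumed here because that threshold was adopted for variant calling.

```python
from math import sqrt

# Values transcribed from Table 1 (coverage and sensitivity at quality >30).
coverage = [11.9, 12.7, 12.8, 14.0, 16.0, 20.3, 23.0, 23.4, 29.1, 30.1]
sensitivity_q30 = [78.5, 84.2, 81.9, 89.3, 87.5, 91.2, 91.2, 95.7, 95.1, 93.9]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = pearson_r(coverage, sensitivity_q30)
print(f"r = {r:.3f}")
```

With these inputs the coefficient should fall close to the value reported in the text.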
The above sensitivity calculations are based on a biased data set; they derive only from regions of the genome where variants have already been detected in this study. We therefore estimated the sensitivity in a second way. We created 10,000 in silico single-nucleotide “mutations” at random locations in the reference genome and calculated our ability to detect these with real reads from our paired-end 50-mer data sets. At a depth of 20×, the overall sensitivity to detect the in silico mutations was 89% and reached 95% when focusing on exons.
Detection specificity is the proportion of the variants called that are true differences from the reference genome. This requires comprehensive knowledge of the true positives in a data set. Specificity here can be estimated by analyzing the variants called for VC2010 provided two assumptions are made: (1) variants called in VC2010 but none of the derived strains are all false positives and (2) none of the remaining VC2010 variants (those called in multiple strains) are false positives. Under these two assumptions, the specificity for calling variants in the current VC2010 data set is 90%. Changing the minimum Phred-like consensus quality from the default of 20 to 30 improved the specificity from 77 to 90%, without significant loss in sensitivity (Table 1). A threshold consensus quality of 40 increases the specificity to 99%, but at a cost in sensitivity (Table 1). To evaluate specificity directly, we tested 40 predicted variants in the strain VC1924 by PCR of the affected regions and subsequent Sanger sequencing and confirmed 39 of them. The remaining one was not a single-base substitution but a single-base insertion also present in VC2010. On the basis of this sensitivity/specificity analysis the variants labeled as unique in Figure 5 are likely to represent the bulk of induced variants present in each strain with only a few false positives. Therefore we focused our further analysis on those candidates.
The mutation spectrum of EMS, ENU, and TMP/UV:
On the basis of the unique variants (Figure 5) EMS produced more variants (median of 323 per strain) than ENU (median of 226) or UV/TMP (median of 78) at the concentrations used. As expected, the relative frequency of the possible transitions and transversions differed for each mutagen. UV/TMP produced the least amount of bias between the various mutation types; EMS primarily produced G/C to A/T transitions; and ENU produced A/T to T/A transversions, G/C to A/T transitions, and A/T to G/C transitions (Figure 6).
Figure 6.—
Relative frequency of the various transitions and transversions identified in the 10 mutated strains studied in this work. The data for individual strains were combined according to the mutagen used. Only the variants appearing in a single strain were used. The color key identifying the type of mutation is provided in the inset, and the first letter represents the reference nucleotide while the second letter represents the mutated nucleotide.
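The grouping of substitutions into strand-collapsed classes such as "G/C to A/T" can be sketched as follows. The function name `classify` and the label format are illustrative choices, not the scripts used to produce Figure 6, which reports the 12 uncollapsed reference-to-mutant types.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
PURINES = {"A", "G"}

def classify(ref: str, alt: str):
    """Return (strand-collapsed class, 'transition' or 'transversion')."""
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    # A transition keeps the base within purines or within pyrimidines.
    same_family = (ref in PURINES) == (alt in PURINES)
    kind = "transition" if same_family else "transversion"
    # Report every change from the strand on which the reference is a purine,
    # so that G>A and C>T fall in the same class.
    if ref not in PURINES:
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    label = f"{ref}/{COMPLEMENT[ref]}>{alt}/{COMPLEMENT[alt]}"
    return label, kind

# Illustrative input: a handful of (reference, mutant) base pairs.
changes = [("G", "A"), ("C", "T"), ("A", "T"), ("T", "A"), ("T", "C")]
spectrum = Counter(classify(ref, alt)[0] for ref, alt in changes)
```

Dividing each count in `spectrum` by the total for a mutagen would give relative frequencies comparable to those discussed above.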
These mutations appear to be distributed differently across gene features for the different mutagens (Figure 7). The proportion of variants within exons was the smallest for the UV/TMP strains and the largest for the EMS strains. For those mutations falling in exons, we found 477 missense and 21 nonsense mutations in the 10 mutant strains derived from the VC2010 wild-type strain. In addition, we found a total of 10 variants affecting splice sites (see Table 2). On average we identified 50 informative mutations per strain. The complete list of variants called in the mutant strains is available for download (see supporting information, File S1).
Figure 7.—
Relative variant frequency affecting various gene features in the 10 mutated strains studied in this work. The data for individual strains were combined according to the mutagen used. Only the variants appearing in a single strain are represented. The proportions of nonsynonymous variants are represented in blue (dark for nonsense and light for missense mutations), the synonymous variants within exons are shown in green, the variants appearing in introns are yellow, and red bars are used for the remaining variants.
TABLE 2.
Variants associated with nonsense mutations and splicing defects in the mutant strains sequenced in the current study
Strain | Allele | Chromosome | Coordinate (WS190) | Reference base | Mutant base | Quality | Mutation | Gene |
---|---|---|---|---|---|---|---|---|
DM1017 | gk2895 | III | 10004481 | G | T | 81 | Splicing | C05B5.11 |
RB5001 | ok5304 | III | 7783172 | C | T | 57 | Splicing | C08C3.3 |
RB5001 | ok5478 | X | 15475897 | A | C | 135 | Nonsense | C46E1.3 |
VC1923 | gk1398 | II | 3721382 | G | A | 48 | Splicing | C46E10.3 |
RB5002 | ok5778 | V | 10149631 | G | A | 83 | Splicing | C51E3.1 |
VC1924 | gk963 | I | 8430347 | C | T | 111 | Splicing | F02E9.9 |
RB5002 | ok5863 | X | 1312996 | C | T | 95 | Nonsense | H42K12.3 |
DM1017 | gk2864 | II | 14421096 | T | A | 54 | Nonsense | K04B12.1 |
VC2452 | gk2668 | V | 16180639 | G | A | 111 | Nonsense | K05D4.2 |
DM1017 | gk2856 | II | 14115235 | G | A | 57 | Nonsense | K09E4.4 |
RB5000 | ok5083 | III | 2036666 | C | T | 90 | Splicing | M01E10.2 |
VC2451 | gk2294 | I | 4765803 | G | T | 39 | Nonsense | M04F3.2 |
RB5002 | ok5810 | V | 13368760 | G | A | 126 | Nonsense | F32H5.6 |
RB5001 | ok5385 | V | 1846700 | A | T | 105 | Nonsense | T10B5.7 |
RB5002 | ok5804 | V | 12945577 | G | A | 105 | Nonsense | T16G1.8 |
VC1924 | gk964 | II | 8930558 | G | A | 74 | Splicing | T21B10.1 |
RB5002 | ok5573 | II | 3779342 | C | T | 108 | Splicing | T24E12.11 |
RB5002 | ok5516 | I | 5111435 | G | A | 84 | Nonsense | Y110A7A.15 |
VC1924 | gk2021 | V | 18843877 | C | T | 81 | Nonsense | Y17D7A.3 |
RB5000 | ok5041 | II | 3107233 | A | T | 57 | Nonsense | Y25C1A.3 |
VC1923 | gk1442 | II | 12590593 | G | A | 36 | Nonsense | Y38E10A.6 |
DM1017 | gk2714 | I | 93554 | C | T | 126 | Nonsense | Y48G1C.1 |
RB5002 | ok5820 | V | 15195283 | G | A | 153 | Nonsense | Y75B12B.6 |
DM1017 | e998 | II | 14657282 | C | T | 57 | Nonsense | ZC101.2 |
RB5001 | ok5212 | I | 10356594 | T | C | 126 | Splicing | ZC434.9 |
VC1924 | gk952 | III | 9536643 | G | T | 57 | Nonsense | ZK1098.3 |
VC2451 | gk2406 | IV | 11978396 | G | A | 102 | Nonsense | ZK617.1 |
VC1924 | gk965 | IV | 11982077 | G | A | 90 | Nonsense | ZK617.1 |
VC2452 | gk2609 | IV | 11992038 | G | A | 135 | Nonsense | ZK617.1 |
VC2452 | gk2548 | II | 10497733 | T | G | 108 | Splicing | F59B10.1 |
DM1017 | gk2764 | I | 4371101 | G | A | 36 | Nonsense | ZK973.9 |
Deletion mutations in the mutagen-treated strains:
Because Maq as used here does not detect insertion/deletion differences, we searched independently for large homozygous deletions in the sequence data sets produced with the paired end sequencing protocol (see materials and methods) and complemented the searches with CGH experiments. With the microarray designs used in the current study, CGH has previously been capable of detecting deletions up to megabases in size and mutations as small as single-base changes (Maydan et al. 2007, 2009). Only two deletion candidates were found in our mutagenized strains and these were confirmed using Sanger sequencing. The first deletion removes 122 bp of an exon of the unc-22 gene in the UV/TMP strain VC2366, coincidentally very close to a T to G point mutation in the same exon (Figure 8). The other deletion, ∼200 bases in size, affected an intron in the gene arx-1 (Y71F9AL.16) in the ENU strain VC2452.
Figure 8.—
Deletion found on chromosome IV of the strain VC2366. The top panel shows the coverage in the region of interest. The middle panel shows the apparent distance between the alignments of read pairs to the reference genome. The bottom panel shows the fluorescence ratio in log2 scale in a comparative genomic hybridization experiment (Maydan et al. 2007).
Genetic drift in wild-type strains:
The availability of whole-genome sequences for three separate “N2-like” wild-type strains, one from Cambridge, England, one from St Louis, and one from Vancouver, British Columbia, provides an opportunity to examine genetic drift. Since the sensitivity is not perfect, we cannot expect all variants to be called in all the data sets. To estimate the amount of genetic drift between VC2010 and WU N2 we concentrated on the 886 variants called in at least six (i.e., more than half) of the VC2010-derived strains (Figures 2 and 5). Assuming a sensitivity of 90% and no genetic drift, we calculate that we should detect 797 of those variants in the WU N2 sequence. However, we called only 662 of those variants. The difference between these two numbers, 135, is the estimate of the number of variants present in VC2010 but absent in WU N2. Further, there were 85 variants in the WU N2 that were not found in VC2010 or its derivatives. Since both the sensitivity and the specificity are estimated to be ∼90%, we can use that number of 85 variants directly as an estimate of the variants really present in WU N2 but not in VC2010. Taken together, 135 + 85 gives us a reasonable estimate of the genetic drift between VC2010 and WU N2.
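The arithmetic behind the drift estimate can be written out explicitly. All numbers come from the paragraph above; the function name and argument names are illustrative.

```python
def drift_estimate(shared_in_derived, sensitivity, called_in_other, unique_to_other):
    """Return (expected calls, variants missing from the other strain, total drift)."""
    # Variants we would expect to call in the other strain if there were no drift.
    expected = round(shared_in_derived * sensitivity)
    # The shortfall is attributed to variants absent from the other strain.
    missing = expected - called_in_other
    return expected, missing, missing + unique_to_other

expected, vc2010_only, total_drift = drift_estimate(
    shared_in_derived=886,  # variants called in at least six VC2010-derived strains
    sensitivity=0.90,       # assumed detection sensitivity
    called_in_other=662,    # of those, variants actually called in WU N2
    unique_to_other=85,     # variants found only in WU N2
)
print(expected, vc2010_only, total_drift)
```

The estimate depends directly on the assumed sensitivity: a lower sensitivity would reduce the expected number of calls and therefore the inferred number of VC2010-specific variants.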
DISCUSSION
In this study we determined the spectrum of mutations across the entire C. elegans genome induced by three well-known mutagens EMS, ENU, and UV/TMP. In total we detected 2723 variants and we placed these into two broad categories. In the first group, comprising 2215 variants, are those that do not affect any encoded protein. These variants map to intergenic regions, or introns, or lead to synonymous change. In the second group, comprising 508 variants, are those producing nonsynonymous changes (477 events), nonsense mutations (21 events), and mutations affecting splice sites (10 events). As we endeavored to make each animal homozygous across the entire genome prior to sequencing, our study probably detects only about half of the genetic events after mutagenesis. Even so, it is clear that each mutagenized animal carries on average a genetic load of hundreds of variants of which as many as 50 may have deleterious effects. This translates to about eight mutations on every chromosome in the worm. There are two implications of this observation. First, it emphasizes the need to outcross a mutant animal with a particular phenotype several times to ensure that all, or at least most, background mutations are eliminated. Second, for those wishing to identify the mutation responsible for a given phenotype, simply sequencing the mutant genome is not enough. Some genetic mapping should be done concurrently or one will be confounded by all the choices in candidate variants. A number of powerful new SNP, bulk segregant, or CGH mapping methods are available where traits can be mapped to a 5-map-unit interval, or even a smaller interval, after a single cross (Michelmore et al. 1991; Jakubowski and Kornfeld 1999; Wicks et al. 2001; Swan et al. 2002; Davis et al. 2005; Flibotte et al. 2009). In combination with deep sequencing these methods will provide an unprecedented resource for new forward genetics discoveries.
For EMS and ENU the spectrum of mutation events we observe is similar to what has been reported by others (Coulondre and Miller 1977; De Stasio et al. 1997). EMS produces primarily G/C to A/T transitions and ENU, while exhibiting some bias for these transitions, also produced many other transitions and transversions (Figure 6). UV/TMP demonstrated the least mutation bias, producing all manner of transitions and transversions (Figure 6). This was unexpected, as previous investigators reported a bias using this mutagen for TA to GC transversions (Piette et al. 1985). It should be noted, however, that this earlier study was performed on a single locus and sampled only a small number of mutations. The GC content within exons (43%) is much higher than the overall GC content (35%) of the C. elegans genome. It is therefore expected that the strong EMS bias to produce G to A and C to T transitions would translate into an increase in the proportion of variants affecting exons relative to UV/TMP, as can be seen in Figure 7. Following mutagenesis procedures and doses commonly used by the C. elegans research community (see materials and methods), we find that EMS is the most powerful mutagen, causing 1.5- to 2-fold more base substitutions than ENU and at least 3-fold more mutations than UV/TMP. Of striking note, none of the 31 nonsense or splice site variants identified were isolated after UV/TMP mutagenesis (Table 2): all were identified after either EMS or ENU mutagenesis. As each mutagen is distinct in behavior, the choice of mutagen will be determined by the type of mutation a researcher desires. If the object is to obtain null alleles, EMS or possibly ENU is an excellent choice. On the other hand, if one is seeking allelic variability, including a variety of missense alleles, ENU may be a wiser choice because it will generate a broader spectrum of changes at high frequency.
From our study and those of others (Hillier et al. 2008; Denver et al. 2009; Vergara et al. 2009) it is clear that not all wild-type N2 subcultures are identical. Denver et al. (2009) estimated that the spontaneous forward mutation rate for C. elegans is 3 × 10⁻⁹ per site per generation. This translates to about one mutation every three generations. We identified hundreds of variants unique to our N2 strain or the WU N2 strain. Using a very conservative metric, we find that ∼135 variants are unique to VC2010 and another 85 are unique to the WU wild-type strain. On the basis of the above calculation of the occurrence of spontaneous mutations and the time separating these wild-type strains, these variants are most likely the result of genetic drift that has occurred over the hundreds of generations these two strains have been separated, both from each other and from the reference Bristol N2 wild-type strain. This observation leads us to recommend that investigators sequence the parental strains used in any forward genetic screen to distinguish induced variation from variation due to drift. Granted, this could get expensive if all one's mutations are in such different parental backgrounds that one has to sequence a different parent strain for every mutant. An alternative would be to sequence a second allele of the unknown gene and subtract the common background variants (for details on this approach see Sarin et al. 2010). Of course the limitation here is that a second allele will not always be available.
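The conversion from a per-site rate to "one mutation every three generations" can be checked with a short calculation. This is a sketch under two assumptions of ours: the rate is applied to a single haploid copy of the roughly 100-Mb genome, and the number of generations separating two subcultures (500 below) is a purely hypothetical value chosen for illustration.

```python
RATE_PER_SITE_PER_GENERATION = 3e-9  # Denver et al. (2009), as quoted in the text
GENOME_SIZE_BP = 100e6               # approximate genome size (100 Mb), haploid copy assumed


def mutations_per_generation(rate=RATE_PER_SITE_PER_GENERATION,
                             genome_size=GENOME_SIZE_BP):
    """Expected number of new spontaneous mutations per generation."""
    return rate * genome_size


def expected_mutations(generations,
                       rate=RATE_PER_SITE_PER_GENERATION,
                       genome_size=GENOME_SIZE_BP):
    """Expected number of mutations accumulated along one lineage."""
    return mutations_per_generation(rate, genome_size) * generations


per_generation = mutations_per_generation()
print(f"mutations per generation: {per_generation:.2f}")
print(f"generations per mutation: {1 / per_generation:.1f}")

# Hypothetical separation time, for illustration only.
print(f"expected after 500 generations: {expected_mutations(500):.0f}")
```

Under these assumptions a lineage accumulates on the order of one hundred or more variants over several hundred generations, which is the same order of magnitude as the strain-specific variant counts reported above.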
In this deep sequencing study of the 100-Mb C. elegans genome we have tried to determine the most appropriate parameters to maximize sensitivity and specificity of variant detection without incurring undue cost. We explored base and exon coverage after varying sequencing depths (Figures 1, 3, and 4 and Table 1). We incorporated the idea of not just measuring genome coverage, but also measuring total exon coverage, as this proved to be a better and more sensitive indicator of how well all exons of genes are represented in the sequence. Here we define the total exon coverage as the percentage of all exons for which all the bases are covered to a depth of at least three reads. Coverage much below 15× generally resulted in sensitivity too low to ensure detecting a mutation. As a case in point, we were unable to find the unc-22 mutation in strains VC1923 and VC2362 even though all VC strains reported here carry mutations in the unc-22 gene (see materials and methods for selection procedure after mutagenesis). For strains VC1923 and VC2362 the genome coverage was ∼13-fold and at this level exons in genes were not adequately represented (Figures 3 and 4). In all strains with deeper coverage all unc-22 exons were covered and we could identify the mutation. On the basis of our experience with various sequencing depths we recommend aiming for 25× genome coverage to determine a homozygous mutation, as this would give close to 100% coverage of all exons to a depth of three reads. At this sequencing depth the variant detection sensitivity will be ∼95% (Table 1).
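The definition of total exon coverage given above lends itself to a compact illustration. The sketch below is not the code used in this study; it assumes a hypothetical input layout in which per-base read depths for one sequence are held in a list and exons are given as half-open (start, end) index pairs.

```python
def total_exon_coverage(depth, exons, min_depth=3):
    """Percentage of exons whose every base is covered by at least min_depth reads.

    depth -- list of per-base read depths along one reference sequence
    exons -- list of (start, end) half-open intervals indexing into depth
    """
    if not exons:
        return 0.0
    fully_covered = sum(
        1
        for start, end in exons
        # An exon counts only if all of its bases reach the depth threshold.
        if all(d >= min_depth for d in depth[start:end])
    )
    return 100.0 * fully_covered / len(exons)


# Toy data: three exons, one of which contains a base covered by only two reads.
depth = [5, 4, 3, 3, 2, 6, 7, 8, 0, 9]
exons = [(0, 4), (3, 6), (5, 8)]
print(f"{total_exon_coverage(depth, exons):.1f}% of exons fully covered")
```

Because a single poorly covered base disqualifies the whole exon, this measure drops faster than mean genome coverage as sequencing depth falls, which is why it is the more sensitive indicator.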
Both read depth and Phred-like quality cutoffs have an impact on sensitivity and specificity. The default setting for the Phred-like consensus score in the Maq program is 20. By shifting this to 30 one gains improved specificity with only a marginal loss of sensitivity, but if one shifts the setting any higher, then the loss in sensitivity is significant (see Table 1). We examined a limited set of 40 variant calls in the strain VC1924 by Sanger sequencing and only one “false positive” was identified, a single-base insertion. Calling this type of feature accurately is a known weakness of the Maq aligner (Krawitz et al. 2010) especially when working with shorter 35-base unpaired reads like those used to analyze VC1924. For most of our samples we used a sequencing protocol with paired-end reads. New analysis algorithms and software associated with massively parallel sequencing technologies become available on a regular basis. Doing an extensive review of such analysis techniques is clearly beyond the scope of the current work. However, the nature of the false positive variant called in VC1924 suggests that one could benefit from using an algorithm capable of producing gapped alignments. To test this hypothesis, we reanalyzed all the sequencing data sets used in the current work with the BWA/Samtools combination (Li and Durbin 2009; Li et al. 2009). The new analysis program correctly identified the single-base insertion relative to the reference genome in all 12 data sets, including in both wild-type strains (our unpublished results).
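To make the effect of the quality cutoff concrete, recall that a Phred-scaled score Q corresponds to an error probability of 10^(−Q/10). The sketch below assumes that the consensus score follows this standard scaling; the record layout (dictionaries with "quality" and "depth" keys) and the depth threshold are hypothetical and serve only as an illustration, not as a description of Maq's output format.

```python
import math


def phred_to_error_probability(quality):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10 * math.log10(probability)


def filter_calls(calls, min_quality=30, min_depth=3):
    """Keep variant calls that meet both cutoffs (hypothetical record layout)."""
    return [
        call
        for call in calls
        if call["quality"] >= min_quality and call["depth"] >= min_depth
    ]


for q in (20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_probability(q):.4f}")

# Illustrative records only; these are not calls from the study.
calls = [
    {"position": 101, "quality": 25, "depth": 12},
    {"position": 202, "quality": 35, "depth": 9},
    {"position": 303, "quality": 60, "depth": 2},
]
print([call["position"] for call in filter_calls(calls)])
```

Raising the cutoff from 20 to 30 lowers the nominal per-call error probability tenfold, which is consistent with the gain in specificity described above.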
One of the goals of this study was to establish some guidelines for future genome analysis projects, as it is clear that next generation sequencing is an important addition to the tools used to correlate phenotype and genotype. The parameters we suggest above will become easier to attain as sequencing costs continue to drop and as new programs for analysis become available. Data quality is important for any deep sequencing project, and even more so when the end product is a persistent genetic reagent like a C. elegans strain. One must ensure that the reported sequence truly represents the archived strain. Discrepancies that arise due to poor sequence or poor strain maintenance may be fatal to future genome-scale projects that seek to correlate phenotypes with gene networks. Poor sequence can result from bad reactions, poor data analysis, or inconsistent or incomplete data annotation. We hope that the suggestions we make here will help others to avoid at least some of these possible pitfalls.
Even with sequence perfectly representing the genome, if the strains are not managed properly, the reported sequence will not represent the archived strain. For example, if the investigator does not freeze the strain for several months after the DNA is prepared for deep sequencing, then the archived strain will almost certainly accumulate variation via new mutation and drift as described above for our N2 subculture, VC2010. More significantly, if the investigator is not careful to drive the strain to homozygosity before the sequencing sample is prepared, the archived strain will lose variation through simple genetic segregation as it is propagated prior to freezing. In this study we self-crossed animals for 7 generations, but it would be even better to do 10 (a policy that we ourselves have adopted).
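The benefit of additional generations of selfing can be quantified with the standard expectation that heterozygosity halves with each generation of self-fertilization. The sketch below assumes single-animal propagation, no selection, and independent loci; the starting count of 400 heterozygous sites is a hypothetical figure chosen only for illustration.

```python
def heterozygous_fraction(generations):
    """Expected fraction of initially heterozygous sites still heterozygous."""
    return 0.5 ** generations


def expected_heterozygous_sites(initial_sites, generations):
    """Expected number of sites remaining heterozygous after selfing."""
    return initial_sites * heterozygous_fraction(generations)


for n in (7, 10):
    fraction = heterozygous_fraction(n)
    # 400 initial heterozygous sites is a hypothetical value.
    remaining = expected_heterozygous_sites(400, n)
    print(f"{n} generations: {fraction:.4%} heterozygous, "
          f"about {remaining:.2f} of 400 sites")
```

Under these assumptions, 7 generations leave 1/128 (about 0.8%) of the initial heterozygous sites segregating, whereas 10 generations leave 1/1024 (about 0.1%).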
Massively parallel sequencing has the potential to completely change the genetic research landscape. For the first time an investigator can examine the full genotype of an organism. This will facilitate the study of phenotypes that are difficult to score, because they require specialized assays, are incompletely penetrant, or are variably expressed. Further, with a sufficiently large collection of mutated strains, where each gene is represented by multiple alleles, one could identify directly the genes responsible for a given screening phenotype simply by identifying the common gene targets among the relevant strains. Research on multicellular organisms has long lagged behind that on single-cell organisms like the yeast Saccharomyces cerevisiae in the study of genetic networks (reviewed in Dixon et al. 2009). Whole-genome sequencing should help close the gap and offers potentially new ways to study gene networks.
Acknowledgments
We thank Bob Waterston for advice and many helpful comments and Mark Johnston for editorial comments. We also thank Oliver Hobert for communication of data in advance of publication. We thank the British Columbia Cancer Association Genome Sciences Centre Functional Genomics Group for expert technical assistance in library construction and sequencing. This research was supported by grants from Genome Canada and Genome British Columbia, the Michael Smith Research Foundation, and the Canadian Institute for Health Research to D.G.M. and by National Institutes of Health National Human Genome Research Institute grant P41HG003652 to R.B. M.A.M. and S.J.J. are Scholars of the Michael Smith Foundation for Health Research. D.G.M. is a Fellow of the Canadian Institute for Advanced Research.
Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.110.116616/DC1.
References
- Acevedo-Arozena, A., S. Wells, P. Potter, M. Kelly, R. D. Cox et al., 2008. ENU mutagenesis, a way forward to understand gene function. Annu. Rev. Genomics Hum. Genet. 9 49–69.
- Anderson, K. V., 2000. Finding the genes that direct mammalian development: ENU mutagenesis in the mouse. Trends Genet. 16 99–102.
- Anderson, P., 1995. Mutagenesis. Methods Cell Biol. 48 31–58.
- Barstead, R. J., and D. G. Moerman, 2006. C. elegans deletion mutant screening. Methods Mol. Biol. 351 51–58.
- Bautz, E., and E. Freese, 1960. On the mutagenic effects of alkylating agents. Proc. Natl. Acad. Sci. USA 46 1585–1594.
- Brenner, S., 1974. The genetics of Caenorhabditis elegans. Genetics 77 71–94.
- C. elegans Sequencing Consortium, 1998. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282 2012–2018.
- Coulondre, C., and J. H. Miller, 1977. Genetic studies of the lac repressor. III. Additional correlation of mutational sites with specific amino acid residues. J. Mol. Biol. 117 525–567.
- Davis, M. W., M. Hammarlund, T. Harrach, P. Hullett, S. Olsen et al., 2005. Rapid single nucleotide polymorphism mapping in C. elegans. BMC Genomics 6 118.
- Denver, D. R., P. C. Dolan, L. J. Wilhelm, W. Sung, J. I. Lucas-Lledo et al., 2009. A genome-wide view of Caenorhabditis elegans base-substitution mutation processes. Proc. Natl. Acad. Sci. USA 106 16310–16314.
- De Stasio, E. A., and S. Dorman, 2001. Optimization of ENU mutagenesis of Caenorhabditis elegans. Mutat. Res. 495 81–88.
- De Stasio, E., C. Lephoto, L. Azuma, C. Holst, D. Stanislaus et al., 1997. Characterization of revertants of unc-93(e1500) in Caenorhabditis elegans induced by N-ethyl-N-nitrosourea. Genetics 147 597–608.
- Dixon, S. J., M. Costanzo, A. Baryshnikova, B. Andrews and C. Boone, 2009. Systematic mapping of genetic interaction networks. Annu. Rev. Genet. 43 601–625.
- Flibotte, S., M. L. Edgley, J. Maydan, J. Taylor, R. Zapf et al., 2009. Rapid high resolution single nucleotide polymorphism-comparative genome hybridization mapping in Caenorhabditis elegans. Genetics 181 33–37.
- Gengyo-Ando, K., and S. Mitani, 2000. Characterization of mutations induced by ethyl methanesulfonate, UV, and trimethylpsoralen in the nematode Caenorhabditis elegans. Biochem. Biophys. Res. Commun. 269 64–69.
- Gilchrist, E. J., and D. G. Moerman, 1992. Mutations in the sup-38 gene of Caenorhabditis elegans suppress muscle-attachment defects in unc-52 mutants. Genetics 132 431–442.
- Greenwald, I. S., and H. R. Horvitz, 1982. Dominant suppressors of a muscle mutant define an essential gene of Caenorhabditis elegans. Genetics 101 211–225.
- Hillier, L. W., G. T. Marth, A. R. Quinlan, D. Dooling, G. Fewell et al., 2008. Whole-genome sequencing and variant discovery in C. elegans. Nat. Methods 5 183–188.
- Jakubowski, J., and K. Kornfeld, 1999. A local, high-density, single-nucleotide polymorphism map used to clone Caenorhabditis elegans cdf-1. Genetics 153 743–752.
- Krawitz, P., C. Rodelsperger, M. Jager, L. Jostins, S. Bauer et al., 2010. Microindel detection in short-read sequence data. Bioinformatics 26 722–729.
- Lewis, E. B., and F. Bacher, 1968. Methods of feeding ethyl methane sulfonate (EMS) to Drosophila. Dros. Inf. Serv. 43 193.
- Li, H., and R. Durbin, 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25 1754–1760.
- Li, H., J. Ruan and R. Durbin, 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18 1851–1858.
- Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan et al., 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25 2078–2079.
- Loveless, A., 1959. The influence of radiomimetic substances on deoxyribonucleic acid synthesis and function studied in Escherichia coli/phage systems. III. Proc. R. Soc. Lond. B Biol. Sci. 150 497–508.
- Maydan, J. S., S. Flibotte, M. L. Edgley, J. Lau, R. R. Selzer et al., 2007. Efficient high-resolution deletion discovery in Caenorhabditis elegans by array comparative genomic hybridization. Genome Res. 17 337–347.
- Maydan, J. S., H. M. Okada, S. Flibotte, M. L. Edgley and D. G. Moerman, 2009. De novo identification of single nucleotide mutations in Caenorhabditis elegans using array comparative genomic hybridization. Genetics 181 1673–1677.
- Michelmore, R. W., I. Paran and R. V. Kesseli, 1991. Identification of markers linked to disease-resistance genes by bulked segregant analysis: a rapid method to detect markers in specific genomic regions by using segregating populations. Proc. Natl. Acad. Sci. USA 88 9828–9832.
- Moerman, D. G., and D. L. Baillie, 1979. Genetic organization in Caenorhabditis elegans: fine-structure analysis of the unc-22 gene. Genetics 91 95–103.
- Morgan, T. H., A. H. Sturtevant, H. J. Muller and C. B. Bridges, 1922. The Mechanism of Mendelian Heredity. H. Holt & Co., New York.
- Muller, H. J., 1927. Artificial transmutation of the gene. Science 66 84–87.
- Piette, J., D. Decuyper-Debergh and H. Gamper, 1985. Mutagenesis of the lac promoter region in M13 mp10 phage DNA by 4′-hydroxymethyl-4,5′,8-trimethylpsoralen. Proc. Natl. Acad. Sci. USA 82 7355–7359.
- Rose, A. M., N. J. O'Neil, M. Bilenky, Y. S. Butterfield, N. Malhis et al., 2010. Genomic sequence of a mutant strain of Caenorhabditis elegans with an altered recombination pattern. BMC Genomics 11 131.
- Russell, W. L., E. M. Kelly, P. R. Hunsicker, J. W. Bangham, S. C. Maddux et al., 1979. Specific-locus test shows ethylnitrosourea to be the most potent mutagen in the mouse. Proc. Natl. Acad. Sci. USA 76 5818–5819.
- Sarin, S., S. Prabhu, M. M. O'Meara, I. Pe'er and O. Hobert, 2008. Caenorhabditis elegans mutant allele identification by whole-genome sequencing. Nat. Methods 5 865–867.
- Sarin, S., V. Bertrand, H. Bigelow, A. Boyanov, M. Doitsidou et al., 2010. Analysis of multiple ethyl methanesulfonate-mutagenized Caenorhabditis elegans strains by whole-genome sequencing. Genetics 185 417–430.
- Shen, Y., S. Sarin, Y. Liu, O. Hobert and I. Pe'er, 2008. Comparing platforms for C. elegans mutant identification using high-throughput whole-genome sequencing. PLoS ONE 3 e4012.
- Sladek, F. M., A. Melian and P. Howard-Flanders, 1989. Incision by UvrABC excinuclease is a step in the path to mutagenesis by psoralen crosslinks in Escherichia coli. Proc. Natl. Acad. Sci. USA 86 3982–3986.
- Sturtevant, A. H., 1965. A History of Genetics. Harper & Row, New York.
- Sulston, J., and J. Hodgkin, 1988. Methods, pp. 587–606 in The Nematode Caenorhabditis elegans, edited by W. B. Wood. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
- Swan, K. A., D. E. Curtis, K. B. McKusick, A. V. Voinov, F. A. Mapa et al., 2002. High-throughput gene mapping in Caenorhabditis elegans. Genome Res. 12 1100–1105.
- Vergara, I. A., A. K. Mah, J. C. Huang, M. Tarailo-Graovac, R. C. Johnsen et al., 2009. Polymorphic segmental duplication in the nematode Caenorhabditis elegans. BMC Genomics 10 329.
- Wicks, S. R., R. T. Yeh, W. R. Gish, R. H. Waterston and R. H. Plasterk, 2001. Rapid gene mapping in Caenorhabditis elegans using a high density polymorphism map. Nat. Genet. 28 160–164.
- Yandell, M. D., L. G. Edgar and W. B. Wood, 1994. Trimethylpsoralen induces small deletion mutations in Caenorhabditis elegans. Proc. Natl. Acad. Sci. USA 91 1381–1385.