Author manuscript; available in PMC: 2021 Jun 16.
Published in final edited form as: Lab Anim (NY). 2019 Sep 1;48(9):267–268. doi: 10.1038/s41684-019-0371-1

A new reference genome sequence for Caenorhabditis elegans?

Kevin L Howe
PMCID: PMC7610999  EMSID: EMS127445  PMID: 31406358

A new study “recompletes” the C. elegans genome sequence, revealing hitherto unseen genes.

The nematode Caenorhabditis elegans was the first multicellular organism to have its genome completely sequenced. The “essentially complete” sequence was published in 1998 [1], and it has been iteratively improved and refined in the twenty years since. Today, the sequence comprises six gap-free chromosomal sequences (with an additional mitochondrial sequence) and is generally regarded as the most complete and accurate metazoan genome sequence available. A recent paper by Yoshimura and colleagues [2] describes a new version of the C. elegans genome sequence. Why is this needed, and why does it represent a valuable new resource for C. elegans genetics and genomics?

The first motivation is accuracy. The canonical reference sequence for C. elegans is derived from a strain that geneticists use ubiquitously as the wild-type worm, known as N2. Many laboratories around the world maintain their own stocks of N2 that are, in theory, all clonal derivatives of the original hermaphrodite selected by Sydney Brenner in the 1960s when he established the field of C. elegans genetics [3]. However, by the time the genome sequencing project began in the 1980s, individual stocks of N2 (e.g. those maintained by different labs) had acquired significant genetic changes during propagation, leading to subpopulations of N2 that are genetically distinct. As a consequence, the current reference genome sequence is essentially a genetic mosaic that does not correspond precisely to any N2 strain that currently exists. The new genome produced by Yoshimura et al. is based on data from a wild-type strain closely related to N2, known as VC2010, for which plentiful supplies have been frozen and deposited in the Caenorhabditis Genetics Center, a resource operated by the University of Minnesota, Twin Cities. This means that, for the first time, geneticists can conduct experiments using a strain that is matched exactly with its genome sequence.

The second motivation is completeness. Two prior studies demonstrated (using different technologies) that as many as 2 million nucleotides may be missing from the current N2 sequence [4,5]. Although both studies discovered new sequence, the genome assemblies they produced were otherwise of inferior quality to the canonical N2 sequence. Yoshimura et al. have used state-of-the-art sequencing and assembly methods to produce a C. elegans genome sequence that not only reveals the hidden sequence previously postulated, but also places it in the context of a highly accurate and highly contiguous assembly that rivals the quality of the current N2 sequence.

Almost all eukaryotic genomes contain repetitive sequence to a greater or lesser degree, and this presents the main challenge in producing highly accurate and contiguous genome sequences. The emergence of so-called “long read” sequencing platforms, such as those produced by Pacific BioSciences (PacBio) and Oxford Nanopore Technologies (ONT), helps to address this problem. Consequently, the development of computational algorithms that “assemble” chromosomes from shorter fragments of DNA sequence has re-emerged as an active field of research, with the aim of producing perfect, gap-free chromosome sequences. However, these algorithms are still confounded by very long stretches of repetitive DNA, which can result in assemblies with many gaps. Yoshimura et al. took a novel approach, deploying several different assembly algorithms on different subsets of the data. The idea is that each “recipe” encodes a different way of dealing with the repeats. Each recipe therefore leaves its own set of gaps in its assembly, but because the gaps left by different recipes are largely independent, the assemblies could be merged using the N2 sequence as a guide; this produced an assembly containing only five gaps. The authors then used ultra-long ONT sequences to attempt to bridge the remaining gaps. The published version of the assembly has only two remaining gaps, with four of the worm’s six chromosomes now gap free.
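The reasoning behind merging can be illustrated with a minimal sketch. It assumes, purely for illustration, that each assembly's gaps have already been projected onto shared guide coordinates as half-open (start, end) intervals; the function names and the toy coordinates are hypothetical and are not taken from Yoshimura et al. Under that assumption, a region stays unresolved after merging only if every assembly has a gap there, so the remaining gaps are the intersection of the individual gap sets.

```python
def intersect_gaps(gaps_a, gaps_b):
    """Return the intervals that are gaps in BOTH assemblies.

    Each input is a list of non-overlapping half-open (start, end)
    intervals in shared guide coordinates (an assumption of this sketch).
    """
    gaps_a = sorted(gaps_a)
    gaps_b = sorted(gaps_b)
    i = j = 0
    shared = []
    while i < len(gaps_a) and j < len(gaps_b):
        start = max(gaps_a[i][0], gaps_b[j][0])
        end = min(gaps_a[i][1], gaps_b[j][1])
        if start < end:
            shared.append((start, end))
        # Advance whichever interval ends first.
        if gaps_a[i][1] < gaps_b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def merge_gap_sets(gap_sets):
    """Gaps left after merging: regions that no assembly managed to span."""
    if not gap_sets:
        return []
    remaining = sorted(gap_sets[0])
    for gaps in gap_sets[1:]:
        remaining = intersect_gaps(remaining, gaps)
    return remaining


# Toy gap coordinates for three hypothetical assembly "recipes".
recipe_1 = [(100, 200), (500, 600)]
recipe_2 = [(150, 250), (800, 900)]
recipe_3 = [(180, 190), (1000, 1100)]
remaining_gaps = merge_gap_sets([recipe_1, recipe_2, recipe_3])
print(remaining_gaps)
```

In this toy case only a single short interval is a gap in all three assemblies, which is why largely independent gap sets shrink quickly when combined.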

The most striking feature of the new genome sequence is that it contains 1.8 million nucleotides of additional DNA sequence compared with the N2 sequence. The majority of this (around 80%) is expansions of “tandem repeats”: segments of DNA that occur in multiple exact (or near exact) copies next to each other. Many ribosomal RNA (rRNA) and transfer RNA (tRNA) genes are arranged in the genome in clusters of tandem repeats; the VC2010 sequence contains many more copies of these genes compared with the reference genome. The authors propose an explanation for the under-representation of these genic repeats in the N2 sequence: that they were lost in the process of cloning the C. elegans DNA in Yeast Artificial Chromosomes (YACs). Their hypothesis is supported by the observation that expansions were enriched in regions of the genome cloned as YACs compared with other methods. They also remark, however, that even the new VC2010 assembly likely under-represents the full complement of these repeats, and that further work will be needed to elucidate the true number of copies of these RNA genes.
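To make the notion of a tandem repeat expansion concrete, here is a small sketch that finds the longest run of back-to-back exact copies of a repeat unit and compares two sequences. The sequences and the unit are invented toy strings, not real C. elegans data, and the function name is hypothetical; near-exact copies, which the text also mentions, are not handled.

```python
def longest_tandem_run(sequence, unit):
    """Return (start, copies) for the longest run of adjacent exact copies of unit.

    Returns (-1, 0) if the unit does not occur at all.
    """
    if not unit:
        raise ValueError("unit must be a non-empty string")
    best_start, best_copies = -1, 0
    unit_len = len(unit)
    for start in range(len(sequence) - unit_len + 1):
        copies = 0
        pos = start
        # Count consecutive copies beginning at this start position.
        while sequence.startswith(unit, pos):
            copies += 1
            pos += unit_len
        if copies > best_copies:
            best_start, best_copies = start, copies
    return best_start, best_copies


# Toy sequences: the "expanded" one carries two extra copies of the unit.
unit = "ACGT"
collapsed = "GG" + unit * 2 + "TT"
expanded = "GG" + unit * 4 + "TT"

_, copies_collapsed = longest_tandem_run(collapsed, unit)
_, copies_expanded = longest_tandem_run(expanded, unit)
extra_nucleotides = (copies_expanded - copies_collapsed) * len(unit)
print(copies_collapsed, copies_expanded, extra_nucleotides)
```

Every start position is checked, which is slow for long sequences but avoids missing runs that begin inside an earlier, shorter run.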

Aside from tandem repeat expansions, the VC2010 assembly includes numerous other insertions, deletions, and duplications with respect to the N2 sequence. In total, the authors propose 53 new protein-coding genes that lie either partially or completely in regions novel to VC2010. Some of these genes lie in segments of DNA that are present only once in the N2 sequence but occur in multiple near-identical copies in VC2010. The authors comment that long-read genome assembly allows, for the first time, the full detection of multigene families, which was only partially possible using traditional methods.

With each new, improved version of a reference genome, analyses and annotations performed against the previous version either need to be “lifted over” to the new version or re-performed from scratch, both of which are costly in terms of resources. It can therefore take several months, or even (in the case of certain large functional genomics projects) years, before the new version of the genome reaches the same level of analysis and annotation as the previous version. This effect has been seen for many genomes, most strikingly the human genome. The WormBase project [6] has acted as the custodian of the C. elegans reference genome for the past 20 years. Currently, WormBase presents the VC2010 sequence as an alternative to the N2 reference sequence. However, it remains unclear at this stage whether VC2010 will ultimately replace the N2 sequence. That will depend (in part) on the extent to which the C. elegans community switches to using VC2010 as the wild-type strain for its laboratory work.
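As a rough illustration of what “lifting over” involves, the sketch below maps coordinates from an old assembly to a new one through a list of ungapped aligned blocks. The block format (old_start, new_start, length), the function names, and the coordinates are assumptions made for this example; this is not the procedure used by WormBase or by any particular lift-over tool.

```python
from bisect import bisect_right


def lift_position(blocks, old_pos):
    """Map a 0-based position from old to new coordinates.

    blocks is a list of (old_start, new_start, length) tuples, sorted by
    old_start and non-overlapping in old coordinates. Returns None when
    the position falls outside every aligned block.
    """
    starts = [block[0] for block in blocks]
    idx = bisect_right(starts, old_pos) - 1
    if idx < 0:
        return None
    old_start, new_start, length = blocks[idx]
    if old_pos < old_start + length:
        return new_start + (old_pos - old_start)
    return None


def lift_interval(blocks, old_start, old_end):
    """Lift a half-open interval; None if either end cannot be mapped."""
    new_start = lift_position(blocks, old_start)
    new_last = lift_position(blocks, old_end - 1)
    if new_start is None or new_last is None:
        return None
    return new_start, new_last + 1


# Toy alignment: the new assembly has 500 extra nucleotides inserted
# at old position 1000, so everything downstream shifts by 500.
blocks = [(0, 0, 1000), (1000, 1500, 2000)]
print(lift_position(blocks, 999))        # upstream of the insertion
print(lift_position(blocks, 1000))       # first position after it
print(lift_interval(blocks, 1200, 1300))
print(lift_position(blocks, 3000))       # beyond the aligned region
```

Features that cannot be mapped (the None cases) or that span an inserted region are the ones that would need to be re-analysed rather than simply shifted.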

Work continues on improving the VC2010 sequence, and the closing of the two remaining gaps is anticipated in the next year (E. Schwarz, pers. comm.). Another recently published study [7] describes a new high-quality genome assembly of a wild isolate strain of C. elegans (CB4856) commonly used in mapping experiments due to its high degree of genetic divergence from N2. Decreasing costs of long-read sequencing and refinement of assembly pipelines such as those deployed by Yoshimura et al. will likely pave the way for the emergence of high-quality and complete genome assemblies for the many C. elegans strains that have been collected and systematized [8]. Making these data sets available in a way that facilitates comparison of, and information transfer between, strain genomes remains a challenge. One potentially fruitful area of research is genome graphs [9], in which the genome sequences for many individuals (or strains) are captured elegantly and efficiently in a single data structure. As these methods mature, a potential model to consider in the future is the provision of a reference genome graph for C. elegans that captures variation across all strains.
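The idea of a genome graph can be sketched with a very small data structure: sequence segments are stored once as nodes, and each strain is recorded as a path through those nodes. The class, the strain labels, and the sequences below are hypothetical toy values chosen for illustration, and the sketch does not follow any specific genome graph format or toolkit.

```python
class GenomeGraph:
    """Toy genome graph: shared segments as nodes, strains as paths."""

    def __init__(self):
        self.nodes = {}   # node id -> DNA segment
        self.paths = {}   # strain name -> ordered list of node ids

    def add_node(self, node_id, segment):
        self.nodes[node_id] = segment

    def add_path(self, strain, node_ids):
        missing = [n for n in node_ids if n not in self.nodes]
        if missing:
            raise KeyError(f"unknown node ids: {missing}")
        self.paths[strain] = list(node_ids)

    def sequence(self, strain):
        """Spell out the full sequence of one strain by walking its path."""
        return "".join(self.nodes[n] for n in self.paths[strain])

    def edges(self):
        """All node-to-node links used by at least one strain."""
        return {
            (a, b)
            for path in self.paths.values()
            for a, b in zip(path, path[1:])
        }

    def shared_nodes(self):
        """Nodes that every strain passes through."""
        if not self.paths:
            return set()
        return set.intersection(*(set(p) for p in self.paths.values()))


graph = GenomeGraph()
for node_id, segment in [(1, "ACGT"), (2, "G"), (3, "T"), (4, "TTAG"), (5, "CCC")]:
    graph.add_node(node_id, segment)

graph.add_path("strain_A", [1, 2, 4])        # baseline
graph.add_path("strain_B", [1, 3, 4])        # single-nucleotide difference
graph.add_path("strain_C", [1, 2, 5, 4])     # insertion of node 5

for strain in graph.paths:
    print(strain, graph.sequence(strain))
print(sorted(graph.shared_nodes()))
```

Here three strain sequences totalling 30 nucleotides are represented by 13 stored nucleotides plus the paths, and the nodes common to all paths show directly where the strains agree.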

Footnotes

Disclaimer:

Howe is a principal investigator with WormBase.

References
