Abstract
Metabarcoding is a valuable tool for characterizing the communities that underpin the functioning of ecosystems. However, current methods often rely on polymerase chain reaction (PCR) amplification for enrichment of marker genes. PCR can introduce significant biases that affect quantification and is typically restricted to one target locus at a time, limiting the diversity that can be captured in a single reaction. Here, we address these issues by using Cas9 to enrich marker genes for long-read nanopore sequencing directly from a DNA sample, removing the need for PCR. We show that this approach can effectively isolate a 4.5 kb region covering partial 18S and 28S rRNA genes and the ITS region in a mixed nematode community, and further adapt our approach for characterizing a diverse microbial community. We demonstrate the ability of Cas9-based enrichment to support multiplexed targeting of several different DNA regions simultaneously, enabling optimal marker gene selection for different clades of interest within a sample. We also find a strong correlation between input DNA concentrations and output read proportions for mixed-species samples, demonstrating that relative species abundance can be quantified. This study lays a foundation for targeted long-read sequencing to more fully capture the diversity of organisms present in complex environments.
Keywords: nanopore sequencing, Cas9, targeted enrichment, metabarcoding
1. Introduction
The characterization of microorganisms and meso-faunal communities is a powerful tool in environmental, agricultural, medical and biotechnological settings due to the essential roles these organisms play across diverse biological systems [1,2]. For example, nematodes are excellent bioindicators of soil health and quality and can be used to accurately predict primary decomposition pathways [3,4]. The main challenge to unlocking the use of this community knowledge is accurately identifying the often large numbers of species that are present and quantifying their relative abundance. Unfortunately, traditional methods like morphological identification are time-consuming and require a high level of taxonomic expertise. This puts them outside the reach of most studies and limits the depth of analysis that can be performed [5].
Metabarcoding is a powerful alternative to traditional methods of identification that typically require visual observations and/or physical separation of individual organisms prior to further analysis (e.g. Gram staining of bacteria). Metabarcoding typically uses polymerase chain reaction (PCR) followed by high throughput sequencing to amplify and sequence molecular barcodes, such as the commonly used cytochrome c oxidase subunit I (COI) [6–8], 16S or 18S ribosomal (r)RNA genes [9,10], and internal transcribed spacer (ITS) region [11,12], from a metagenomic sample [13,14]. This allows for the identification of many species simultaneously from mixed species samples, like those taken from water or soil [8–10,15]. Metabarcoding has been valuable in answering ecological questions such as ecosystem biomonitoring [7,8,16], revealing dietary profiles using fecal DNA [17,18], reconstructing food webs [19] and ancient community dynamics from relic DNA [20].
While a useful tool, PCR-based metabarcoding does have limitations. First, the nucleotide composition of template DNA can influence PCR efficiency. For example, homopolymers, GC-rich regions, and inverted repeats can be problematic for amplification [21–24]. Also, stochastic variation early on in the PCR process can lead to significant bias in the final template-to-product ratio [21,25], causing skewed estimates of relative abundance of different species [21]. Furthermore, current PCR-based barcoding is often limited by the taxonomic resolution achievable through the use of a single barcode region (typically 200−600 bp long), which captures only partial variation of interspecific nucleotides along these genes [26,27].
Metabarcoding studies that aim to capture phylogenetically diverse communities may use a ‘universal barcode’ such as the COI gene [6]. However, the true ‘universality’ of universal barcodes is contested [28], as amplification success rates are low for certain taxonomic groups, skewing diversity estimates [27,29,30]. Due to these issues, metabarcoding studies that want to characterize complex communities often face a trade-off between diversity and resolution. To address this, multiple PCR assays can be performed to target different taxa within the sample, using optimized primer pairs for each target [12,16]. However, this is time-consuming and hampers comparisons of abundance between taxa.
Third-generation sequencing platforms, from Oxford Nanopore Technologies (ONT), PacBio, Quantapore and Stratos, provide an alternative to amplification-based DNA sequencing. These technologies can sequence native DNA and RNA, eliminating the need for PCR. The long-read capabilities of these third-generation platforms also allow for longer barcode regions to be sequenced, capturing a greater proportion of interspecific nucleotide variation. This increases taxonomic resolution and allows for species, or even strain, identification [31–33]. Nanopore sequencing has been applied to microbial, nematode and vertebrate diversity studies [34–38]. There are two studies to date that use nanopore sequencing for nematode identification, both of which use individual organisms as starting material [37,39]. Nanopore sequencing has not yet been used to identify nematode species from metagenomic samples, as has been done for microbes [38]. Although not a requisite for nanopore sequencing, all of these long-read-based studies use PCR as a mode of enrichment of target genes and so still face issues in quantifying diverse communities [36,38].
A novel approach to enrich for specific nucleotide sequences employs the use of clustered regularly interspaced short palindromic repeats (CRISPR) systems. CRISPR naturally occurs as a defence system in bacteria and archaea against foreign viruses and plasmids using nucleases such as Cas9 to target and cleave specific DNA sequences [40,41]. The ability to easily target DNA cleavage via CRISPR (cr)RNAs has led to CRISPR becoming widely adopted as a flexible tool for genome editing [42,43]. CRISPR-based technology has also been used to isolate specific regions of native DNA for sequencing, removing the need for enrichment via PCR [44]. This approach is compatible with nanopore sequencing and has been used to isolate the human breast cancer gene BRCA1, to characterize genetic variants where the region of interest was excised and then isolated using pulsed-field gel electrophoresis (PFGE) [45].
More recently, a targeted nanopore sequencing approach using Cas9 was proposed [46]. This method relies on the cleavage of native target DNA by the Cas9 enzyme [47] and the attachment of sequencing adapters to the cleaved target DNA (figure 1a). To date, Cas9-based enrichment has been used to increase sequencing coverage at specific regions of interest to assess single nucleotide polymorphisms [48], structural variations [48], repetitive regions [49,50] and epimutations [50]. This method of enrichment also enables multiple DNA targets to be enriched simultaneously by multiplexing (i.e. combining) up to 100 crRNA probes that guide cleavage to multiple target regions in a single library preparation [46]. This contrasts with PCR amplification as PCR requires specific annealing temperatures for each PCR primer, making it unfeasible to include multiple primers that target different regions. While this type of enrichment has been utilized for environmental applications such as sequencing fish mitogenomes [51] and characterizing colour-controlling loci in apples [52], and even for mixed-species chloroplast sequencing [53], this CRISPR-Cas9 metabarcoding approach has not been used for micro-organisms or meso-fauna, or used in conjunction with multiplexed gRNAs to target multiple loci across diverse species.
Figure 1.
Overview of the Cas9-based targeted sequencing approach. (a) Major steps in the Cas9-based enrichment. 1. Targeting of Cas9 using crRNAs that bind to flanking points of a region of interest (ROI). 2. Cleavage of sample DNA and protection of fragment ends that should not be enriched. 3. Ligation of sequencing adapters to unprotected ends of DNA fragments from the ROI. 4. Nanopore sequencing and bioinformatics to recover enriched data for the ROI. (b) PCR amplification model. Left plot shows the expected PCR amplification of two different products, ‘A’ (black) and ‘B’ (red), that start at identical concentrations of 1 molecule/sample each using primers that are 100% and 90% efficient, respectively. Right plot shows the ratio B/A and the large deviation in the expected 1 : 1 ratio (dotted line) after 25 cycles of PCR. (c) Cas9-based enrichment model where Cas9–gRNA complexes are assumed in excess. Left plot shows the expected enrichment (as a fraction) of two different products ‘A’ (black) and ‘B’ (red) that have an equal concentration in the sample, but where the rate of cleavage of the Cas9–gRNA complex (i.e. enrichment) in B is half that of A (rates of 0.5 and 1.0, respectively). Right plot shows the ratio B/A and the strong trend towards the expected ratio of 1 (dotted line) as the length of the reaction increases.
In this paper, we address this gap and apply targeted nanopore sequencing in a metabarcoding context to detect and quantify species in single- and mixed-species samples. We demonstrate that our method of amplification-free metabarcoding allows us to target multiple barcode regions across taxonomic groups to more fully characterize diverse communities in a single reaction. This removes the need to carry out multiple PCRs to fully uncover diversity, saving considerable time and resources. Furthermore, the lack of PCR helps to reduce potential biases surrounding differences in amplification efficiency between taxonomic groups, allowing for improved estimates of relative abundance. We propose this method as a tool to holistically characterize biodiversity in complex environments, exploiting the ability of amplification-free, long-read metabarcoding to enhance the breadth and depth of community profiling.
2. Results
2.1. Cas9-based enrichment for metabarcoding
Targeted nanopore sequencing using Cas9-based enrichment works by cleaving target DNA sequences at user-defined points and attaching nanopore sequencing adapters to one of the cleaved ends (figure 1a) [46]. Prior to cleavage, the ends of extracted genomic DNA are dephosphorylated to avoid ligation of nanopore sequencing adapters to non-target DNA that may have been fragmented during extraction. Then, custom 20-mer CRISPR (cr)RNA sequences (‘probes’) are combined with catalytic trans-activating CRISPR (tracr)RNA, forming a guide (g)RNA targeting mechanism for the Cas9 nuclease, which can then bind and cleave at sequences identical or highly similar to the crRNA. The 20-mer target site (the protospacer) must be adjacent to an NGG-sequence protospacer-adjacent motif (PAM) for the Cas9 to function efficiently. The Cas9 cleaves the target sequence 3 bp upstream of the PAM. These cuts expose a 5′ phosphate group, onto which nanopore sequencing adapters can be ligated; however, the Cas9–gRNA complex also blocks ligation to the end of one of the fragments (the non-targeted region). Therefore, the resulting library contains DNA molecules where sequencing adapters are mostly attached to the target DNA. This enriched library results in a greater ratio of target to non-target DNA being sequenced than if no enrichment took place.
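The targeting rule described above (a 20-mer protospacer immediately followed by an NGG PAM, with cleavage 3 bp upstream of the PAM) can be illustrated with a minimal Python sketch. The function name `find_cas9_sites` and the toy sequence are hypothetical and introduced only for illustration; the sketch scans a single strand and ignores mismatch tolerance, so it is not a substitute for a dedicated guide-design tool.

```python
import re

def find_cas9_sites(seq: str, protospacer_len: int = 20):
    """Return (protospacer, pam, cut_index) for every NGG PAM on the given strand.

    cut_index is the 0-based index of the first base downstream of the cut,
    so the cut falls 3 bp upstream of the PAM.
    """
    seq = seq.upper()
    sites = []
    # A lookahead is used so that overlapping PAMs (e.g. in "AGGG") are all found.
    for match in re.finditer(r"(?=([ACGT]GG))", seq):
        pam_start = match.start()
        if pam_start < protospacer_len:
            continue  # not enough upstream sequence for a full protospacer
        protospacer = seq[pam_start - protospacer_len:pam_start]
        sites.append((protospacer, match.group(1), pam_start - 3))
    return sites

# Toy sequence (hypothetical): 5 nt of padding, a 20-mer, a TGG PAM, 4 nt of padding.
toy = "AAAAA" + "ACGTACGTACGTACGTACGT" + "TGG" + "AAAA"
for protospacer, pam, cut in find_cas9_sites(toy):
    print(protospacer, pam, cut)
```

Sites on the opposite strand could be found by running the same function on the reverse complement of the sequence.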
A potential benefit of using Cas9-based enrichment rather than PCR amplification for targeted metabarcoding is that it avoids a known weakness of PCR: small differences in the amplification efficiency of the primers used to target barcode regions across species can lead to large deviations in the recovered abundances. To demonstrate this, we developed idealized mathematical models of the PCR amplification and Cas9-based enrichment methods. In both, we assumed two different target species (‘A’ and ‘B’) with identical concentrations, but where amplification or enrichment efficiency differed between the species. For PCR amplification (figure 1b), if the efficiency of amplification of B is only 10% lower than for A, then after 25 cycles, the original ratio of B/A = 1 will have fallen to 0.29. In contrast, Cas9-based enrichment (figure 1c) is less affected by differences in the efficiency of cleavage for crRNAs in the target species. If the Cas9–gRNA complex is in excess for any enrichment reaction, and assuming the reaction is run for a sufficiently long time, then even when there are large differences in cleavage efficiency (e.g. cleavage rate of B is half that of A), correct ratios of the underlying species can still be recovered (figure 1c).
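The contrast between the two idealized models can be sketched in a few lines of Python. The functional forms below are assumptions chosen as the simplest ones consistent with the description: each PCR cycle multiplies copy number by (1 + efficiency), and Cas9 cleavage follows first-order kinetics with the Cas9–gRNA complex in excess. They are not necessarily the exact equations behind figure 1. Under this PCR form the B/A ratio is 0.95 raised to the number of cycles, which gives roughly 0.28 to 0.29 at 24 to 25 cycles depending on how cycles are counted.

```python
import math

def pcr_ratio(eff_a: float, eff_b: float, cycles: int) -> float:
    """B/A product ratio when each cycle multiplies copies by (1 + efficiency)."""
    return ((1.0 + eff_b) / (1.0 + eff_a)) ** cycles

def cas9_fraction_cleaved(rate: float, time: float) -> float:
    """Fraction of targets cleaved under assumed first-order kinetics."""
    return 1.0 - math.exp(-rate * time)

def cas9_ratio(rate_a: float, rate_b: float, time: float) -> float:
    """B/A ratio of cleaved (enriched) product at a given reaction time."""
    return cas9_fraction_cleaved(rate_b, time) / cas9_fraction_cleaved(rate_a, time)

# PCR: efficiencies of 100% (A) and 90% (B); the ratio drifts away from 1.
for n in (5, 15, 25):
    print(f"PCR  cycles={n:2d}  B/A={pcr_ratio(1.0, 0.9, n):.2f}")

# Cas9: cleavage rates of 1.0 (A) and 0.5 (B), arbitrary time units;
# the ratio approaches 1 as the reaction runs longer.
for t in (1, 5, 10):
    print(f"Cas9 time={t:2d}    B/A={cas9_ratio(1.0, 0.5, t):.2f}")
```

The key difference is that PCR bias compounds multiplicatively with every cycle, whereas cleavage saturates: once nearly all targets of both species are cut, the rate difference no longer affects the ratio.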
To evaluate the potential of Cas9-based enrichment for targeted metabarcoding, we tested its ability to enrich typical barcode regions in the nematode Caenorhabditis elegans. We designed crRNA probes to target a 4.5 kb region within the rDNA tandem repeat. The target region included partial 18S rDNA (v6–v8 regions), 28S rDNA (D1–D10 regions), both ITS regions and the complete 5.8S rDNA gene. We applied the enrichment strategy to extracted C. elegans genomic (g)DNA (figure 2). To measure enrichment, we calculated an enrichment score, E, representing the fold increase in reads covering the target sequence compared to the expected coverage if reads were evenly distributed across the genome (i.e. without enrichment). This can be calculated using,
Figure 2.
Design of crRNA probes targeting rRNA encoding genomic regions. (a) crRNA probe locations within the rRNA encoding genomic regions for nematodes (top), bacteria (middle) and yeast (bottom). Direction of the crRNA is shown by a small black triangle. (b) Alignment of crRNA sequences to corresponding genomes. Sequences for species without documented genomes use 18S (for forward crRNAs) and 28S (for reverse crRNAs) sequences for this alignment (electronic supplementary material, data S2) as the crRNAs sit within these barcodes. Mismatches to crRNA (boxed sequence) highlighted in red. Note that the Cas9 ‘NGG’ protospacer adjacent motif (PAM) sequence (bold, blue) is not part of the crRNA sequence.
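$$E = \frac{n_{\mathrm{on}}}{n_{\mathrm{total}}} \times \frac{G}{L \times C} \qquad (2.1)$$
where $n_{\mathrm{on}}$ is the number of on-target reads, $n_{\mathrm{total}}$ is the total number of reads, $G$ is the genome size, $L$ is the length of the target region and $C$ is the number of copies of the target region (see table 1 for genome sizes and copy numbers).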
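As a worked illustration, the enrichment score can be computed as the observed on-target read fraction divided by the fraction expected from genome size, target length and copy number. The function name and the read counts below are hypothetical; the counts are chosen only to reproduce an on-target fraction of 83.3%, and the genome size (100.2 Mbp), target length (about 4.5 kb) and copy number (55) are the C. elegans values given in the text and in table 1.

```python
def enrichment_score(on_target_reads: int, total_reads: int,
                     genome_size_bp: float, target_length_bp: float,
                     target_copies: int) -> float:
    """Fold enrichment of on-target reads over an even genome-wide distribution."""
    observed_fraction = on_target_reads / total_reads
    # Fraction of the genome occupied by all copies of the target region.
    expected_fraction = target_length_bp * target_copies / genome_size_bp
    return observed_fraction / expected_fraction

# Hypothetical read counts giving 83.3% on-target reads.
score = enrichment_score(833, 1000, 100.2e6, 4500, 55)
print(f"E = {score:.0f}-fold")
```

With these inputs the score comes out at a little over 337-fold, in line with the mean enrichment reported for the C. elegans replicates.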
Table 1.
Organisms used in the mock communitiesᵃ.
| species | type | genome size (Mbp) | gDNA composition by mass (%) | rDNA copies per genome | abundance (%) based on 16S and 18Sᵇ |
|---|---|---|---|---|---|
| Caenorhabditis elegans | nematode | 100.2 | variableᶜ | 55ᵈ | — |
| Caenorhabditis remanei | nematode | 124.8 | variableᶜ | 183ᵈ | — |
| Caenorhabditis drosophilae | nematode | 51.3 | variableᶜ | — | — |
| Heterorhabditis bacteriophora | nematode | 77 | variableᶜ | — | — |
| Steinernema feltiae | nematode | 91 | variableᶜ | — | — |
| Panagrellus redivivus | nematode | 65 | variableᶜ | — | — |
| Listeria monocytogenes | bacteria | 2.99 | 12 | 6 | 12.4 |
| Pseudomonas aeruginosa | bacteria | 6.8 | 12 | 4 | 3.6 |
| Bacillus subtilis | bacteria | 4.0 | 12 | 10 | 15.3 |
| Escherichia coli | bacteria | 4.9 | 12 | 7 | 8.9 |
| Salmonella enterica | bacteria | 4.8 | 12 | 7 | 9.1 |
| Lactobacillus fermentum | bacteria | 1.9 | 12 | 5 | 16.1 |
| Enterococcus faecalis | bacteria | 2.8 | 12 | 4 | 8.7 |
| Staphylococcus aureus | bacteria | 2.7 | 12 | 6 | 13.6 |
| Saccharomyces cerevisiae | yeast | 12.1 | 2 | 109 | 9.3 |
| Cryptococcus neoformans | yeast | 18.9 | 2 | 60 | 3.3 |
ᵃ Separate communities studied are grouped in the table and unknown quantities are denoted with a dash.
ᵇ Calculated based on rDNA copy number and genome size given by Zymobiomics.
ᶜ For details, see electronic supplementary material, data S2.
We obtained between 0.46 and 1.64 M reads from the sequencing experiments, with the number of reads increasing with the starting DNA mass (electronic supplementary material, data S2). A high level of enrichment of the target region was observed for all samples. A mean of 83.3% of reads were on target, with an average enrichment of E = 337-fold that was consistently high across all replicates (E = 348-, 344-, 318-fold for replicate samples 1 a–c, respectively) (figure 3). Read coverage of the C. elegans genome increased sharply at the beginning of the target region and decreased sharply at the end of the target region, points where the ‘forward’ and ‘reverse’ crRNA probes guided cleavage, respectively (figure 3). Coverage for forward and reverse strands was approximately equal for all replicates, demonstrating equal activity of forward and reverse probes (figure 3). A distinct peak of 4.5 kb was observed in the raw read lengths in all three replicates, matching the anticipated fragment lengths from simulated Cas9 digestion (figure 4).
Figure 3.
Coverage plots of C. elegans depicting enrichment of target region. Coverage for targeted sequencing of a pure C. elegans sample given as the proportion of total forward/reverse bases per sequencing run that map to a nucleotide position. Target region covers genomic coordinates chrI:15063336−15067815 in the C. elegans genome, marked by the positions of the crRNAs: nemFI, nemFII, nemRI and nemRII. Solid lines represent the mean coverage across three replicates and shaded area denotes the absolute deviation from the mean.
Figure 4.
Read length distributions for the four communities assessed. Blue solid lines show the average of the three replicates and shaded blue regions denote the absolute deviation from the mean. Grey-shaded regions across samples correspond to the expected target lengths of the rRNA regions in bacteria, yeast and nematodes. Peaks for specific species have been annotated. The peak denoted with a star highlights unexpected 1.4 kb reads which are bacterial reads cleaved by off-target activity of yeast crRNAs.
2.2. Targeted nanopore sequencing of mock nematode communities
Having demonstrated that Cas9-based enrichment is effective for targeted sequencing of a barcode region in C. elegans, we next applied the same set of crRNA probes to metagenomic samples to test whether multiple species could be identified using this method. Two sets of custom mock nematode communities were used; one consisting of three species: C. elegans, Heterorhabditis bacteriophora and Steinernema feltiae (samples 2 a–c), and the other consisting of six species: C. elegans, H. bacteriophora, S. feltiae, C. remanei, C. drosophilae and Panagrellus redivivus (samples 3 a–c).
Distinct peaks between 4.2 and 4.9 kb were observed in the read length distributions for all nematode sequencing runs (figure 4), confirming the expected fragment lengths for the enrichment protocol. As the samples become more diverse in species, peaks become more numerous around the 4.5 kb mark due to interspecific variation in rDNA length, demonstrating the generation of target read lengths for the multiple species (figure 4).
Sequencing data were further analysed using a metagenomics workflow. When using a 98% similarity threshold between nanopore reads and reference sequences, all nematodes were detected to genus level in all replicates for both the three- and six-species communities. Between two and four extra genera were identified in the analysis of each dataset, but false-positive genera had read abundances of 0.08% of the total mapped reads, and most had only a single read classified to the genus. When analysis was done to a species level, many false positives occurred, and up to 31 species were classified. Total input DNA ranged from 510 to 2917 ng for the mock nematode communities (electronic supplementary material, data S2), and the number of reads generated from sequencing increased as the amount of input DNA increased.
2.3. Multiplexing crRNA probes to characterize complex microbial communities
To analyse the relative abundance of taxa in a sample using DNA marker abundance, copy numbers of the genes being used are required to account for repeats of the target gene within a genome [56]. As this information is not known for all nematode species used in this study, quantitative analysis could not be performed for the nematode mock communities. To test the ability of this method to quantify relative abundance of species, we applied our method to the Zymo microbial community DNA standard. This contains 10 species of bacteria and yeast in known gDNA proportions with defined marker gene abundance, and available genome sequences (table 1). We multiplexed two sets of crRNA probes in a single library preparation, one set targeting the rDNA region in bacteria and the other set the rDNA region in yeast. The target regions included commonly used 16S, 18S and ITS barcode regions (figure 2).
Despite long sequencing times, the generation of reads tended to plateau after 12−24 h, depending on the amount of input DNA, and qualitative and quantitative results were consistent across samples despite differences in total input DNA mass and sequencing times (electronic supplementary material, data S2). Distinct peaks were observed for the expected read lengths of each taxon: 1.9 kb in C. neoformans, 2.2 kb in S. cerevisiae, and 3.3−3.7 kb in bacterial species (interspecific variation of rDNA lengths resulted in multiple peaks in the read length distributions) (figure 4; electronic supplementary material, table S1). A large unexpected peak was also observed around 1.4 kb (figure 4). Further investigation of this peak found that there is a small region in the bacterial genome, 1.4 kb upstream of the bacRII probe, with sequence similarity to the yeaRII probe, albeit with two mismatches. It is therefore likely that the peak is caused by off-target activity of the yeast probe, cleaving some of the target reads of the bacteria. Nevertheless, the remaining 1.4 kb of bacterial rDNA sequence is ample for taxonomic classification, and did not seem to skew results.
Analysis of the sequencing data using the metagenomics workflow estimated approximately 800 species in each sample, a gross overestimation of species richness compared to the ten species actually present in the mock community. However, most species from this report had an abundance of 1%, including expected species from the mock community. Nanopore R9.4.1 flow cells generate reads with an accuracy of 96.5%, which is not sufficient to classify single reads to species level and is the probable cause of the overestimated diversity. As a result, we performed the analysis at genus level. At the genus level, the metagenomics workflow estimated 380 genera, still a large overestimation, but all genera beyond those truly present had an abundance of 0.3%. The dominant ten genera in the metagenomics report matched the expected ten genera in close to expected proportions (electronic supplementary material, data S3).
To more precisely assess the ability of our enrichment method to capture accurate relative species abundance, we made use of the known rDNA copy numbers and genome sizes of each organism within the microbial community (table 1). We carried out a quantitative analysis using an alignment workflow, providing the workflow with only reference sequences of the species present in the sample (the known ground truth). This helped us to remove confounding effects of the error-prone reads being compared to a very large reference database, inflating diversity estimates and possibly skewing relative abundance measures.
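How expected marker proportions follow from gDNA mass, genome size and rDNA copy number can be sketched as below. This is an illustrative calculation, not the documented derivation of the table values: it assumes genome copies are proportional to gDNA mass divided by genome size, and marker copies are genome copies multiplied by rDNA copies per genome. The function name is hypothetical. With the values from table 1, this simple weighting gives numbers close to the final column of the table (within about 0.2 percentage points).

```python
# (gDNA mass %, genome size in Mbp, rDNA copies per genome), values from table 1.
community = {
    "Listeria monocytogenes": (12, 2.99, 6),
    "Pseudomonas aeruginosa": (12, 6.8, 4),
    "Bacillus subtilis": (12, 4.0, 10),
    "Escherichia coli": (12, 4.9, 7),
    "Salmonella enterica": (12, 4.8, 7),
    "Lactobacillus fermentum": (12, 1.9, 5),
    "Enterococcus faecalis": (12, 2.8, 4),
    "Staphylococcus aureus": (12, 2.7, 6),
    "Saccharomyces cerevisiae": (2, 12.1, 109),
    "Cryptococcus neoformans": (2, 18.9, 60),
}

def expected_marker_proportions(members):
    """Expected percentage of marker (rDNA) copies contributed by each species."""
    # Assumption: genome copies scale with mass / genome size,
    # and marker copies scale with genome copies * rDNA copies per genome.
    weights = {species: mass / size * copies
               for species, (mass, size, copies) in members.items()}
    total = sum(weights.values())
    return {species: 100.0 * w / total for species, w in weights.items()}

for species, percent in expected_marker_proportions(community).items():
    print(f"{species:28s} {percent:5.1f}%")
```

Small genomes with many rDNA copies (for example L. fermentum) contribute far more marker copies per unit of DNA mass than large genomes with few copies (for example P. aeruginosa), which is why copy number and genome size must be known before read proportions can be compared with input proportions.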
Agreement between the output proportions from our method and true community proportions (electronic supplementary material, data S3) was assessed using Lin’s concordance correlation coefficient (CCC) [57]. CCC assesses how well pairs of observations conform relative to another set, measuring both precision and accuracy. Observed versus expected proportions of each species exhibited high concordance, with the CCC = 0.87 (p = , t = 9.67, df = 29) for combined data across three replicates (figure 5). Yeast in our samples (S. cerevisiae and C. neoformans) were slightly over-represented in the output of all sequencing runs compared to bacteria, and L. monocytogenes was slightly under-represented (figure 5).
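For readers unfamiliar with the statistic, Lin’s concordance correlation coefficient can be computed as twice the covariance divided by the sum of the two variances and the squared difference of the means. The sketch below is a plain-Python illustration using population (1/n) moments, as in Lin’s original definition; the function name and the example values are hypothetical and are not data from this study.

```python
def lins_ccc(x, y):
    """Lin's concordance correlation coefficient using population (1/n) moments."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined when both inputs are the same constant")
    return 2 * cov_xy / denominator

# Hypothetical expected and observed percentages, for illustration only.
expected = [10.0, 20.0, 30.0, 40.0]
observed = [12.0, 18.0, 33.0, 37.0]
print(round(lins_ccc(expected, observed), 3))
```

Unlike Pearson’s correlation, the coefficient is penalized by any systematic shift or scale difference between the two sets of values, so a value near 1 requires the points to lie close to the line of equality and not just on any straight line.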
Figure 5.
Analysis of a mixed microbial community using multiplexed crRNAs to target prokaryotic and eukaryotic species simultaneously. (a) Percentage of reads for each species in the ZymoBIOMICS Microbial Community DNA Standard (Lm: Listeria monocytogenes, Pa: Pseudomonas aeruginosa, Bs: Bacillus subtilis, Ec: Escherichia coli, Se: Salmonella enterica, Lf: Lactobacillus fermentum, Ef: Enterococcus faecalis, Sa: Staphylococcus aureus, Sc: Saccharomyces cerevisiae, and Cn: Cryptococcus neoformans). Coloured bars correspond to measured values from targeted sequencing. Individual replicate measurements shown by open circles for bacteria and triangles for yeasts, and error bars show the standard deviation. Dark grey bars correspond to ground truth percentages of the species composition. (b) Comparison of observed and expected percentages of each species in the ZymoBIOMICS Microbial Community DNA Standard. Individual data points shown for each of the three biological replicates. Colouring of data points matches bar colours in (a). Dashed line denotes perfect concordance between observed and expected percentages, and the value shown is Lin’s concordance correlation coefficient.
3. Discussion
In this study, we have demonstrated the ability to perform targeted nanopore sequencing using Cas9-based enrichment for the identification of nematodes and microbes in metagenomic samples without PCR amplification. High levels of enrichment of the target rDNA region in C. elegans demonstrate this is a feasible alternative to enrichment of barcodes using PCR amplification, which can introduce bias when analysing metagenomic samples. The method reliably identified nematodes, bacteria, and yeast to genus level with false positives at very low abundance. A key feature of this method is the ability to multiplex crRNA probes to target diverse phylogenies in one reaction, an approach that is not possible with PCR primers due to the different temperature requirements of each primer during annealing. Moreover, the combination of the Cas9-based enrichment method with a long-read sequencing platform allows long DNA regions to be sequenced, which can increase taxonomic resolution by encompassing full-length or multiple barcode regions that capture greater phylogenetic variation [31,32,58]. In our study, multiplexing allowed not only qualitative analysis of a community but also measurement of relative abundance in a diverse metagenomic community consisting of both prokaryotes and eukaryotes. The ability to multiplex crRNA probes to target diverse taxa in a single reaction is a promising and novel quality of the Cas9-based metabarcoding method. Up to 50 synthetic crRNA probes have been pooled in a targeted sequencing experiment [46], while a staggering 1100 in vitro transcribed (IVT) crRNAs were effectively used in a single targeted sequencing experiment [59]. Increasing the number of crRNAs might, however, increase the potential for off-target effects. Careful design of crRNAs and the use of bioinformatic tools to assess the specificity of the target sites could help minimize off-target effects.
For successful multiplexing, it is important that gRNAs are in excess and that cleavage reactions are allowed to run for sufficient time to ensure complete cleavage. If these conditions are not met, there is a risk of over-representing low-abundance taxa because taxa with lower DNA abundance would have a higher gRNA : DNA ratio, leading to disproportionate enrichment. However, this potential bias can be avoided by ensuring that the reaction is allowed to proceed for sufficient time, allowing all target sequences to be properly cleaved (figure 1c).
While our method was reliable at identifying taxa to genus level for genera above 0.3% abundance, there was an overestimation of taxa at species level. However, this is a limitation of error-prone nanopore sequences rather than the Cas9-based enrichment approach. The average error rate of nanopore reads of 5−6% using R9.4.1 flow cells explains the generation of reads with false dissimilarity to species reference sequences. The newest R10.4 flow cells achieve read accuracies of over 99.1% [60], and so adaptation of our method for these flow cells would greatly increase the accuracy of taxonomic assignments to the species level. Higher accuracy reads from R10.4 flow cells would also mean more reads would pass quality filtering, allowing a fuller exploitation of the available data. Alternatively, with some modification to adapter ligation and computational pipelines, the Cas9-based enrichment approach to metabarcoding could be adapted to other long-read sequencing platforms, such as Pacific Biosciences (PacBio).
While greater read accuracy would improve taxonomic assignment when target sequences have sufficient interspecific variation, it still holds that targeting short regions of a few hundred base pairs can limit delineation in closely related species [26]. We used an 18S reference database for taxonomic assignment of nematode sequences, but it has been shown that partial 18S sequences cannot distinguish between closely related Caenorhabditis species due to a lack of interspecific polymorphisms in that region [26]. While even family-level discrimination in nematodes can be sufficient for insight into ecosystem function [61], it is argued that species-level knowledge provides more information about community structure [62]. Furthermore, for other applications of metabarcoding, such as characterizing human microbiome samples, delineation to species level is advantageous [63,64]. Sequencing a longer region of multiple kilobases that spans full-length or multiple barcode regions can allow for greater taxonomic resolution [31,33,65]. Whilst we used an 18S reference database for nematode classification because of its compatibility with the EPI2ME bioinformatics workflow, our bioinformatics pipeline could be modified to maximally exploit our long-read data that spans small and large rDNA subunits, as well as ITS regions. Reference databases for each barcode of interest could be used together for cross-validation of reference sequences, as done by Heeger et al. for long-read barcoding of aquatic fungi using three databases simultaneously. This would provide better taxonomic resolution than using one single barcode [31].
The use of an amplification-free metabarcoding strategy bypasses difficulties associated with PCR amplification, such as problems amplifying low-complexity regions, or stochastic variation early in the exponential process causing significant skewing of relative abundance estimates [21]. During PCR amplification, GC-rich regions are preferentially amplified due to differences in primer binding energies between AT-rich and GC-rich primers [21]. GC bias is strong in PCR primer–template hybridization because hybridization relies solely on binding between the primer and template and is highly influenced by the annealing temperature of the reaction. On the other hand, GC bias could be less pronounced in gRNA binding in CRISPR-Cas systems due to its additional PAM-site dependence, the stabilizing effects of Cas proteins, and its lower sensitivity to temperature changes [66].
Similarly to PCR primer design, there were some challenges in finding highly conserved sequences targetable by crRNAs. There were some mismatches between crRNA sequence and templates. However, even relative abundance estimates did not seem significantly affected by the mismatches. This could be due to two reasons. First, the use of two gRNAs on each side of each target region might mean that if one sequence is not cleaved, the other gRNA complex acts in a redundant manner. This is supported by findings that using multiple gRNAs for each region of interest improves cleavage rates [46]. Alternatively, tolerance of mismatches by Cas9–gRNA complexes [67] might mean that the mismatches between our crRNA sequences and target genomes did not significantly affect binding and cleavage. Generally, PAM-distal mismatches have a small impact on cleavage of target DNA, whereas PAM-proximal mismatches tend to have a greater effect on binding and cleavage, the degree of which depends on the nucleotide substitution [67]. Baranova et al. [67] found that a C→G substitution on the first nucleotide upstream from the PAM affected cleavage time rather than the degree of cleavage, whereas T→C and T→G substitutions reduced the rate and time of cleavage. So, whilst gRNA binding might not be subject to GC-bias when the probe sequence matches the template, when mismatches are involved, cleavage rates can vary. Our measures of relative abundance strongly matched true community proportions, but L. fermentum was one species that was slightly under-represented. This might have to do with the presence of two mismatches between crRNA sequence and template in this species, in the middle of the crRNA, rather than one mismatch in other sequences or two PAM-distal mismatches in one crRNA–template combination. Increasing numbers of mismatches has been found to decrease the rate of cleavage [67].
Differences in cleavage activity between crRNAs, whether due to mismatches or inherent differences [67], could potentially be addressed by using crRNA sequences with degeneracy at a nucleotide position, as is sometimes done for PCR primers [68], or extending the length of the cleavage reaction (figure 1c) to ensure cleavage runs to completion. Unlike PCR, where primer binding efficiency significantly impacts final product ratios, Cas9 cleavage tends toward a 1:1 ratio between template and enriched product.
Whilst some tolerance to template mismatches is beneficial because it allows crRNA sequences to bind a greater diversity of sequences at one locus, it also makes off-target activity more likely. We encountered this problem when a sequence around 1.4 kb upstream of the reverse bacterial crRNA positions was likely cleaved by a yeast crRNA probe that has two mismatches and a suitable PAM site, shortening the expected 3.5 kb target region to 1.4 kb. Nevertheless, as full-length reference sequences were provided to the analysis workflow, the remaining 1.4 kb of bacterial sequence after off-target cleavage was sufficient to map the shortened reads against the reference and did not seem to impact qualitative or quantitative analyses. However, if the full target region is required for the delineation of closely related species or strains, one should be aware of possible off-target sites within the region of interest when designing crRNA probes. Enzymatic specificity is often dependent on enzyme concentrations, amongst other reaction conditions, so future work could optimize enzyme/gRNA complex concentrations relative to the amount of starting DNA material to minimize off-target activity. Off-target activity outside the region of interest should not cause issues, even if the cut sites are in a suitable orientation to generate extra reads, as sequences from other parts of the genome would be sufficiently divergent so as not to be mistakenly mapped to reference marker barcode sequences.
We found that our results were robust to varying amounts of starting DNA, ranging from 510 to 2917 ng in metagenomic samples. The number of reads generated from sequencing runs increased with increasing mass of starting DNA, but observed proportions of each species remained consistent across replicates. DNA yield that can be obtained by extraction from crude environmental samples can vary greatly [69], so testing the robustness of this method to smaller amounts of starting material would be beneficial for further method validation. Applying the method to DNA extracted directly from environmental samples such as soil or sediment would also be important for broader validation, as would testing the detection limit of Cas9-enrichment by executing similar experiments on mock communities with a larger range of abundances (e.g. using the ZymoBIOMICS microbial community standard with log distribution that contains species in abundances varying from 0.000089% to 89.1%).
In the current study, sequencing was run for up to 72 h to ensure maximal data were generated. However, read generation was typically exhausted after 12−24 h, indicating that such long sequencing times are not necessary. Knot et al. [37] sequenced nematode DNA on a MinION for only 10 min with accurate species assignment, while Hall et al. [65] suggest 1 Gb of sequence data are appropriate for ultra-long read bacterial species identification. The optimal trade-off between sequencing depth and cost will depend on the complexity of the community being sampled and the length of the target region. If the community of interest requires a short sequencing time, combining this with the fast library preparation of Cas9-based enrichment (under two hours) could be used for rapid diversity and relative abundance assessments, bypassing the time-consuming PCR amplification process used in other metabarcoding approaches.
In summary, we show that it is possible to apply Cas9-based enrichment for taxonomic classification and relative abundance measures of metagenomic samples. The ability to multiplex crRNAs to target diverse phylogenies, combined with long-read sequencing technology to increase taxonomic resolution, gives this method great potential for characterizing highly diverse biotic communities. It could also be used for rapid diversity assessments, bypassing the need for time-consuming PCR amplification. Adaptation of our approach to updated flow cell chemistry and powerful computational pipelines will enhance its species-delineation power, ensuring more holistic assessments of biodiversity.
4. Methods
4.1. Organisms used in the study
Strains of C. elegans (AA1), C. remanei (JU724), C. drosophilae (DF5112) and P. redivivus (MT8872) were obtained from the Caenorhabditis Genetics Centre (CGC). Heterorhabditis bacteriophora and Steinernema feltiae, entomopathogenic species commonly used as biological control agents in horticulture, were bought in the form of Nemasys Biological Chafer Grub Killer (BASF) and Nemasys No Ants (BASF), respectively, containing live infective juveniles. The ZymoBIOMICS Microbial Community DNA Standard (Zymo Research, D6306) was used for testing the method on a diverse set of microbes.
4.2. Growth conditions
Nematode strains from the CGC were cultured on nematode growth medium (NGM) plates with OP50 E. coli for 11 days at 25°C using standard methods [70].
4.3. crRNA design
The use of multiple guides on each side of the region of interest has been demonstrated to improve cleavage rates [46]; therefore, two crRNAs were designed for each side of the region of interest (a total of four probes for each target region). In total, twelve crRNAs were designed; four targeting nematode rDNA, four targeting bacterial rDNA and four targeting yeast rDNA (electronic supplementary material, table S1). crRNAs were ordered from Integrated DNA Technologies (IDT).
To design the nematode crRNAs, the position of the rDNA cluster in C. elegans genome WBcel235 (RefSeq GCF_000002985.6) [71] was located using the UCSC genome browser [72]. Within this region, the CRISPR target track on UCSC genome browser was used to search for DNA sequences with NGG PAM sites targetable by the Cas9–gRNA complex [72]. Target sequences were selected based on a combination of an efficiency score [73], specificity (uniqueness of 20-mer sequence in the genome) [74], and the conservation of bases across species as displayed on the Multiz Alignments and Conservation track on the UCSC genome browser [72]. Highly conserved sequences were selected to maximize phylum-wide annealing (figure 2). The region of interest between the custom crRNA guides was 4500 bp long, capturing partial 18S (506 bp) and 28S (2980 bp) rRNA genes, full 5.8S rRNA gene and full ITS1 and ITS2 regions (figure 2). The position of our nemFII crRNA probe overlaps with the position of ‘NF1’ forward PCR primer site used for 18S nematode barcoding [75]. crRNA design was based on the C. elegans genome, and for other species present in the community that do not have annotated genomes, 18S and 28S sequences were aligned to determine the number and position of mismatches between crRNA and template for each species (figure 2; see electronic supplementary material, data S1 for sequence accession numbers).
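The search for candidate target sites with an NGG PAM can be sketched as follows. This is only an illustration of the principle and not the scoring performed by the UCSC CRISPR track or Geneious; the function names and the example sequence are hypothetical, and efficiency, specificity and conservation scoring are deliberately left out.

```python
import re

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_ngg_sites(seq, protospacer_len=20):
    """List candidate Cas9 sites (protospacer followed by an NGG PAM).

    Returns tuples of (strand, start, protospacer, pam), where `start` is
    the 0-based forward-strand coordinate of the leftmost base covered by
    the protospacer plus PAM. Both strands are scanned.
    """
    seq = seq.upper()
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % protospacer_len)
    sites = []
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        # The lookahead lets overlapping candidate sites all be reported.
        for match in pattern.finditer(template):
            start = match.start()
            if strand == "-":
                # Convert the reverse-strand offset to a forward coordinate.
                start = len(seq) - (match.start() + protospacer_len + 3)
            sites.append((strand, start, match.group(1), match.group(2)))
    return sites


# Hypothetical 23-bp sequence containing a single forward-strand site.
example = "ACGTACGTACGTACGTACGT" + "AGG"
for site in find_ngg_sites(example):
    print(site)
```

Candidate sites found in this way would still need to be ranked by efficiency, genome-wide uniqueness and cross-species conservation, as described above.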
The microbial mock community consisted of both bacteria and yeast species. To test the possibility of multiplexing crRNA probes for multiple targets in a single reaction, we designed two sets of crRNA probes to target the two separate taxonomic groups (electronic supplementary material, table S1; figure 2). We designed bacterial crRNA probes that capture 1430 bp of 16S rDNA region in bacterial genomes (figure 2), 16S being the most commonly used sequence for bacterial barcoding [76]. We also designed crRNA probes to target the yeast species present in the mock community, capturing the full nuclear ITS region (figure 2), widely used in fungal barcoding [77]. The final target regions also included other barcode regions that can be used for taxonomic classification, such as the ITS and partial 23S region for prokaryotes, and partial 18S, partial 28S, and ITS2 regions in eukaryotes (figure 2).
Genomes for each species (electronic supplementary material, data S1) were loaded into Geneious Prime version 2025.0 and the 16S or ITS regions were aligned for the bacteria and yeast, respectively. The ‘Find CRISPR sites’ tool was then used in Geneious, and CRISPR sites with the best combination of cutting efficiency scores and conserved bases across the species were chosen, whilst maximizing the length of the target barcode included within the cleavage sites (electronic supplementary material, table S1; figure 2).
4.4. DNA extraction
DNA extraction from nematodes was performed using the Monarch Genomic DNA purification kit (New England Biolabs, T3010S) and the tissue extraction protocol. Samples underwent ethanol precipitation to concentrate DNA. 3 M sodium acetate was added in 0.1 volumes to each sample, followed by 2.5 volumes of absolute ethanol, and the samples were then incubated at −20°C for 30 min. Tubes were centrifuged at 14 000 × g for 30 min at room temperature and supernatant was removed and discarded. Samples were rinsed with 70% ethanol and centrifuged at 14 000 × g for 15 min. The supernatant was removed, the pellet air dried and then resuspended in 0.2 volumes of nuclease-free water. DNA quality was checked on a NanoPhotometer NP80 spectrophotometer (Implen). DNA in each sample was quantified using the Qubit broad-range dsDNA assay kit on a Qubit 3.0 fluorometer (Invitrogen). The molecular weight of the DNA was checked on a 1% agarose gel stained with GelGreen (Biotium, BT41005), loaded with 6X purple loading dye (NEB), using the 1 kb plus DNA ladder (New England Biolabs, N3200S). High molecular weight samples were retained for subsequent steps. DNA for the microbial mock community was purchased as a ready-to-use DNA mixture (Zymo Research, ZymoBIOMICS Microbial Community DNA Standard, D6306).
4.5. Library preparation and nanopore sequencing
Library preparation was done using the Cas9-based enrichment ligation sequencing kit (Oxford Nanopore Technologies, SQK-CS9109) following the manufacturer’s protocol. In total, twelve libraries were prepared and sequenced independently (electronic supplementary material, data S2). Libraries were loaded onto R9.4.1 flow cells (Oxford Nanopore Technologies, FLO-MIN106D) and sequencing was performed on a MinION Mk1B device using MinKNOW version 22.12.7. For applications that require the use of R10.4.1 flow cells (Oxford Nanopore Technologies, FLO-MIN114) for higher accuracy reads, we have included an adapted protocol (electronic supplementary material, note 1) that uses widely available reagents with the standard Ligation Sequencing Kit V14 (Oxford Nanopore Technologies, SQK-LSK114) to prepare compatible libraries.
To investigate the performance of the crRNA probes and the level of enrichment that can be achieved using this approach, the method was applied to a single nematode species, C. elegans. A single species was used in initial tests due to ease of mapping sequencing reads to a single genome, simplifying the analysis of enrichment efficiency and target region coverage. Nematode crRNA probes (electronic supplementary material, table S1; figure 2) were used in library preparation. Three separate libraries were prepared (Samples 1 a–c) with input gDNA quantities between 1400 ng and 2500 ng (electronic supplementary material, data S2), and these were sequenced independently for 20 hours.
Next, the method was applied to DNA samples containing a mixture of three nematode species (Samples 2 a–c; electronic supplementary material, data S2), C. elegans, H. bacteriophora and S. feltiae, to assess whether this method can identify multiple species in metagenomic samples. Again, nematode crRNA probes (electronic supplementary material, table S1) were used in library preparation. Using the same approach, this method was then applied to a mock community containing DNA from six nematode species (Samples 3 a–c; electronic supplementary material, data S2): C. elegans, H. bacteriophora, S. feltiae, C. remanei, C. drosophilae and P. redivivus. Total starting DNA mass ranged from 510 to 2917 ng for each mock nematode community sample (electronic supplementary material, data S2).
Finally, to assess whether this method could be used to quantify diverse taxonomic groups in a single library preparation step, eight crRNA probes targeting both bacteria and yeast genomes (electronic supplementary material, table S1) were applied to a microbial mock community containing DNA from ten microbial species (ZymoBIOMICS Community DNA Standard, D6306). These eight crRNA probes were added in equimolar concentrations during library preparation as per the manufacturer’s standard protocol.
4.6. General bioinformatics
Base calling was performed using Guppy version 6.3.8. Base-called reads in FASTQ format were then filtered in Geneious Prime version 2024.0 using a Q score threshold of 10. Analysis of base-called reads was done using EPI2ME version 4.1.3 running with Docker Desktop version 4.17.0.
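A quality filter of this kind can be sketched outside Geneious as shown below. The sketch assumes that reads are kept when their mean Q score is at least 10, that the mean is taken over per-base error probabilities rather than over raw Phred values, and that qualities use the standard Phred+33 encoding. Function names are hypothetical, and the filter applied by Geneious Prime may differ in detail.

```python
import math


def mean_qscore(quality_string, phred_offset=33):
    """Mean read quality, averaged over per-base error probabilities."""
    probs = [10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_fastq(lines, min_q=10.0):
    """Keep four-line FASTQ records whose mean Q score is >= min_q.

    `lines` is any iterable of text lines; blank lines are ignored.
    The >= comparison is an assumption about the threshold direction.
    """
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    kept = []
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if qual and mean_qscore(qual) >= min_q:
            kept.append((header, seq, plus, qual))
    return kept


# Two hypothetical reads: one high quality (Q40), one low quality (Q2).
records = ["@read1", "ACGT", "+", "IIII", "@read2", "ACGT", "+", "####"]
for header, seq, plus, qual in filter_fastq(records):
    print(header, round(mean_qscore(qual), 1))
```

Averaging error probabilities penalizes reads with a few very poor bases more strongly than averaging Phred values directly, which is why the choice is stated explicitly here.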
4.7. Cas9-based enrichment EPI2ME workflow
To analyse enrichment performance on a single genome, nanopore reads from the single species experiments (Samples 1 a–c) were analysed using the wf-cas9 workflow version 0.1.9 [78]. The full genomic sequence from C. elegans WBcel235 was used as the reference sequence (RefSeq GCF_000002985.6) [71,79] to map nanopore reads to. The ‘target region’, used for analysing on-target versus off-target reads, was defined as the region between the two innermost crRNA probes on either side of the region of interest (genomic coordinates chrI:15063241−15067815).
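The on-target versus off-target classification can be illustrated with a short sketch. It assumes that a read counts as on-target if its alignment overlaps the target interval by at least one base and that coordinates are 1-based and inclusive; the wf-cas9 workflow may apply a different rule. The function names and example alignments are hypothetical.

```python
# Target region as defined in the text (chromosome, start, end).
TARGET = ("chrI", 15063241, 15067815)


def is_on_target(chrom, start, end, target=TARGET):
    """True if an alignment overlaps the target interval by >= 1 base."""
    t_chrom, t_start, t_end = target
    return chrom == t_chrom and start <= t_end and end >= t_start


def on_target_fraction(alignments, target=TARGET):
    """Fraction of (chrom, start, end) alignments overlapping the target."""
    if not alignments:
        return 0.0
    hits = sum(is_on_target(*aln, target=target) for aln in alignments)
    return hits / len(alignments)


# Hypothetical alignments: two overlap the target, two do not.
alignments = [
    ("chrI", 15063000, 15064000),
    ("chrI", 15070000, 15071000),
    ("chrII", 15063241, 15067815),
    ("chrI", 15067815, 15068000),
]
print(on_target_fraction(alignments))
```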
4.8. Metagenomics EPI2ME workflow
To assess the ability of our method to detect multiple species from metagenomic samples, the EPI2ME wf-metagenomics v2.11.0 workflow was used [78]. The workflow was run using the minimap2 sub-workflow to allow for taxonomic classification of reads by mapping reads against a reference database. For the EPI2ME Labs workflows, the reference can be chosen from a selection of databases covering archaeal, bacterial and fungal data. Minimum match between read and reference was set to 98%.
For our analysis of the microbial data, the NCBI targeted loci database including 16S, 18S, 28S and ITS sequences from archaea, bacteria and fungi, was selected. For analysis of metagenomic nematode data, custom reference files were created because the default EPI2ME databases do not include eukaryotic sequences, aside from fungi. We created a custom sequence database and mapping file for use with the wf-metagenomics workflow. This made use of the reference list from 18SNemaBase (18S_NemaBase.fasta) [80] and generated the ref2taxid_18S_NemaBase.txt file that maps the species of each reference to a valid taxid in NCBI. This is possible by using the taxid2name.txt file from taxonomy exports from the NCBI (often referred to as the NCBI taxdump). Any references not present in the NCBI mapping were given the taxid 28384 (‘other sequence’). For this database to be used in the wf-metagenomics workflow, it was necessary to select the minimap2 aligner, using the 18S_NemaBase.fasta file as the reference library and ref2taxid_18S_NemaBase.txt as the ref2taxid mapping option. Scripts used to generate the database and mapping are available at: https://github.com/BiocomputeLab/18SNemaBase-EPI2ME.
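The mapping step can be sketched as follows. This is not the authors' script (which is available at the repository above) but a minimal illustration under stated assumptions: FASTA headers are assumed to have the hypothetical layout `>reference_id Genus species ...`, and the name file is assumed to hold tab-separated `taxid<TAB>name` lines. The actual layouts of 18S_NemaBase.fasta and taxid2name.txt may differ.

```python
FALLBACK_TAXID = 28384  # NCBI 'other sequences', used for unmatched names


def parse_names(lines):
    """Build a lower-cased name -> taxid lookup from 'taxid<TAB>name' lines."""
    name_to_taxid = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        taxid, name = line.split("\t", 1)
        name_to_taxid[name.strip().lower()] = int(taxid)
    return name_to_taxid


def build_ref2taxid(fasta_lines, name_to_taxid):
    """Return (reference_id, taxid) pairs for every FASTA header line.

    The species name is taken as the two words following the reference
    identifier, which is an assumption about the header layout.
    """
    rows = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        parts = line[1:].strip().split()
        if not parts:
            continue
        ref_id = parts[0]
        species = " ".join(parts[1:3]).lower()
        rows.append((ref_id, name_to_taxid.get(species, FALLBACK_TAXID)))
    return rows


# Hypothetical inputs for illustration only.
names = parse_names(["6239\tCaenorhabditis elegans"])
fasta = [">ref1 Caenorhabditis elegans 18S", "ACGT", ">ref2 Unknownus speciesus"]
for ref_id, taxid in build_ref2taxid(fasta, names):
    print(f"{ref_id}\t{taxid}")
```

Writing each pair as a tab-separated line gives a two-column file of the kind passed to the ref2taxid option; the exact column format expected by the workflow should be checked against its documentation.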
4.9. Alignment EPI2ME workflow
The EPI2ME wf-alignment workflow version 0.3.3 [78] was used for quantitative analysis of microbial metagenomic sequences. Full length target sequences of each species in the mock community were extracted from genomic sequences (electronic supplementary material, data S1) and provided to the workflow as references to map nanopore reads against. Coverage of each reference sequence after mapping was used as the relative abundance data.
4.10. Statistical analysis
Lin’s concordance correlation coefficient ($\rho_c$) was calculated using RStudio version 4.4.1 to assess how well the observed proportions agree with the true proportions of each species in the microbial mock community. Data from three replicates were combined into one dataset, and a single coefficient was calculated for the combined quantitative data.
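For readers unfamiliar with the statistic, the sketch below computes Lin's concordance correlation coefficient as $\rho_c = 2 s_{xy} / (s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2)$, using population (1/n) moments. It is written in Python for illustration only; the analysis in this study was run in R, and R packages may use a slightly different variance convention. The example proportions are hypothetical.

```python
def lins_ccc(x, y):
    """Lin's concordance correlation coefficient for paired observations.

    Uses population (1/n) variances and covariance. Unlike Pearson's r,
    the value drops when the two series differ in location or scale.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain >= 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Hypothetical true versus observed proportions for four species.
true_props = [0.40, 0.30, 0.20, 0.10]
observed_props = [0.38, 0.32, 0.19, 0.11]
print(round(lins_ccc(true_props, observed_props), 4))
```

A value of 1 indicates perfect agreement with the line of identity, so a constant offset between observed and true proportions lowers the coefficient even when the two are perfectly correlated.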
4.11. PCR amplification model
For the PCR amplification model, we assumed that the reaction started with an identical concentration of products ‘A’ and ‘B’ (ratio = 1) and that after each cycle the concentration of a product was given by

$$C_{n+1} = C_n \,(1 + E), \tag{4.1}$$

where $C_n$ is the concentration of a specific product at cycle $n$ and $E$ is the amplification efficiency of that product. We assume the same starting concentration $C_0$ for each product, set the efficiency to $E_A$ for product A and to $E_B = 0.9$ (90% efficiency) for product B, and then sequentially calculate the concentrations at later cycles by applying equation (4.1).
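The behaviour of such a model can be explored with the sketch below. It assumes the common per-cycle form in which concentration is multiplied by (1 + efficiency) at each cycle. The 90% efficiency for product B comes from the text, whereas the 100% efficiency for product A, the starting concentration of 1 and the 30 cycles are illustrative assumptions.

```python
def pcr_concentrations(c0, efficiency, cycles):
    """Concentration after each cycle, assuming C[n+1] = C[n] * (1 + E)."""
    conc = [c0]
    for _ in range(cycles):
        conc.append(conc[-1] * (1.0 + efficiency))
    return conc


def pcr_ratio(eff_a, eff_b, cycles, c0=1.0):
    """Ratio of product A to product B at every cycle, from equal starts."""
    a = pcr_concentrations(c0, eff_a, cycles)
    b = pcr_concentrations(c0, eff_b, cycles)
    return [x / y for x, y in zip(a, b)]


# Assumed efficiencies: 1.0 for A (illustrative) and 0.9 for B (from text).
ratios = pcr_ratio(eff_a=1.0, eff_b=0.9, cycles=30)
for cycle in (0, 10, 20, 30):
    print(cycle, round(ratios[cycle], 3))
```

Because the ratio grows geometrically as ((1 + E_A)/(1 + E_B)) raised to the cycle number, even a small efficiency difference compounds into a large distortion of the starting 1:1 ratio.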
4.12. Cas9-based enrichment model
For the Cas9-based enrichment model, we make use of the following time-varying relationship for an enriched product:

$$P(t) = P_0 \left(1 - e^{-kt}\right), \tag{4.2}$$

where $P(t)$ is the concentration of enriched product at time $t$, $P_0$ is the starting concentration of the product, and $k$ is the rate of enrichment (i.e. cutting efficiency of the crRNA for the given product). We assume that we start with identical concentrations of each product ‘A’ and ‘B’ such that $P_0$ is the same in both cases, and choose an enrichment rate of $k_A$ for A and $k_B$ for B. Ratios of the products can then be directly calculated from equation (4.2) for varying time points.
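A saturating model of this kind can be sketched as below. The sketch assumes first-order kinetics, in which the enriched product approaches the starting concentration as P0 * (1 - exp(-k * t)). The rates of 1.0 and 0.9 per unit time, the starting concentration of 1 and the chosen time points are illustrative assumptions rather than values from this study.

```python
import math


def enriched_product(p0, rate, t):
    """Enriched product at time t, assuming P(t) = P0 * (1 - exp(-k t))."""
    return p0 * (1.0 - math.exp(-rate * t))


def enrichment_ratio(rate_a, rate_b, t, p0=1.0):
    """Ratio of enriched product A to B at time t > 0, from equal starts."""
    if t <= 0:
        raise ValueError("t must be positive")
    return enriched_product(p0, rate_a, t) / enriched_product(p0, rate_b, t)


# Assumed rates for products A and B (arbitrary time units).
for t in (0.01, 1.0, 5.0, 50.0):
    print(t, round(enrichment_ratio(1.0, 0.9, t), 4))
```

Under these assumptions the ratio starts close to the ratio of the two rates and decays towards 1 as both cleavage reactions run to completion, in contrast to the compounding divergence of an exponential amplification model.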
Ethics
This work did not require ethical approval from a human subject or animal welfare committee.
Data accessibility
Electronic supplementary material, data S1 contains the accession numbers for genomic sequences used for crRNA design and as reference sequences for mapping reads against. Electronic supplementary material, data S2 provides information on the content and composition of each library that was used for sequencing. Electronic supplementary material, data S3 outlines results for quantitative analysis workflows. Basecalled nanopore sequencing data and a snapshot of the custom 18S NemaBase reference database for EPI2ME that is stored in GitHub [81] have been archived within the Zenodo repository [82].
Supplementary material is available online [83].
Declaration of AI use
We have not used AI-assisted technologies in creating this article.
Authors’ contributions
L.N.-R.: conceptualization, investigation, methodology, writing—original draft, writing—review and editing; C.C.: conceptualization, funding acquisition, methodology, supervision, writing—review and editing; R.C.: conceptualization, funding acquisition, methodology; T.E.G.: conceptualization, funding acquisition, methodology, supervision, writing—review and editing.
All authors gave final approval for publication and agreed to be held accountable for the work performed therein.
Conflict of interest declaration
We declare we have no competing interests.
Funding
This work was funded by a Liv Sidse Jansen Memorial Foundation grant (to T.E.G., C.C. and R.C.). In addition, T.E.G. was supported by a Royal Society University Research Fellowship grant URF\R\221008, a Turing Fellowship from the Alan Turing Institute under EPSRC grant EP/N510129/1, and the UKRI Engineering Biology Mission Award CYBER under BBSRC grant BB/Y007638/1. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. Some nematode strains were provided by the Caenorhabditis Genetics Centre, which is funded by the NIH Office of Research Infrastructure Programs (P40 OD010440).
References
- 1. Tarnowski MJ, Varliero G, Scown J, Phelps E, Gorochowski TE. 2023. Soil as a transdisciplinary research catalyst: from bioprospecting to biorespecting. R. Soc. Open Sci. 10, 230963. (doi:10.1098/rsos.230963)
- 2. Besson M, Alison J, Bjerge K, Gorochowski TE, Høye TT, Jucker T, Mann HMR, Clements CF. 2022. Towards the fully automated monitoring of ecological communities. Ecol. Lett. 25, 2753–2775. (doi:10.1111/ele.14123)
- 3. Ferris H, Bongers T, de Goede RGM. 2001. A framework for soil food web diagnostics: extension of the nematode faunal analysis concept. Appl. Soil Ecol. 18, 13–29. (doi:10.1016/s0929-1393(01)00152-4)
- 4. Ferris H. 2010. Form and function: metabolic footprints of nematodes in the soil food web. Eur. J. Soil Biol. 46, 97–104. (doi:10.1016/j.ejsobi.2010.01.003)
- 5. Coomans A. 2002. Present status and future of nematode systematics. Nematology 4, 573–582. (doi:10.1163/15685410260438836)
- 6. Hebert PDN, Cywinska A, Ball SL, deWaard JR. 2003. Biological identifications through DNA barcodes. Proc. R. Soc. B 270, 313–321. (doi:10.1098/rspb.2002.2218)
- 7. Yu DW, Ji Y, Emerson BC, Wang X, Ye C, Yang C, Ding Z. 2012. Biodiversity soup: metabarcoding of arthropods for rapid biodiversity assessment and biomonitoring. Methods Ecol. Evol. 3, 613–623. (doi:10.1111/j.2041-210x.2012.00198.x)
- 8. Kuntke F, de Jonge N, Hesselsøe M, Lund Nielsen J. 2020. Stream water quality assessment by metabarcoding of invertebrates. Ecol. Indic. 111, 105982. (doi:10.1016/j.ecolind.2019.105982)
- 9. Holman LE, de Bruyn M, Creer S, Carvalho G, Robidart J, Rius M. 2019. Detection of introduced and resident marine species using environmental DNA metabarcoding of sediment and water. Sci. Rep. 9, 11559. (doi:10.1038/s41598-019-47899-7)
- 10. Kawanobe M, Toyota K, Ritz K. 2021. Development and application of a DNA metabarcoding method for comprehensive analysis of soil nematode communities. Appl. Soil Ecol. 166, 103974. (doi:10.1016/j.apsoil.2021.103974)
- 11. Op De Beeck M, Lievens B, Busschaert P, Declerck S, Vangronsveld J, Colpaert JV. 2014. Comparison and validation of some ITS primer pairs useful for fungal metabarcoding studies. PLoS One 9, e97629. (doi:10.1371/journal.pone.0097629)
- 12. Schmidt PA, Bálint M, Greshake B, Bandow C, Römbke J, Schmitt I. 2013. Illumina metabarcoding of a soil fungal community. Soil Biol. Biochem. 65, 128–132. (doi:10.1016/j.soilbio.2013.05.014)
- 13. Taberlet P, Coissac E, Pompanon F, Brochmann C, Willerslev E. 2012. Towards next‐generation biodiversity assessment using DNA metabarcoding. Mol. Ecol. 21, 2045–2050. (doi:10.1111/j.1365-294x.2012.05470.x)
- 14. Deiner K, et al. 2017. Environmental DNA metabarcoding: transforming how we survey animal and plant communities. Mol. Ecol. 26, 5872–5895. (doi:10.1111/mec.14350)
- 15. Kirse A, Bourlat SJ, Langen K, Fonseca VG. 2021. Unearthing the potential of soil eDNA metabarcoding—towards best practice advice for invertebrate biodiversity assessment. Front. Ecol. Evol. 9, 630560. (doi:10.3389/fevo.2021.630560)
- 16. Stat M, Huggett MJ, Bernasconi R, DiBattista JD, Berry TE, Newman SJ, Harvey ES, Bunce M. 2017. Ecosystem biomonitoring with eDNA: metabarcoding across the tree of life in a tropical marine environment. Sci. Rep. 7, 12240. (doi:10.1038/s41598-017-12501-5)
- 17. Srivathsan A, Ang A, Vogler AP, Meier R. 2016. Fecal metagenomics for the simultaneous assessment of diet, parasites, and population genetics of an understudied primate. Front. Zool. 13, 17. (doi:10.1186/s12983-016-0150-4)
- 18. Bonato L, Peretti E, Sandionigi A, Bortolin F. 2021. The diet of major predators of forest soils: a first analysis on syntopic species of Chilopoda through DNA metabarcoding. Soil Biol. Biochem. 158, 108264. (doi:10.1016/j.soilbio.2021.108264)
- 19. Pompanon F, Deagle BE, Symondson WOC, Brown DS, Jarman SN, Taberlet P. 2012. Who is eating what: diet assessment using next generation sequencing. Mol. Ecol. 21, 1931–1950. (doi:10.1111/j.1365-294x.2011.05403.x)
- 20. Thomsen PF, Willerslev E. 2015. Environmental DNA—an emerging tool in conservation for monitoring past and present biodiversity. Biol. Conserv. 183, 4–18. (doi:10.1016/j.biocon.2014.11.019)
- 21. Polz MF, Cavanaugh CM. 1998. Bias in template-to-product ratios in multitemplate PCR. Appl. Environ. Microbiol. 64, 3724–3730. (doi:10.1128/aem.64.10.3724-3730.1998)
- 22. McDowell DG, Burns NA, Parkes HC. 1998. Localised sequence regions possessing high melting temperatures prevent the amplification of a DNA mimic in competitive PCR. Nucleic Acids Res. 26, 3340–3347. (doi:10.1093/nar/26.14.3340)
- 23. Kanagawa T. 2003. Bias and artifacts in multitemplate polymerase chain reactions (PCR). J. Biosci. Bioeng. 96, 317–323. (doi:10.1263/jbb.96.317)
- 24. Pinto AJ, Raskin L. 2012. PCR biases distort bacterial and archaeal community structure in pyrosequencing datasets. PLoS One 7, e43093. (doi:10.1371/journal.pone.0043093)
- 25. Suzuki MT, Giovannoni SJ. 1996. Bias caused by template annealing in the amplification of mixtures of 16S rRNA genes by PCR. Appl. Environ. Microbiol. 62, 625–630. (doi:10.1128/aem.62.2.625-630.1996)
- 26. Fitch DH, Bugaj-Gaweda B, Emmons SW. 1995. 18S ribosomal RNA gene phylogeny for some Rhabditidae related to Caenorhabditis. Mol. Biol. Evol. 12, 346–358. (doi:10.1093/oxfordjournals.molbev.a040207)
- 27. Horton DJ, Kershner MW, Blackwood CB. 2017. Suitability of PCR primers for characterizing invertebrate communities from soil and leaf litter targeting metazoan 18S ribosomal or cytochrome oxidase I (COI) genes. Eur. J. Soil Biol. 80, 43–48. (doi:10.1016/j.ejsobi.2017.04.003)
- 28. Sharma P, Kobayashi T. 2014. Are ‘universal’ DNA primers really universal? J. Appl. Genet. 55, 485–496. (doi:10.1007/s13353-014-0218-9)
- 29. Derycke S, Vanaverbeke J, Rigaux A, Backeljau T, Moens T. 2010. Exploring the use of cytochrome oxidase c subunit 1 (COI) for DNA barcoding of free-living marine nematodes. PLoS One 5, e13716. (doi:10.1371/journal.pone.0013716)
- 30. Blankenship LE, Yayanos AA. 2005. Universal primers and PCR of gut contents to study marine invertebrate diets. Mol. Ecol. 14, 891–899. (doi:10.1111/j.1365-294x.2005.02448.x)
- 31. Heeger F, Bourne EC, Baschien C, Yurkov A, Bunk B, Spröer C, Overmann J, Mazzoni CJ, Monaghan MT. 2018. Long‐read DNA metabarcoding of ribosomal RNA in the analysis of fungi from aquatic environments. Mol. Ecol. Resour. 18, 1500–1514. (doi:10.1111/1755-0998.12937)
- 32. Fang C, et al. 2023. High-resolution single-molecule long-fragment rRNA gene amplicon sequencing of bacterial and eukaryotic microbial communities. Cell Rep. Methods 3, 100437. (doi:10.1016/j.crmeth.2023.100437)
- 33. Franzén O, Hu J, Bao X, Itzkowitz SH, Peter I, Bashir A. 2015. Improved OTU-picking using long-read 16S rRNA gene amplicon sequencing and generic hierarchical clustering. Microbiome 3, 14. (doi:10.1186/s40168-015-0105-6)
- 34. Benítez-Páez A, Portune KJ, Sanz Y. 2016. Species-level resolution of 16S rRNA gene amplicons sequenced through the MinION™ portable nanopore sequencer. Gigascience 5, 4. (doi:10.1186/s13742-016-0111-z)
- 35. Benítez-Páez A, Sanz Y. 2017. Multi-locus and long amplicon sequencing approach to study microbial diversity at species level using the MinION™ portable nanopore sequencer. Gigascience 6, gix043. (doi:10.1093/gigascience/gix043)
- 36. Menegon M, et al. 2017. On site DNA barcoding by nanopore sequencing. PLoS One 12, e0184741. (doi:10.1371/journal.pone.0184741)
- 37. Knot IE, Zouganelis GD, Weedall GD, Wich SA, Rae R. 2020. DNA barcoding of nematodes using the MinION. Front. Ecol. Evol. 8, 100. (doi:10.3389/fevo.2020.00100)
- 38. Urban L, et al. 2021. Freshwater monitoring by nanopore sequencing. eLife 10, e61504. (doi:10.7554/eLife.61504)
- 39. Sellers GS, Jeffares DC, Lawson B, Prior T, Lunt DH. 2021. Identification of individual root-knot nematodes using low coverage long-read sequencing. PLoS One 16, e0253248. (doi:10.1371/journal.pone.0253248)
- 40. Mojica FJM, Díez‐Villaseñor C, Soria E, Juez G. 2000. Biological significance of a family of regularly spaced repeats in the genomes of Archaea, Bacteria and mitochondria. Mol. Microbiol. 36, 244–246. (doi:10.1046/j.1365-2958.2000.01838.x)
- 41. Mojica FJM, Díez-Villaseñor C, García-Martínez J, Soria E. 2005. Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J. Mol. Evol. 60, 174–182. (doi:10.1007/s00239-004-0046-3)
- 42. Adli M. 2018. The CRISPR tool kit for genome editing and beyond. Nat. Commun. 9, 1911. (doi:10.1038/s41467-018-04252-2)
- 43. Sander JD, Joung JK. 2014. CRISPR-Cas systems for editing, regulating and targeting genomes. Nat. Biotechnol. 32, 347–355. ( 10.1038/nbt.2842) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Loose M. 2018. Finding the needle: targeted nanopore sequencing and CRISPR-Cas9. CRISPR J. 1, 265–267. ( 10.1089/crispr.2018.29028.mlo) [DOI] [PubMed] [Google Scholar]
- 45. Bennett-Baker PE, Mueller JL. 2017. CRISPR-mediated isolation of specific megabase segments of genomic DNA. Nucleic Acids Res. 45, e165–e165. ( 10.1093/nar/gkx749) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Gilpatrick T, et al. 2020. Targeted nanopore sequencing with Cas9-guided adapter ligation. Nat. Biotechnol. 38, 433–438. ( 10.1038/s41587-020-0407-5) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Meaker GA, Hair EJ, Gorochowski TE. 2020. Advances in engineering CRISPR-Cas9 as a molecular Swiss army knife. Synth. Biol. 5, a021. ( 10.1093/synbio/ysaa021) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Fiol A, Jurado-Ruiz F, López‑Girona E, Aranzana MJ. 2022. An efficient CRISPR-Cas9 enrichment sequencing strategy for characterizing complex and highly duplicated genomic regions. A case study in the Prunus salicina LG3-MYB10 genes cluster. Plant Methods 18, 105. (10.1186/s13007-022-00937-4)
- 49. Hafford-Tear NJ, et al. 2019. CRISPR/Cas9-targeted enrichment and long-read sequencing of the Fuchs endothelial corneal dystrophy–associated TCF4 triplet repeat. Genet. Med. 21, 2092–2102. (10.1038/s41436-019-0453-x)
- 50. Giesselmann P, et al. 2019. Analysis of short tandem repeat expansions and their methylation state with nanopore sequencing. Nat. Biotechnol. 37, 1478–1481. (10.1038/s41587-019-0293-x)
- 51. Ramón‐Laca A, Gallego R, Nichols KM. 2023. Affordable de novo generation of fish mitogenomes using amplification‐free enrichment of mitochondrial DNA and deep sequencing of long fragments. Mol. Ecol. Resour. 23, 818–832. (10.1111/1755-0998.13758)
- 52. López-Girona E, Davy MW, Albert NW, Hilario E, Smart MEM, Kirk C, Thomson SJ, Chagné D. 2020. CRISPR-Cas9 enrichment and long read sequencing for fine mapping in plants. Plant Methods 16, 6. (10.1186/s13007-020-00661-x)
- 53. Littleford‐Colquhoun B, Kartzinel TR. 2024. A CRISPR‐based strategy for targeted sequencing in biodiversity science. Mol. Ecol. Resour. 24, e13920. (10.1111/1755-0998.13920)
- 54. Sulston JE, Brenner S. 1974. The DNA of Caenorhabditis elegans. Genetics 77, 95–104. (10.1093/genetics/77.1.95)
- 55. Bik HM, Fournier D, Sung W, Bergeron RD, Thomas WK. 2013. Intra-genomic variation in the ribosomal repeats of nematodes. PLoS One 8, e78230. (10.1371/journal.pone.0078230)
- 56. Darby BJ, Todd TC, Herman MA. 2013. High‐throughput amplicon sequencing of rRNA genes requires a copy number correction to accurately reflect the effects of management practices on soil nematode community structure. Mol. Ecol. 22, 5456–5471. (10.1111/mec.12480)
- 57. Lin LIK. 1989. A concordance correlation coefficient to evaluate reproducibility. Biometrics 45, 255. (10.2307/2532051)
- 58. Petrone JR, Rios Glusberger P, George CD, Milletich PL, Ahrens AP, Roesch LFW, Triplett EW. 2023. RESCUE: a validated nanopore pipeline to classify bacteria through long-read, 16S-ITS-23S rRNA sequencing. Front. Microbiol. 14, 1201064. (10.3389/fmicb.2023.1201064)
- 59. Gilpatrick T, Wang JZ, Weiss D, Norris AL, Eshleman J, Timp W. 2023. IVT generation of guideRNAs for Cas9-enrichment nanopore sequencing. bioRxiv. (10.1101/2023.02.07.527484)
- 60. Ni Y, Liu X, Simeneh ZM, Yang M, Li R. 2023. Benchmarking of Nanopore R10.4 and R9.4.1 flow cells in single-cell whole-genome amplification and whole-genome shotgun sequencing. Comput. Struct. Biotechnol. J. 21, 2352–2364. (10.1016/j.csbj.2023.03.038)
- 61. Bongers T, Ferris H. 1999. Nematode community structure as a bioindicator in environmental monitoring. Trends Ecol. Evol. 14, 224–228. (10.1016/S0169-5347(98)01583-3)
- 62. Mulder C, Schouten AJ, Hund-Rinke K, Breure AM. 2005. The use of nematodes in ecological soil classification and assessment concepts. Ecotoxicol. Environ. Saf. 62, 278–289. (10.1016/j.ecoenv.2005.03.028)
- 63. Fettweis JM, Serrano MG, Sheth NU, Mayer CM, Glascock AL, Brooks JP, Jefferson KK, Buck GA. 2012. Species-level classification of the vaginal microbiome. BMC Genom. 13, S17. (10.1186/1471-2164-13-S8-S17)
- 64. Gupta VK, Kim M, Bakshi U, Cunningham KY, Davis JM III, Lazaridis KN, Nelson H, Chia N, Sung J. 2020. A predictive index for health status using species-level gut microbiome profiling. Nat. Commun. 11, 4635. (10.1038/s41467-020-18476-8)
- 65. Hall GA, Speed TP, Woodruff CJ. 2020. Strain-level sample characterisation using long reads and mapq scores. bioRxiv. (10.1101/2020.10.18.344739)
- 66. Jinek M, et al. 2014. Structures of Cas9 endonucleases reveal RNA-mediated conformational activation. Science 343, 1247997. (10.1126/science.1247997)
- 67. Baranova SV, Zhdanova PV, Lomzov AA, Koval VV, Chernonosov AA. 2022. Structure- and content-dependent efficiency of Cas9-assisted DNA cleavage in genome-editing systems. Int. J. Mol. Sci. 23, 13889. (10.3390/ijms232213889)
- 68. Green SJ, Venkatramanan R, Naqib A. 2015. Deconstructing the polymerase chain reaction: understanding and correcting bias associated with primer degeneracies and primer-template mismatches. PLoS One 10, e0128122. (10.1371/journal.pone.0128122)
- 69. Wagner AO, Praeg N, Reitschuler C, Illmer P. 2015. Effect of DNA extraction procedure, repeated extraction and ethidium monoazide (EMA)/propidium monoazide (PMA) treatment on overall DNA yield and impact on microbial fingerprints for bacteria, fungi and archaea in a reference soil. Appl. Soil Ecol. 93, 56–64. (10.1016/j.apsoil.2015.04.005)
- 70. Stiernagle T. 2006. Maintenance of C. elegans. WormBook 1–11. (10.1895/wormbook.1.101.1)
- 71. C. elegans Sequencing Consortium. 1998. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012–2018. (10.1126/science.282.5396.2012)
- 72. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. 2002. The human genome browser at UCSC. Genome Res. 12, 996–1006. (10.1101/gr.229102)
- 73. Doench JG, et al. 2016. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol. 34, 184–191. (10.1038/nbt.3437)
- 74. Concordet JP, Haeussler M. 2018. CRISPOR: intuitive guide selection for CRISPR/Cas9 genome editing experiments and screens. Nucleic Acids Res. 46, W242–W245. (10.1093/nar/gky354)
- 75. Porazinska DL, et al. 2009. Evaluating high‐throughput sequencing as a method for metagenomic analysis of nematode diversity. Mol. Ecol. Resour. 9, 1439–1450. (10.1111/j.1755-0998.2009.02611.x)
- 76. Větrovský T, Baldrian P. 2013. The variability of the 16S rRNA gene in bacterial genomes and its consequences for bacterial community analyses. PLoS One 8, e57923. (10.1371/journal.pone.0057923)
- 77. Schoch CL, et al. 2012. Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for fungi. Proc. Natl. Acad. Sci. USA 109, 6241–6246. (10.1073/pnas.1117018109)
- 78. Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. 2020. The nf-core framework for community-curated bioinformatics pipelines. Nat. Biotechnol. 38, 276–278. (10.1038/s41587-020-0439-x)
- 79. Sayers EW, et al. 2023. Database resources of the National Center for Biotechnology Information in 2023. Nucleic Acids Res. 51, D29–D38. (10.1093/nar/gkac1032)
- 80. Gattoni K, Gendron EMS, Sandoval-Ruiz R, Borgemeier A, McQueen JP, Shepherd RM, Slos D, Powers TO, Porazinska DL. 2023. 18S-nemabase: curated 18S rRNA database of nematode sequences. J. Nematol. 55, 20230006. (10.2478/jofnem-2023-0006)
- 81. Gorochowski T. 2025. Exploring the computational architecture of biological systems. See https://github.com/BiocomputeLab/18SNemaBase-EPI2ME.
- 82. Nikolaeva-Reynolds L, Cammies C, Crichton R, Gorochowski T. 2025. Cas9-based enrichment for targeted long-read metabarcoding. Zenodo. (10.5281/zenodo.14250758)
- 83. Nikolaeva-Reynolds L, Cammies C, Crichton R, Gorochowski T. 2025. Supplementary material from: Cas9-based enrichment for targeted long-read metabarcoding. Figshare. (10.6084/m9.figshare.c.7742683)
Associated Data
Data Availability Statement
Electronic supplementary material, data S1 contains the accession numbers of the genomic sequences used for crRNA design and as reference sequences for read mapping. Electronic supplementary material, data S2 describes the content and composition of each library that was sequenced. Electronic supplementary material, data S3 reports the results of the quantitative analysis workflows. Basecalled nanopore sequencing data and a snapshot of the custom 18S NemaBase reference database for EPI2ME, which is stored on GitHub [81], have been archived in the Zenodo repository [82].
Supplementary material is available online [83].





