Abstract
Probe-based nucleic acid enrichment is an effective way to enhance the detection capacity of next-generation sequencing (NGS) for a set of clinically diverse and relevant microbial species. In this study, we assessed the effect of enrichment-based sequencing on the identification of respiratory infections, using tiling RNA probes targeting 76 respiratory pathogens and sequencing on both Illumina and Oxford Nanopore platforms. Forty respiratory swab samples pre-tested for a panel of respiratory pathogens by qPCR were used to benchmark the sequencing data. We observed a general improvement in sensitivity after enrichment. With Illumina sequencing, the overall detection rate increased from 73% to 85% after probe capture. Moreover, enrichment with probe sets boosted the frequency of unique pathogen reads by 34.6-fold and 37.8-fold for Illumina DNA and cDNA sequencing, respectively. This also resulted in significant improvements in genome coverage, especially for viruses. Despite these advantages, we found that library pooling may cause read misassignment, probably due to crosstalk arising from post-capture PCR and from pooled sequencing, thus increasing the risk of bleed-through signal. Taken together, enrichment-based sequencing achieves an overall improvement in the breadth and depth of pathogen coverage. For future applications, automated library processing and pooling-free sequencing could enhance the precision and timeliness of probe enrichment-based clinical metagenomics.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-024-75120-x.
Keywords: mNGS, Illumina, Nanopore, Respiratory pathogens, Virus
Subject terms: Biological techniques, Microbiology
Introduction
The techniques available for identifying and characterizing pathogens have evolved greatly over the last 50 years, from cell/bacterial culture as the original “gold-standard”, to molecular techniques such as PCR which allowed same-day results to guide treatment, to finally, the latest next-generation sequencing techniques1,2. Metagenomic next generation sequencing (mNGS) has shown potential to transform diagnostic testing by creating a shotgun method that detects and characterizes all potential pathogens present in a sample. mNGS operates without the need for prior knowledge about the potential pathogens present and can withstand sequence variations that might lead to the failure of specific molecular tests3.
Despite these advantages, the ultra-low ratio of pathogen to host nucleic acid means that the signal is inundated by non-informative reads, which restricts the sensitivity of mNGS. Novel techniques are therefore needed that enhance sensitivity while maintaining comprehensive pathogen identification. Indeed, great effort has been made to increase the pathogen sequence yield by introducing sample pretreatment procedures such as low-speed centrifugation or filtration, nuclease treatment to remove unprotected host nucleic acid, or virus concentration by density gradient ultracentrifugation4,5. Each of these procedures may introduce bias against certain pathogens because of their varying physicochemical features. Targeted sequence capture, a well-validated technique for enriching specific nucleic acid sequences, offers an alternative strategy for selectively isolating pathogen-derived nucleic acids within a metagenomic sample6. This technique has been used extensively to assess potential pathogens causing meningitis and encephalitis7 and other respiratory viruses8.
This study aimed to develop a comprehensive sequence capture panel for the targeted enrichment and identification of a broad spectrum of respiratory pathogens. By leveraging our previous clinical metagenomic sequencing protocol9, we developed a respiratory pathogen probe-enrichment-based method to enhance the capture of diverse respiratory microbial taxa. This approach utilizes agent-specific probes to selectively capture the template of interest prior to Illumina and Oxford Nanopore sequencing. We further assessed the performance of this approach in comparison with mNGS without enrichment. This work demonstrates a pathway toward more sensitive and reliable metagenomic sequencing workflows to better suit clinical needs.
Results
Design of the mNGS workflow
A probe-enrichment-based method to enhance nucleic acid capture of diverse respiratory microbial taxa was developed (Fig. 1). Sample lysis was performed using a chaotropic salt-based buffer in combination with bead beating, followed by magnetic bead-based semi-automatic nucleic acid extraction as described in our previous report9. Part of the extracted total nucleic acids (TNAs) was used for pre-testing with a qPCR panel of respiratory pathogens, and the rest was used for library preparation. Illumina sequencing libraries starting from DNA or RNA were generated and sequenced directly on a NovaSeq, which was defined as standard metagenomic sequencing (sMS). The rest of the libraries were subjected to in-solution capture enrichment using commercially available probe sets: biotinylated tiling RNA probes (120 nt) that cover conserved regions of the major genera of respiratory pathogens (Human coronaviruses, Adenoviruses, Human bocaviruses, Polyomavirus, Parechoviruses, Rhinoviruses, Enteroviruses, Parainfluenza virus, Metapneumoviruses, Respiratory syncytial virus, Influenza A/B viruses, Cytomegalovirus, Coxsackie viruses, respiratory bacteria species and chlamydia species; for a complete list, see Table S1). The enriched libraries were analyzed by Illumina or Nanopore-based sequencing, defined as enriched metagenomic sequencing (eMS). The metagenomic sequencing data were then taxonomically classified by database searching against NCBI nucleotide sequences (NT database) or the curated RVDB viral sequence database.
Fig. 1.
Schematic of the mNGS assay workflow. Total nucleic acids (TNAs) were extracted by bead beating and guanidinium isothiocyanate-based lysis. Part of the TNAs was used for the qPCR panel test, and the rest was split into two aliquots for subsequent DNA and cDNA Illumina library preparation. The libraries were either analysed directly on a NovaSeq (sMS) or enriched by probes and further analysed by Illumina or Nanopore-based sequencing (eMS). Separate DNA and cDNA libraries were constructed and sequenced in this study.
Taxonomic identification provided by sMS and eMS compared to qPCR panel
A total of 40 clinical nasopharyngeal swab samples were used in this study. All samples were pre-tested by a qPCR panel targeting 31 respiratory pathogens, as shown in Table S1, with our reported primers10,11. Of these 31 pathogens, 29 (all except Acinetobacter baumannii and Group A streptococcus) were among the 76 probe-targeted respiratory pathogens. The samples were processed by our sMS-Illumina workflow and further subjected to probe-capture enrichment followed by Illumina and Nanopore sequencing (eMS-Illumina, eMS-Nanopore). Among the 40 samples, 28 tested pathogen-positive by the qPCR panel with Ct values < 38, and the other 12 samples, which gave no detectable signal, were used as negative controls. Table 1 and Table S2 list the results generated by the molecular methods. For pathogens identified in multiple samples, only one record is shown in Table 1; the complete taxonomic identifications are shown in Table S2. Among the 28 qPCR-positive samples, 21 were positive for at least 2 pathogens. Thus, a total of 55 positive hits were detected by the qPCR panel in 28 samples, including 30 bacterial hits and 25 viral hits (2 DNA viruses and 23 RNA viruses) (Fig. 2A and B, Table S2).
Table 1.
Taxonomic identification provided by sMS and eMS compared to qPCR panel.
Sample ID | qPCR panel identification | qPCR Ct | Taxonomic identification by NGS | Pathogen type | DNA sequencing RPM, sMS by Illumina | DNA sequencing RPM, eMS by Illumina | DNA sequencing RPM, eMS by Nanopore | cDNA sequencing RPM, sMS by Illumina | cDNA sequencing RPM, eMS by Illumina | cDNA sequencing RPM, eMS by Nanopore | Fold increase, DNA sequencing (eMS/sMS by Illumina) | Fold increase, cDNA sequencing (eMS/sMS by Illumina) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Influenza B virus | 34.79 | Influenza B virus | RNA virus | 0 | 0 | 0 | 0 | 140 | 9 | - | > 140a | |
1 | Moraxella catarrhalis | 31.46 | Moraxella catarrhalis BBH18 | bacteria | 1,804 | 69,344 | 28,292 | 1,865 | 30,940 | 15,610 | 38.45 | 16.59 | |
3 | Haemophilus influenzae | 31.90 | Haemophilus influenzae Rd KW20 | bacteria | 464 | 24,231 | 22,098 | 258 | 10,046 | 27,511 | 52.21 | 38.86 | |
3 | Streptococcus pneumoniae | 37.87 | Streptococcus pneumoniae R6 | bacteria | 94 | 5,509 | 31,863 | 159 | 8,051 | 40,847 | 58.63 | 50.66 | |
6 | Influenza A virus H3 | 33.87 | Influenza A virus (A/New York/392/2004(H3N2)) | RNA virus | 0 | 0 | 0 | 0 | 4 | 0 | - | > 4a | |
12 | Acinetobacter baumannii | 36.37 | / | bacteria | 0 | 0 | 0 | 0 | 0 | 0 | - | - | |
14 | Human rhinovirus | 29.59 | Human rhinovirus C | RNA virus | 0 | 0 | 0 | 0 | 374 | 23 | - | > 374a | |
14 | Staphylococcus aureus | 37.50 | Staphylococcus aureus subsp. aureus NCTC 8325 | bacteria | 158 | 15,635 | 27,725 | 1,021 | 33,974 | 46,127 | 98.88 | 33.27 | |
21 | Respiratory syncytial virus | 27.85 | Human_Respiratory_syncytial_virus_9320 | RNA virus | 0 | 0 | 0 | 216 | 11,616 | 1,212 | - | 53.77 | |
22 | Klebsiella pneumoniae | 34.05 | Klebsiella pneumoniae subsp. pneumoniae HS11286 | bacteria | 9 | 54 | 177 | 15 | 120 | 43 | 5.73 | 7.90 | |
24 | Group A streptococcus | 34.84 | / | bacteria | 0 | 0 | 0 | 0 | 0 | 0 | - | - | |
26 | Parainfluenza 2 | 32.06 | Human rubulavirus 2 b | RNA virus | 0 | 0 | 0 | 432 | 129,005 | 19,744 | - | 298.50 | |
31 | Metapneumovirus | 29.56 | Human metapneumovirus | RNA virus | 0 | 0 | 0 | 3 | 4 | 0 | - | 1.35 | |
32 | Parainfluenza 1 | 31.45 | Human parainfluenza virus 1 | RNA virus | 0 | 0 | 38 | 139 | 27,441 | 9,854 | - | 196.97 | |
33 | Human coronavirus 229 e | 29.12 | Human coronavirus 229E | RNA virus | 383 | 425 | 0 | 345 | 511 | 0 | 1.11 | 1.48 | |
34 | Respiratory adenovirus | 26.39 | Human adenovirus B1 | DNA virus | 4,422 | 623,329 | 131,182 | 2,720 | 538,143 | 12,564 | 140.95 | 197.82 |
a Fold increase could not be calculated because the number of pre-capture sMS reads was 0.
b Human rubulavirus 2 and Parainfluenza 2 are other names for Human orthorubulavirus 2 as annotated by the NCBI Taxonomy Browser.
Fig. 2.
Taxonomic identification provided by sMS and eMS compared to qPCR panel. A and B, Venn diagrams of pathogen hits identified by mNGS compared to the qPCR panel. Of the 30 qPCR-positive bacterial hits, 27 were detected by sMS-Illumina, eMS-Illumina or eMS-Nanopore. Of the 25 qPCR-positive virus hits, 13 were detected by sMS-Illumina, and another 7 and 6 were further identified by eMS-Illumina and eMS-Nanopore, respectively.
Of the 30 qPCR-positive bacterial hits, 27 were identified by sMS-Illumina, with reads per million ranging from 7 to 273,846. The 3 hits undetected by sMS-Illumina were Acinetobacter baumannii in samples #11 and #12 and Group A streptococcus in sample #24. The qPCR Ct values of these bacteria were 34.32, 36.37 and 34.84, respectively, close to the detection limit (Ct ≤ 38). The eMS-Illumina and eMS-Nanopore methods also failed to detect these three bacteria, which is expected because Acinetobacter baumannii and Group A streptococcus were not in the probe-targeting enrichment list (Table S1). Both of the 2 qPCR-positive DNA virus hits were also detected by the sMS-Illumina, eMS-Illumina and eMS-Nanopore methods. Of the 23 qPCR-positive RNA virus hits, 10 were successfully detected by the sMS-Illumina method, and another 7 (4 hits for Influenza B virus, 1 hit for Influenza A H3, 2 hits for Human rhinovirus) and 6 (4 hits for Influenza B virus, 1 hit for Human rhinovirus, 1 hit for Respiratory syncytial virus) were further identified by eMS-Illumina and eMS-Nanopore, respectively, 5 of which overlapped (Table S2, Fig. 2B). These data indicate that enrichment-based sequencing detects bacteria about as well as standard metagenomic sequencing and can enhance sensitivity in detecting viruses.
Not only did our sequencing methods show high concordance with qPCR assays, but they also enabled characterization of the pathogens in greater detail. First, several bacteria and viruses could be identified at the subspecies level, as shown in Table 1 and Table S2, including Haemophilus influenzae Rd KW20, Klebsiella pneumoniae subsp. pneumoniae HS11286, Moraxella catarrhalis BBH18, Staphylococcus aureus subsp. aureus NCTC 8325, Streptococcus pneumoniae R6 and Human adenovirus B1. Second, in addition to the pathogens detected by the qPCR panel, unbiased mNGS can detect additional agents without pre-existing knowledge of the pathogens in a sample. Neisseria meningitidis MC58, which is not on the qPCR panel, was detected in several samples by sMS-Illumina (Table S4), including samples #19 and #20, which were negative for all candidates in the qPCR panel. Neisseria meningitidis MC58 was also detected by eMS-Illumina in these samples with increased read depth; the average eMS to sMS RPM (reads per million) fold increase was 24.38. To verify the sequencing results, the presence of Neisseria meningitidis MC58 was subsequently confirmed with a targeted PCR assay (Figure S1A). The qPCR Ct values of these samples were negatively correlated with the log10 RPM of Neisseria meningitidis MC58 detected by either the sMS-Illumina or the eMS-Illumina strategy (Figure S1B and C). These findings demonstrate the capacity of mNGS to discover multiple coinfections in clinical samples in an unbiased manner, a key advantage over qPCR.
Enrichment NGS aids in the detection of diverse pathogens
To evaluate the enrichment efficiency, probe capture was performed on the libraries of the 40 enrolled patient samples, which were sequenced by Illumina both before and after capture. To compare the enrichment of pathogen content across sequencing runs, we used a normalized count, reads per million (RPM), to correct for differences in sequencing depth. Comparing eMS with sMS, we observed consistent and substantial increases in the RPM obtained for each pathogen genome after probe-based enrichment (Table 1 and Table S2). Capture with probe sets enriched unique pathogen RPM by 34.6-fold and 37.8-fold on average for DNA and cDNA sequencing, respectively. The average numbers of sequence reads obtained by eMS-Illumina were 0.67-fold and 1.03-fold those of sMS-Illumina for DNA and cDNA libraries, respectively (Figure S2). This demonstrates that probe capture significantly enhances the sensitivity of detecting respiratory pathogens, with minimal contribution from sequencing depth. To further assess the influence of sequencing depth on the measured enrichment effect, the DNA or cDNA sequencing data of each sample were randomly subsampled to the smaller of the read numbers obtained with and without probe enrichment. After subsampling, pathogen reads captured with probe sets were enriched 35.2-fold and 37.0-fold on average for DNA and cDNA sequencing, respectively (Table S3), very close to the results from the RPM data. This supports the reliability of the RPM data for evaluating the probe enrichment effect. Reads plots (RPM) of representative pathogens before and after probe enrichment are shown in Fig. 3.
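The normalization and fold-enrichment arithmetic described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names and the list-of-read-labels representation used for subsampling are assumptions introduced here, and only the Table 1 RPM values in the usage line come from the data.

```python
import random


def reads_per_million(pathogen_reads: int, total_reads: int) -> float:
    """Normalize a pathogen read count to reads per million (RPM)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return pathogen_reads * 1_000_000 / total_reads


def fold_enrichment(rpm_sms: float, rpm_ems: float):
    """Return the eMS/sMS RPM ratio, or None when no pre-capture reads exist."""
    if rpm_sms == 0:
        return None  # undefined, as in footnote a of Table 1
    return rpm_ems / rpm_sms


def subsample_to_common_depth(reads_sms, reads_ems, seed=0):
    """Randomly subsample both read lists to the smaller of the two depths.

    Each list holds one (hypothetical) classification label per read.
    """
    depth = min(len(reads_sms), len(reads_ems))
    rng = random.Random(seed)
    return rng.sample(reads_sms, depth), rng.sample(reads_ems, depth)


# RPM of Moraxella catarrhalis in sample 1, DNA library (Table 1).
moraxella_fold = fold_enrichment(1804, 69344)
print(round(moraxella_fold, 2))
```

The ratio computed from the rounded RPM values agrees with the tabulated fold increase to within rounding. After subsampling, the same fold-enrichment calculation can be applied to raw pathogen read counts, because both libraries then share one depth.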
Fig. 3.
Reads plot of representative pathogens detected by sMS- or eMS-Illumina sequencing. The reads per million (RPM) of pathogen sequences obtained without and with the pathogen nucleic acid capture probes are plotted. *p < 0.05; **p < 0.01; ***p < 0.001 (paired t-test using GraphPad Prism 5 software).
Probe-based enrichment sequencing improved our ability to detect pathogens. As mentioned above, although qPCR found 55 hits among the 40 samples, 15 of them (3 bacteria and 12 RNA viruses) were not detected by standard metagenomic sequencing (Table S2). With probe capture, an additional 7 pathogen hits in 7 samples tested positive by eMS-Illumina, including Human rhinovirus (samples #14, #22), Influenza B virus (samples #1, #5, #7, #28) and Influenza A virus H3 (sample #6), whereas no virus reads were obtained without capture (Table 1 and Table S2). Relative to the qPCR panel, the detection rate by Illumina sequencing increased from 73% (40/55) to 85% (47/55) after probe capture. For Streptococcus pneumoniae (detected in 13 clinical samples: #2, #3, #4, #5, #6, #7, #13, #14, #21, #24, #25, #30 and #31, Table S2), capture considerably increased the number of reads detected in all 13 samples, with an average RPM fold change of 19.7 (Table S2 and Fig. 3A). Klebsiella pneumoniae and Haemophilus influenzae were also detected in several samples, and their average RPM increased 9.8-fold and 23.6-fold, respectively (Table S2, Fig. 3B, C). Influenza B virus was positive in six samples by qPCR, but sMS detected it in only three samples. eMS enabled detection of Influenza B virus in the other four samples, and the average RPM before and after probe enrichment was 15 and 1414, respectively (Table S2 and Fig. 3D). These data demonstrate that probe-enrichment-based sequencing improves the sensitivity and detection rate of pathogens.
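As a worked check of the detection rates quoted above, the following short sketch recomputes them from the hit counts reported in the text. The variable and function names are illustrative assumptions, not part of the study's analysis code.

```python
def detection_rate(detected_hits: int, reference_hits: int) -> float:
    """Percentage of reference (qPCR-positive) hits that sequencing detected."""
    if reference_hits <= 0:
        raise ValueError("reference_hits must be positive")
    return 100.0 * detected_hits / reference_hits


QPCR_HITS = 55                 # 30 bacterial + 25 viral hits by the qPCR panel
SMS_DETECTED = 27 + 13         # bacterial + viral hits found by sMS-Illumina
EMS_DETECTED = SMS_DETECTED + 7  # 7 further viral hits after probe capture

sms_rate = detection_rate(SMS_DETECTED, QPCR_HITS)
ems_rate = detection_rate(EMS_DETECTED, QPCR_HITS)
print(f"sMS-Illumina: {sms_rate:.0f}%  eMS-Illumina: {ems_rate:.0f}%")
```

Here the qPCR panel is treated as the reference, so hits found only by sequencing (such as Neisseria meningitidis) do not enter the rate.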
Enrichment MS aids in higher coverage of virus genomes
Probe capture enrichment sequencing can in turn improve genome assembly of detected pathogens, especially viruses. After probe-based enrichment, several detected viruses had higher sequencing coverage than with sMS-Illumina (Table 2). Sequencing coverage of Respiratory syncytial virus (sample #21), Human rubulavirus 2 (sample #26) and Human parainfluenza virus 1 (sample #32) was 56.59%, 21.29% and 55.49% by sMS-Illumina and improved to 80.66%, 99.98% and 99.10%, respectively, by eMS-Illumina (Fig. 4A, B and C; Table 2). The average coverage depth of Respiratory syncytial virus, Human rubulavirus 2 and Human parainfluenza virus 1 in samples #21, #26 and #32 was 2.03, 0.54 and 1.94 by sMS-Illumina and rose to 189.90, 244.4 and 448.10 by eMS-Illumina. Further, the sequencing coverage by eMS-Illumina for Influenza B virus (samples #7, #26 and #33) and Human adenovirus B1 (sample #34) was above 99% (Table 2). This high level of enrichment and coverage facilitates assembly of genome sequences. These results demonstrate that probe-based enrichment of respiratory pathogen genomes could be used to improve viral detection and genome assembly in an outbreak.
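For readers unfamiliar with these metrics, the sketch below shows one common way to derive coverage breadth, average depth and maximum depth from a per-base depth vector. It assumes that a position counts as covered when at least one read maps to it; the threshold used in the study is not stated, and the depth values here are a made-up toy example rather than data from Table 2.

```python
def coverage_summary(depths):
    """Return (breadth in %, average depth, maximum depth) for a reference.

    `depths` holds the read depth at each reference position. A position
    is counted as covered when its depth is at least 1 (an assumption).
    """
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d > 0)
    breadth = 100.0 * covered / len(depths)
    average_depth = sum(depths) / len(depths)
    return breadth, average_depth, max(depths)


# Toy 10-position reference, for illustration only.
toy_depths = [0, 0, 3, 5, 2, 0, 1, 4, 0, 5]
breadth, average_depth, max_depth = coverage_summary(toy_depths)
print(breadth, average_depth, max_depth)
```

Averaging over all reference positions, including uncovered ones, is what allows an average depth below 1, as seen for some sMS-Illumina entries.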
Table 2.
eMS aids in higher coverage of virus genomes.
Sample ID | Taxonomic identification by NGS | Library type | sMS-Illumina: no. of mapped reads | sMS-Illumina: cov. (%) | sMS-Illumina: ave. cov. depth | sMS-Illumina: max. cov. depth | eMS-Illumina: no. of mapped reads | eMS-Illumina: cov. (%) | eMS-Illumina: ave. cov. depth | eMS-Illumina: max. cov. depth
---|---|---|---|---|---|---|---|---|---|---
7 | Influenza B virus | RNA | 0 | 0 | 0.00 | 0 | 214 | 91.09 | 11.4 | 40
21 | Respiratory syncytial virus | RNA | 282 | 56.59 | 2.03 | 13 | 21,552 | 80.66 | 189.9 | 1,884
26 | Human rubulavirus 2 | RNA | 84 | 21.29 | 0.54 | 10 | 28,828 | 99.98 | 244.4 | 1,084
26 | Influenza B virus | RNA | 12 | 39.82 | 0.66 | 3 | 1,632 | 100.0 | 90.2 | 323
28 | Influenza B virus | RNA | 0 | 0 | 0.00 | 0 | 196 | 62.92 | 9.3 | 45
32 | Human parainfluenza virus 1 | RNA | 286 | 55.49 | 1.94 | 18 | 53,101 | 99.10 | 448.1 | 2,401
33 | Influenza B virus | RNA | 4 | 8.15 | 0.18 | 4 | 864 | 100.0 | 48.4 | 179
34 | Human adenovirus B1 | DNA | 22,004 | 99.62 | 85.40 | 161 | 3,450,404 | 99.62 | 14,420.7 | 132,407
Abbreviations in Table 2: cov.: sequencing coverage.
Fig. 4.
Performance of different sequencing workflows on selected respiratory pathogenic viruses. (A, B and C) Genomic coverage plot of Respiratory syncytial virus, Human rubulavirus 2 and Human parainfluenza virus 1, respectively.
Comparison of eMS-Illumina and eMS-Nanopore
We also compared the performance of the enriched sequencing data generated by Nanopore and Illumina. For DNA libraries, all the bacteria and DNA viruses detected by eMS-Illumina were also detected by eMS-Nanopore, with reads per million ranging from 93 to 419,650 (Table S2). There was no significant difference between the RPM data of eMS-Illumina and eMS-Nanopore in detecting bacteria (p = 0.22). Among DNA viruses, only Human adenovirus B1 was detected, in samples #25 and #34. Both eMS-Illumina and eMS-Nanopore exhibited high read frequencies for this virus (Table S2). For cDNA libraries, 11 of the 17 RNA virus hits detected by eMS-Illumina were also detected by eMS-Nanopore. Overall, these data suggest that eMS-Illumina and eMS-Nanopore have similar capacity to detect bacteria and viruses.
Pooling of barcoded libraries before probe-capturing increases false-positivity
Despite the successful enhancement in detected agents, we also noticed some irregularities in the eMS results. In this study, samples #21 to #40 were analyzed in the same experimental batch. Six to seven Illumina DNA or cDNA libraries were pooled into one probe-capture hybridization and further amplified by PCR before sequencing. Human adenovirus B1 was detected in samples #25 and #34 by qPCR with Ct values of 30.54 and 26.39, and was also detected by sMS-Illumina with 78 and 22,101 total mapped reads, respectively (Table S5). However, the eMS-Illumina-based methods detected reads from Human adenovirus B1 in samples #21–24, #30–33, #37 and #40 (2 to 10,163 total reads), all of which were PCR negative. We reasoned that these reads most likely resulted from the pooled amplification step after enrichment, because samples #30, #37 and #40 share index N508 with sample #34, samples #21 to #24 share index N704 with sample #25, and samples #31 to #33 share index N706 with sample #34 (for the detailed index arrangement, see Table S5). In comparison, in the sMS-Illumina data (without enrichment) only sample #30, which shares index N508 with sample #34, showed adenovirus reads (2 reads), suggesting a low level of signal crosstalk at the Illumina sequencing stage. With additional cycles of amplification after probe enrichment, the impact of signal bleed-through within a multiplexed enrichment and sequencing run becomes much more significant.
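The index-sharing reasoning above lends itself to a simple automated check. The sketch below flags pathogen-negative samples that share at least one index with a strongly positive sample in the same pool. It is an illustration under stated assumptions: the dictionary records only the shared indices named in the text (the full dual-index layout is in Table S5 and is not reproduced here), and the function name is hypothetical.

```python
# Indices named in the text for the Human adenovirus B1 example.
# Each sample's second index is not given in the text and is omitted.
KNOWN_INDICES = {
    25: {"N704"},
    34: {"N508", "N706"},
    21: {"N704"}, 22: {"N704"}, 23: {"N704"}, 24: {"N704"},
    30: {"N508"}, 37: {"N508"}, 40: {"N508"},
    31: {"N706"}, 32: {"N706"}, 33: {"N706"},
}


def flag_possible_crosstalk(indices, positive_samples):
    """Map each non-positive sample to the positive samples it shares an index with."""
    flagged = {}
    for sample, sample_indices in indices.items():
        if sample in positive_samples:
            continue
        sources = sorted(
            p for p in positive_samples
            if sample_indices & indices.get(p, set())
        )
        if sources:
            flagged[sample] = sources
    return flagged


# Samples #25 and #34 were adenovirus positive by qPCR.
flagged = flag_possible_crosstalk(KNOWN_INDICES, {25, 34})
for sample in sorted(flagged):
    print(f"sample #{sample}: shares an index with {flagged[sample]}")
```

A check of this kind only marks reads as suspect. Confirming or rejecting a flagged hit would still require an orthogonal test such as the qPCR results used here.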
Discussion
At present, various forms of molecular diagnosis mostly based on the polymerase chain reaction are still the primary assays in the clinic for detection of infectious agents12. PCR typically maintains an edge in terms of sensitivity, cost-effectiveness, and ease of use. Nonetheless, false negative results can arise due to the restricted range of commonly employed PCR techniques in detecting variations of established pathogens. In addition, unknown or previously unrecognized disease-relevant pathogens cannot be assayed due to unavailability of specific assays. By contrast, metagenomic sequencing is a relatively unbiased method. Sequencing reads are classified based on similarity to reference genomes and targets or assembled de novo, which are largely unconstrained by pre-existing knowledge of the pathogens. mNGS surpasses PCR in several aspects, including independence from specific primer and template sequences, the ability to detect a wide range of agents, the potential to acquire complete viral genome sequences, and the accuracy in identifying new species or strains13,14. Beyond pathogen identification, mNGS provides valuable insights into clinically significant genomic characteristics such as taxonomic classification, drug resistance patterns, and virulence factors15.
In this study, 40 clinical nasopharyngeal swab samples pre-tested by a panel of qPCR assays were used to assess our standard mNGS workflow and our enrichment-based sequencing workflow. Our data demonstrate that the standard mNGS method exhibits good concordance with qPCR assays, especially for bacteria: 27 of 30 bacterial hits (90%) were detected by sMS-Illumina, whereas its capacity to identify small-genome viruses was lower, with 13 of 25 viral hits (52%) detected. The small genome size of viruses and/or low levels of virus in the sample make sMS detection of viruses more challenging. In addition, mNGS enabled characterization of the pathogens in greater detail. Several bacteria and viruses were identified at the subspecies level. Metagenomic sequencing also demonstrated the capacity to reveal multiple coinfections in clinical samples and to identify, in an unbiased manner, pathogens not on the qPCR panel. Furthermore, our sequencing data revealed the presence of Neisseria meningitidis with high read counts in some of the respiratory samples. Although this bacterium is considered to be part of the normal flora, substantial evidence has been obtained regarding its pathogenic role in the respiratory tract, particularly when a viral co-infection is ongoing16,17.
The integration of capture-based sequencing assays significantly enhances the effectiveness of high-throughput sequencing in detecting pathogens18. In this approach, metagenomic libraries are hybridized with capture 'bait' probes. Our study demonstrates that employing probe sets for capture improves the detection and retrieval of pathogen genomes, particularly viruses, in clinical samples, while accurately maintaining the original complexity of the target samples. eMS by Illumina showed sensitivity similar to that of targeted real-time PCR (85% concordance with the qPCR panel). Capture with these probe sets enriched unique pathogen reads by 34.6-fold and 37.8-fold on average for DNA and RNA, respectively. This also led to dramatic improvements in genome coverage, especially for viruses. Probe capture increased the pathogen content in the sequencing data and made metagenomic sequencing accessible on smaller-capacity platforms such as the MinION (Oxford Nanopore). A comparison of the data generated by the two sequencing methods showed similar capacity to detect bacteria and viruses.
Although enrichment sequencing has great potential, its application faces practical limitations. Probe-based capture introduces bias favoring specific microorganisms, resulting in a difficult compromise between assay sensitivity and impartiality. The decision to use probe capture should thus depend on the clinical scenario. Unbiased metagenomics will be preferred when a patient presents with a rare manifestation and an uncommon exposure history, while enrichment sequencing will be desirable for common symptoms with no definite etiology. Enrichment sequencing also entails additional procedures, higher expenses, and lengthy hybridization times (lasting from 2 to 16 h) because of the extra processing required for maximal efficiency. Notably, because the amplification process involves multiple cycles, computational analyses must accurately account for duplicate reads. Furthermore, confounding factors such as contamination during sample handling can lead to erroneous identification and raise the false-positive rate.
False-positive results may also be caused by irregularities during post-capture PCR, as 6–7 samples were pooled, enriched and amplified together. Additionally, false positives may arise from "index hopping" during Illumina sequencing. Index hopping is the primary reason for inaccuracies in assigning sequencing reads to the correct samples within multiplexed pooled libraries. As pooling barcoded DNA from multiple samples in a single lane of a high-throughput sequencer becomes standard in metagenomic next-generation sequencing (mNGS) experiments, this multiplexing approach can lead to misassignment of some demultiplexed sequencing reads. The key factor contributing to index hopping has been identified as the presence of free-floating indexing primers that bind to pooled DNA fragments just before the bridge amplification step within patterned sequencing flow cells19. In this study, post-enrichment amplification of pooled libraries appeared to worsen this effect and should therefore be avoided in future work. Avoiding sample pooling in both the library preparation and sequencing steps, by using a relatively low-output, real-time sequencing device for individual specimens, could be a good solution in the clinical setting.
In summary, probe-enrichment based eMS shows promise as an enhanced diagnostic approach. This method exhibits sensitivity comparable to targeted real-time PCR, offering the additional advantage of detecting viral variants that would evade specific PCR assays. Furthermore, it holds the potential to deliver the complete viral genome sequence, facilitating the evaluation of viral diversity and evolution within the context of epidemiological and public health applications. To further unleash its power, additional efforts should be made in streamlining its library preparation process, minimizing cross-contamination in sample handling, hybridization and sequencing.
Methods
Ethics statement
The ethics committee of Shanghai Public Health Clinical Center approved this study. All research protocols involving humans were in accordance with the guidelines of the Declaration of Helsinki, 1964. Informed consent was obtained from all enrolled patients.
Sample collection and study subjects
This is a retrospective study based on a respiratory pathogen surveillance program. From 2019 to 2020, throat swab samples were collected from sentinel clinics in Shanghai and tested for 31 respiratory pathogens using qPCR (Table S1) with our reported primers10,11. All steps of the nucleic acid extraction and qPCR test were conducted in parallel with negative controls. The positive control for each assay was established by purifying and quantifying the DNA target for each set of primers. The qPCR panel was run against each target gene, with the human RNase P target as the internal control for sample quality. A total of 28 samples that covered the major infectious agents and 12 samples that tested negative for all these agents were included in this study. Samples #1 to #20 and samples #21 to #40 were analysed in two separate NGS batches, with 6 negative samples in each batch.
Nucleic acid extraction
Total nucleic acids (TNAs) were extracted using a semi-automatic method according to our previously described protocol9. Briefly, 400 µL of 1.5× guanidinium isothiocyanate (GITC) lysis buffer (6 M GITC, 75 mM Tris-HCl, pH 8.0, 3% sarkosyl, 30 mM EDTA) was added to 200 µL of sample. The lysates were homogenized with glass beads and subjected to magnetic bead-based TNA extraction. The whole extraction process takes approximately 1 h, with approximately 20 min of hands-on time. The purified TNAs were split into aliquots for subsequent DNA and cDNA library preparation.
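The working concentrations in the lysate follow from the 2:1 mixing ratio. The short calculation below makes them explicit, assuming that volumes are additive and that the sample itself contributes none of the buffer components. The names are illustrative.

```python
def final_concentration(stock_conc: float, stock_volume_ul: float,
                        sample_volume_ul: float) -> float:
    """Concentration after mixing a stock with a sample (additive volumes)."""
    return stock_conc * stock_volume_ul / (stock_volume_ul + sample_volume_ul)


# Composition of the 1.5x lysis buffer as given in the text.
LYSIS_BUFFER_1_5X = {
    "GITC (M)": 6.0,
    "Tris-HCl (mM)": 75.0,
    "sarkosyl (%)": 3.0,
    "EDTA (mM)": 30.0,
}

# 400 uL of buffer added to 200 uL of sample.
lysate = {
    name: final_concentration(conc, 400.0, 200.0)
    for name, conc in LYSIS_BUFFER_1_5X.items()
}
for name, conc in lysate.items():
    print(f"{name}: {conc:g}")
```

Under these assumptions the 1.5× stock is diluted to 1× in the lysate, that is, two thirds of each stock concentration.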
Standard metagenomic sequencing (sMS)-Illumina DNA library preparation
For DNA libraries, we used a Tn5 transposase-based tagmentation method (TruePrep DNA Library Prep Kit V2 for Illumina, TD503-02, Vazyme) followed by PCR (15 cycles, 98 °C 15 s, 60 °C 30 s, 72 °C 3 min) with indexed primers (TruePrep Index Kit V2, Vazyme) according to the manufacturer’s recommendation.
Standard metagenomic sequencing (sMS)-Illumina cDNA library preparation
The cDNA library was prepared based on the SHERRY method20. Briefly, TNAs were first reverse transcribed as follows. In a total volume of 10 µL, 4 µL of TNAs was mixed with 0.5 µL of N5 random primer (10 µM), 2 µL of 5× Maxima H Minus RT Buffer, 1 µL of MgSO4 (100 mM) and 0.25 µL of Recombinant RNase Inhibitor (40 U/µL, TaKaRa), denatured at 65 °C for 5 min and then immediately placed on ice. Then, 1 µL of dNTP mix (10 mM), 0.5 µL of Maxima H Minus reverse transcriptase (200 U/µL, Thermo Fisher), 0.25 µL of RNase Inhibitor and 0.5 µL of RNase-free water were added. Reverse transcription was carried out by incubating at 25 °C for 10 min and 50 °C for 30 min, followed by inactivation at 85 °C for 5 min.
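As a bookkeeping check on the reverse transcription mix, the sketch below sums the listed component volumes and derives the final concentrations of the components added from defined stocks. It assumes the stated 10 µL total refers to the complete reaction after the second addition. The dictionary keys are descriptive labels chosen here, not kit nomenclature, and the MgSO4 value counts only the added salt, not any magnesium already present in the RT buffer.

```python
# Component volumes in microliters, as listed in the text.
RT_MIX_UL = {
    "TNA template": 4.0,
    "N5 random primer (10 uM)": 0.5,
    "5x RT buffer": 2.0,
    "MgSO4 (100 mM)": 1.0,
    "RNase inhibitor, first addition": 0.25,
    "dNTP mix (10 mM)": 1.0,
    "reverse transcriptase": 0.5,
    "RNase inhibitor, second addition": 0.25,
    "RNase-free water": 0.5,
}

total_volume_ul = sum(RT_MIX_UL.values())


def final_conc(stock_conc: float, component: str) -> float:
    """Final concentration of a component, in the units of its stock."""
    return stock_conc * RT_MIX_UL[component] / total_volume_ul


primer_um = final_conc(10.0, "N5 random primer (10 uM)")
dntp_mm = final_conc(10.0, "dNTP mix (10 mM)")
added_mgso4_mm = final_conc(100.0, "MgSO4 (100 mM)")
buffer_x = final_conc(5.0, "5x RT buffer")

print(total_volume_ul, primer_um, dntp_mm, added_mgso4_mm, buffer_x)
```

The volumes sum to the stated 10 µL, which supports reading the two additions as parts of one reaction.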
Eight microliters of the RNA-DNA hybrid were then used for Tn5-mediated tagmentation using the TruePrep DNA Library Prep Kit V2 (TD503-02, Vazyme) with a modified buffer composition, in which 9% PEG and 0.85 mM ATP were supplemented to achieve tagmentation of the RNA strand at 55 °C for 30 min. Afterwards, 0.5 µL of Bst 3.0 DNA polymerase, 1 µL of TAE and a pair of N5 and N7 indexed primers were used to perform DNA strand extension at 72 °C for 15 min followed by 98 °C for 30 s. The library was further amplified for 15 cycles with the following program: 98 °C 15 s, 60 °C 30 s, 72 °C 3 min.
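As a consistency check on the reverse transcription mix described above, the component volumes can be summed and the final concentrations derived. This is a minimal sketch rather than part of the protocol. It assumes that the second 0.25 µL addition of RNase inhibitor uses the same 40 U/µL stock as the first, and all names in the code are illustrative.

```python
from collections import defaultdict

# (component, volume in uL, stock concentration, unit of the stock concentration)
RT_MIX = [
    ("TNAs", 4.0, None, None),
    ("N5 random primer", 0.5, 10.0, "uM"),
    ("Maxima H Minus RT buffer", 2.0, 5.0, "x"),
    ("MgSO4", 1.0, 100.0, "mM"),
    ("RNase inhibitor", 0.25, 40.0, "U/uL"),  # added before denaturation
    ("dNTP mix", 1.0, 10.0, "mM"),
    ("Maxima H Minus reverse transcriptase", 0.5, 200.0, "U/uL"),
    ("RNase inhibitor", 0.25, 40.0, "U/uL"),  # assumed to be the same stock
    ("RNase-free water", 0.5, None, None),
]

def total_volume(mix):
    """Sum of all component volumes in uL."""
    return sum(volume for _, volume, _, _ in mix)

def final_concentrations(mix):
    """Final concentration of each reagent, pooling repeated additions."""
    total = total_volume(mix)
    amounts = defaultdict(float)
    units = {}
    for name, volume, stock, unit in mix:
        if stock is None:  # template and water carry no stock concentration
            continue
        amounts[name] += volume * stock
        units[name] = unit
    return {name: (amount / total, units[name]) for name, amount in amounts.items()}

print(f"total volume: {total_volume(RT_MIX):g} uL")
for name, (conc, unit) in final_concentrations(RT_MIX).items():
    print(f"{name}: {conc:g} {unit}")
```

The listed volumes add up to the stated 10 µL reaction, which gives 1× buffer, 0.5 µM random primer, 10 mM MgSO4 and 1 mM dNTP mix in the final reaction.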
Standard metagenomic sequencing (sMS)-Illumina library cleanup, quantification, and sequencing
The amplified DNA and cDNA libraries were purified using VAHTS DNA Clean Beads (Vazyme) at a 1:1 ratio (v/v) and eluted in 20 µL of nuclease-free water. The purified products were quantified using the dsDNA HS Assay Kit on a Qubit 3.0 Fluorometer (Thermo Fisher). Sequencing of DNA and cDNA libraries was performed on a NovaSeq 6000 with a 2 × 150-bp paired-end sequencing protocol.
Enriched metagenomic sequencing (eMS)-Illumina sequencing
Commercially available probe sets targeting 76 respiratory pathogens (listed in Table S1) and enrichment kits (Dynasty Gene, Shanghai, China) were employed in this study. The RNA probes (120 nt) were designed according to the principles of high-density tiling and a GC boosting strategy, using the reference sequences of these 76 pathogens from the NCBI database. We performed in-solution hybridization and capture according to the manufacturer's instructions with modifications. Six or seven individual Illumina DNA or cDNA libraries (30 ng per sample), prepared as described above, were pooled for each hybridization, which was performed at 60 °C for 2–16 h. The hybridized DNA was captured on magnetic streptavidin beads, thoroughly washed, and then amplified with universal Illumina PCR primers (P7: 5’-CAAGCAGAAGACGGCATACGA-3’; P5: 5’-AATGATACGGCGACCACCGA-3’) for 15 cycles. The amplified products were purified with VAHTS DNA Clean Beads (Vazyme) and subsequently sequenced on a NovaSeq 6000 with a 2 × 150-bp paired-end sequencing protocol.
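Pooling a fixed mass of each library requires converting the Qubit concentration of every library into a pipetting volume. The following sketch illustrates that arithmetic only. The library names and concentrations are invented for the example and are not measurements from this study.

```python
# Sketch: volume of each library needed to contribute a fixed mass to a pool.
# The concentrations below are hypothetical example values, not study data.

def pooling_volumes(concentrations_ng_per_ul, mass_ng=30.0):
    """Return the volume (uL) of each library that contains mass_ng of DNA."""
    volumes = {}
    for name, conc in concentrations_ng_per_ul.items():
        if conc <= 0:
            raise ValueError(f"library {name!r} has a non-positive concentration")
        volumes[name] = mass_ng / conc
    return volumes

example_pool = {  # hypothetical Qubit readings in ng/uL
    "lib01": 15.0,
    "lib02": 7.5,
    "lib03": 30.0,
    "lib04": 12.0,
    "lib05": 10.0,
    "lib06": 20.0,
}

volumes = pooling_volumes(example_pool)
total_mass_ng = 30.0 * len(example_pool)

for name, vol in volumes.items():
    print(f"{name}: {vol:.2f} uL")
print(f"total DNA in hybridization: {total_mass_ng:g} ng")
```

With six or seven libraries at 30 ng each, a single hybridization therefore contains 180 to 210 ng of pooled library DNA.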
Enriched metagenomic sequencing (eMS)-nanopore sequencing
For eMS-Nanopore sequencing, the probe-enriched Illumina DNA or cDNA libraries were further processed with the SQK-LSK109 kit (Oxford Nanopore) combined with native barcoding (EXP-NBD114). Approximately 50 ng (about 200 fmol) of probe-enriched Illumina library was used for Nanopore library construction according to the manufacturer’s instructions. Each sample was labelled with a unique barcode provided in the EXP-NBD114 kit. Libraries were then sequenced on the MinION platform using R9 flow cells for up to 48 h.
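The stated equivalence between 50 ng and about 200 fmol depends on the mean fragment length of the library. The sketch below shows the conversion, assuming double-stranded DNA with an average mass of 660 g/mol per base pair (a common approximation, not a value given in this study). The function names are illustrative.

```python
# Sketch: converting between dsDNA mass and molar amount.
# Assumption: an average of 660 g/mol per base pair of double-stranded DNA.

AVG_BP_MASS = 660.0  # g/mol per bp, approximate

def dsdna_fmol(mass_ng, mean_length_bp):
    """Femtomoles of dsDNA fragments in mass_ng, given the mean length in bp."""
    # ng -> g is 1e-9 and mol -> fmol is 1e15, hence the net factor of 1e6
    return mass_ng * 1e6 / (mean_length_bp * AVG_BP_MASS)

def implied_length_bp(mass_ng, amount_fmol):
    """Mean fragment length (bp) implied by a mass and a molar amount."""
    return mass_ng * 1e6 / (amount_fmol * AVG_BP_MASS)

print(f"{implied_length_bp(50.0, 200.0):.0f} bp")
print(f"{dsdna_fmol(50.0, 380.0):.1f} fmol")
```

Under this approximation, 50 ng corresponds to 200 fmol when the mean library fragment length is roughly 380 bp.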
Data analysis
Data analysis largely followed our previous publication9. Identified RNA viruses were reported based on analysis of cDNA mNGS libraries, whereas DNA viruses, bacteria, fungi, and parasites were reported based on analysis of a DNA or cDNA library, depending on the abundance of the pathogen-mapped reads. For Illumina data, adapters and low-quality reads were first removed by fastp (v0.20.0)21. Human sequences were removed using bowtie2 (v2.3.5)22 and samtools (v1.9)23. The remaining reads were taxonomically classified by centrifuge (v1.0.4)24 against the viral genomes or NCBI nucleotide sequences (NT database, 98 GB). These reads were also mapped against the curated RVDB viral sequence database25 using bowtie2 (v2.3.5). For some specific species, reference sequences were manually downloaded from NCBI Genome and the sequencing reads were mapped against them. Genome alignments and coverage were visualized using Tablet (v19.09.03)26. For each sample, we collected the number of filtered reads, the number of reads aligned to the species-specific sequence, the number of mapped reads per million filtered reads, genome coverage (%) and coverage depth (average and maximum).
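The summary statistics listed above can be derived from read counts and a per-base depth profile. The following is a minimal sketch of those calculations, not the authors' pipeline. It assumes the depth profile is available as one integer per reference position (for example, parsed from a per-position depth table), and the function names are illustrative.

```python
# Sketch: mapping statistics from read counts and per-base depth values.
# Assumption: `depths` holds one depth value for every reference position,
# including positions with zero coverage.

def reads_per_million(mapped_reads, filtered_reads):
    """Mapped reads normalized to one million filtered reads."""
    if filtered_reads <= 0:
        raise ValueError("filtered_reads must be positive")
    return mapped_reads * 1_000_000 / filtered_reads

def coverage_stats(depths):
    """Genome coverage (%), average depth and maximum depth."""
    if not depths:
        raise ValueError("depth profile is empty")
    covered = sum(1 for d in depths if d > 0)
    return {
        "coverage_pct": 100 * covered / len(depths),
        "mean_depth": sum(depths) / len(depths),
        "max_depth": max(depths),
    }

# Toy example: 50 mapped reads out of 2 million filtered reads,
# and a five-position reference with two uncovered positions.
print(reads_per_million(50, 2_000_000))
print(coverage_stats([0, 2, 4, 0, 6]))
```

Normalizing to reads per million filtered reads makes libraries of different sequencing depth comparable, which is needed when comparing the same sample before and after enrichment.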
For Nanopore sequencing data, raw FAST5 files from the MinION instrument were basecalled by Guppy (v3.2.4). Basecalled FASTQ files were processed by filtlong (v0.2.0) to remove low-quality reads (keeping those with q > 7) and short reads (less than 100 bases). Nanopore reads were demultiplexed using Porechop (https://github.com/rrwick/Porechop) in two rounds. The first round used the default adapter sequences for the Nanopore SQK-LSK109 kit. In the second round, reads were demultiplexed by their Illumina adapter sequences, which were added by editing the adapters.py script in the software. The resulting reads were mapped against the curated RVDB viral sequence database using minimap2 (v2.17-r941). Mapped reads were exported to BAM files using samtools and visualized using Tablet.
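The length and quality thresholds applied to the Nanopore reads can be illustrated with a small FASTQ filter. This is a sketch of the filtering idea only and does not reproduce filtlong's internal scoring. It assumes Phred+33 encoding and defines the mean quality from the average per-base error probability, which is one of several possible conventions.

```python
import math

def mean_phred(quality_string, offset=33):
    """Mean read quality, computed by averaging per-base error probabilities."""
    if not quality_string:
        raise ValueError("empty quality string")
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))

def passes_filter(sequence, quality_string, min_q=7.0, min_len=100):
    """Keep reads of at least min_len bases with a mean quality above min_q."""
    return len(sequence) >= min_len and mean_phred(quality_string) > min_q

def filter_fastq(lines, min_q=7.0, min_len=100):
    """Yield (header, sequence, quality) for 4-line FASTQ records that pass."""
    records = [line.rstrip("\n") for line in lines]
    for i in range(0, len(records) - 3, 4):
        header, seq, _, qual = records[i:i + 4]
        if passes_filter(seq, qual, min_q, min_len):
            yield header, seq, qual

# Toy input: one long read at Q20 and one read that is too short.
example = [
    "@read1", "A" * 120, "+", "5" * 120,
    "@read2", "A" * 50, "+", "5" * 50,
]
kept = list(filter_fastq(example))
print([header for header, _, _ in kept])
```

Averaging error probabilities rather than raw Phred scores penalizes reads that contain stretches of very low-quality bases. A plain arithmetic mean of the scores would be a simpler alternative.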
Acknowledgements
The authors acknowledge funding received from the following sources: the 2020 Shanghai Science and Technology Innovation Action Plan Medical Innovation Research Special Project (20Y11911300), the Three-Year Initiative Plan for Strengthening Public Health System Construction in Shanghai (2023–2025) (GWVI-11.1-09, GWVI-2.2, GWVI-11.1-15), the Shanghai Hospital Development Center Foundation (SHDC12022121), the National Natural Science Foundation of China (grant no. 81801991), and the National Science and Technology Major Project of China (2017ZX10103009-001, 2018ZX10305409-001-005). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author contributions
X.Z., Z.Y. and L.Y. conceived the study. X.J., M.W., L.P., C.Y. and W.W. performed the experiments. W.W. collected the clinical samples. X.Z. performed the bioinformatic analysis of the sequencing data. X.J. and X.Z. drafted the manuscript, and all authors reviewed and approved the manuscript.
Data availability
The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive27 in the National Genomics Data Center28, China National Center for Bioinformation / Beijing Institute of Genomics, Chinese Academy of Sciences (GSA-Human: HRA006904) and are publicly accessible at https://ngdc.cncb.ac.cn/gsa-human/browse/HRA006904.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Xiaofang Jia and Wei Wang.
Contributor Information
Zhigang Yi, Email: zgyi@fudan.edu.cn.
Xiaonan Zhang, Email: Xiaonan.Zhang@canberra.edu.au.
References
- 1. Mitchell, A. B. & Glanville, A. R. Introduction to techniques and methodologies for characterizing the human respiratory virome. Methods Mol. Biol. 1838, 111–123 (2018).
- 2. Chiu, C. Y. & Miller, S. A. Clinical metagenomics. Nat. Rev. Genet. 20, 341–355 (2019).
- 3. Graf, E. H. et al. Unbiased detection of respiratory viruses by use of RNA sequencing-based metagenomics: A systematic comparison to a commercial PCR panel. J. Clin. Microbiol. 54, 1000–1007 (2016).
- 4. Goya, S. et al. An optimized methodology for whole genome sequencing of RNA respiratory viruses from nasopharyngeal aspirates. PLoS One 13, e0199714 (2018).
- 5. Li, L. et al. Comparing viral metagenomics methods using a highly multiplexed human viral pathogens reagent. J. Virol. Methods 213, 139–146 (2015).
- 6. Metsky, H. C. et al. Capturing sequence diversity in metagenomes with comprehensive and scalable probe design. Nat. Biotechnol. 37, 160–168 (2019).
- 7. Piantadosi, A. et al. Enhanced virus detection and metagenomic sequencing in patients with meningitis and encephalitis. mBio 12, e0114321 (2021).
- 8. O’Flaherty, B. M. et al. Comprehensive viral enrichment enables sensitive respiratory virus genomic identification and analysis by next generation sequencing. Genome Res. 28, 869–877 (2018).
- 9. Jia, X. et al. A streamlined clinical metagenomic sequencing protocol for rapid pathogen identification. Sci. Rep. 11, 4405 (2021).
- 10. Li, Z. J. et al. Etiological and epidemiological features of acute respiratory infections in China. Nat. Commun. 12, 5026 (2021).
- 11. Song, W. et al. Acute respiratory infections in children, before and after the COVID-19 pandemic, a sentinel study. J. Infect. 85, 90–122 (2022).
- 12. Gu, W., Miller, S. & Chiu, C. Y. Clinical metagenomic next-generation sequencing for pathogen detection. Annu. Rev. Pathol. 14, 319–338 (2019).
- 13. Miller, S. et al. Laboratory validation of a clinical metagenomic sequencing assay for pathogen detection in cerebrospinal fluid. Genome Res. 29, 831–842 (2019).
- 14. Schlaberg, R. et al. Validation of metagenomic next-generation sequencing tests for universal pathogen detection. Arch. Pathol. Lab. Med. 141, 776–786 (2017).
- 15. Wylie, K. M. et al. Detection of viruses in clinical samples by use of metagenomic sequencing and targeted sequence capture. J. Clin. Microbiol. 56 (2018).
- 16. Singer, R. et al. The increase in invasive bacterial infections with respiratory transmission in Germany, 2022/2023. Dtsch. Arztebl. Int. 121, 114–120 (2024).
- 17. Putsch, R. W., Hamilton, J. D. & Wolinsky, E. Neisseria meningitidis, a respiratory pathogen? J. Infect. Dis. 121, 48–54 (1970).
- 18. Jain, K. et al. Development of a capture sequencing assay for enhanced detection and genotyping of tick-borne pathogens. Sci. Rep. 11, 12384 (2021).
- 19. Farouni, R., Djambazian, H., Ferri, L. E., Ragoussis, J. & Najafabadi, H. S. Model-based analysis of sample index hopping reveals its widespread artifacts in multiplexed single-cell RNA-sequencing. Nat. Commun. 11, 2704 (2020).
- 20. Di, L. et al. RNA sequencing by direct tagmentation of RNA/DNA hybrids. Proc. Natl. Acad. Sci. USA 117, 2886–2893 (2020).
- 21. Chen, S., Zhou, Y., Chen, Y. & Gu, J. Fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
- 22. Langdon, W. B. Performance of genetic programming optimised Bowtie2 on genome comparison and analytic testing (GCAT) benchmarks. BioData Min. 8, 1 (2015).
- 23. Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
- 24. Kim, D., Song, L., Breitwieser, F. P. & Salzberg, S. L. Centrifuge: Rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729 (2016).
- 25. Goodacre, N., Aljanahi, A., Nandakumar, S., Mikailov, M. & Khan, A. S. A reference viral database (RVDB) to enhance bioinformatics analysis of high-throughput sequencing for novel virus detection. mSphere 3 (2018).
- 26. Milne, I. et al. Using tablet for visual exploration of second-generation sequencing data. Brief. Bioinform. 14, 193–202 (2013).
- 27. Chen, T. et al. The genome sequence archive family: Toward explosive data growth and diverse data types. Genom. Proteom. Bioinform. 19, 578–583 (2021).
- 28. CNCB-NGDC Members and Partners. Database resources of the National Genomics Data Center, China National Center for Bioinformation in 2024. Nucleic Acids Res. 52, D18–D32 (2024).