Abstract
Metagenomic sequencing is increasingly being used in clinical settings for difficult-to-diagnose cases. The performance of viral metagenomic protocols relies to a large extent on the bioinformatic analysis. In this study, the European Society for Clinical Virology (ESCV) Network on NGS (ENNGS) initiated a benchmark of metagenomic pipelines currently used in clinical virological laboratories.
Methods
Metagenomic datasets from 13 clinical samples from patients with encephalitis or viral respiratory infections characterized by PCR were selected. The datasets were analysed with 13 different pipelines currently used in virological diagnostic laboratories of participating ENNGS members. The pipelines and classification tools were: Centrifuge, DAMIAN, DIAMOND, DNASTAR, FEVIR, Genome Detective, Jovian, MetaMIC, MetaMix, One Codex, RIEMS, VirMet, and Taxonomer. Performance, characteristics, clinical use, and user-friendliness of these pipelines were analysed.
Results
Overall, viral pathogens with high loads were detected by all the evaluated metagenomic pipelines. In contrast, lower abundance pathogens and mixed infections were only detected by 3/13 pipelines, namely DNASTAR, FEVIR, and MetaMix. Overall sensitivity ranged from 77% (10/13) to 100% (13/13 datasets). Overall positive predictive value ranged from 71% to 100%. The majority of the pipelines classified sequences based on nucleotide similarity (8/13), only a minority used amino acid similarity, and 6 of the 13 pipelines assembled sequences de novo. No clear differences in performance were detected that correlated with these classification approaches. Read counts of target viruses varied between the pipelines over a range of 2-3 log, indicating differences in limit of detection.
Conclusion
A wide variety of viral metagenomic pipelines is currently used in the participating clinical diagnostic laboratories. Detection of low-abundance viral pathogens and mixed infections remains a challenge, indicating the need for standardization and validation of metagenomic analysis for clinical diagnostic use. Future studies should address the selective effects due to the choice of different reference viral databases.
Introduction
Viral metagenomic next-generation sequencing (mNGS) is increasingly being used in virology laboratories for the diagnosis of patients with suspected but unexplained infectious diseases. The current main clinical application of viral metagenomics is for diagnosing encephalitis of unknown cause [1, 2], but metagenomic sequencing is considered useful in a growing number of other clinical syndromes [3–6]. Although many wet-lab challenges need to be faced as well [14], the performance of metagenomic methods is largely dependent on accurate bioinformatic analysis, and both classification algorithms and databases are crucial factors determining the overall performance of the pipelines [7] [55]. A wide range of metagenomic pipelines and taxonomic classifiers have been developed, commonly for the purpose of biodiversity studies analysing the composition of the microbiome in different cohorts. In contrast, when applying metagenomics to patient diagnostics, potential false-negative and false-positive bioinformatic classification results can have significant consequences for patient care. Most reports on bioinformatic tools for metagenomic analysis for virus diagnostics describe algorithms and validations of a single in-house pipeline developed by the authors themselves [8–12]. Proficiency testing based on bioinformatic analysis of simulated in silico datasets has been reported [13], and recently a metagenomic benchmarking trial among Swiss virology laboratories was conducted [7]. Recently, ESCV Network on NGS (ENNGS) recommendations for the introduction of next-generation sequencing in clinical virology, part II: bioinformatic analysis and reporting were published [55], aiming to address the challenges involved.
While a professional External Quality Assessment (EQA) program is currently in preparation by Quality Control for Molecular Diagnostics (QCMD), the ENNGS [14] [55] conducted the presented benchmark of bioinformatic pipelines of the participating diagnostic laboratories using viral metagenomic datasets derived from clinical samples, in order to assist laboratories with selection and optimization of tools to be implemented for clinical use.
Methods
Datasets
To exclude differences in wet-lab procedures, the same raw, untrimmed metagenomic datasets were provided, so that the participants had standardized datasets for bioinformatic analysis.
In total, 13 clinical metagenomic datasets from samples well-characterized by RT-PCR [15–18] were selected from patients with encephalitis or respiratory complaints, including: cerebrospinal fluid (CSF, n=4), brain biopsies (n=3), nasopharyngeal swabs (n=3), nasal washings (n=1), bronchoalveolar lavage (n=1), and a plasma sample (n=1). RT-PCR panel results and Cq-values are included in the Results section. The pathogens in the 13 datasets are listed in Table 2.
Table 2. Qualitative and quantitative results: raw sequence read count categories of the PCR positive viruses reported by the metagenomic pipelines using datasets from 13 clinical samples, per classification tool (complete pipeline details can be found in Table 1).
Syndrome | Sample | Material | PCR result (Cq-value/c/ml) |
---|---|---|---|
Encephalitis | 1 | CSF | HHV-6 (25.9) |
Encephalitis | 2 | CSF, capture probes | HHV-6 (24.6) |
Encephalitis | 3 | CSF, capture probes | Enterovirus (26.3) |
Encephalitis | 4 | CSF, capture probes | EBV (29.1/3.8 log10) |
Encephalitis | 5 | Brain biopsy | Mumps (23.8) |
Encephalitis | 6 | Brain biopsy | CoV-OC43 (24) |
Encephalitis | 7 | Brain biopsy | Astrovirus VA1 (25) |
Respiratory disease | 8 | NP swab | Inf-A (24.8) |
Respiratory disease | 9 | NP swab | PIV-3 (31.5) |
Respiratory disease | 10 | NP swab | CoV-NL63 (28.6) |
Respiratory disease | 11 | BAL (mixed infection) | CoV-NL63 (24.2); CoV-HKU-1 (28.3) |
Respiratory disease | 12 | Nasal wash | HKU-1 (24.4) |
Fever | 13 | Plasma (mixed infection) | Adenovirus (28.8/5 log10); EBV (32.8/3.9 log10) |
Rows per classification tool (Centrifuge, DAMIAN, DIAMOND, DNASTAR, FEVIR, Genome Detective, Jovian, MetaMIC, MetaMix, One Codex, RIEMS, Taxonomer, VirMet) give the read count category for each PCR target as cell shading. Legend (read count): ND, 10^1, 10^2, 10^3, 10^4, 10^5, 10^6.
CSF; cerebrospinal fluid, NP; nasopharyngeal, BAL; bronchoalveolar lavage, and in legend: ND; not detected
For samples processed at the Great Ormond Street Hospital, London (GOSH), mRNA from the three brain biopsy samples was sequenced on an Illumina NextSeq500 instrument using an 81 bp paired-end run after library preparation using Illumina’s TruSeq Stranded mRNA LT sample preparation kit (p/n RS-122-2101) according to the manufacturer’s instructions [19]. The other samples were spiked with Equine Arteritis Virus (EAV) and Phocid Herpes Virus (PhHV) internal controls preceding total nucleic acid extraction using the MagNAPure 96 DNA and Viral NA Small Volume Kit (Roche Diagnostics, Almere, the Netherlands) and sequenced on Illumina NextSeq500 (respiratory samples) or NovaSeq6000 (CSF samples, plasma) instruments using 150 bp paired-end runs after library preparation using New England BioLabs’ NEBNext Ultra Directional RNA Library preparation kit for Illumina with in-house adaptations in order to enable simultaneous detection of both DNA and RNA viruses, at the Leiden University Medical Center (LUMC) [4, 20]. Three of the CSF samples were sequenced after enrichment using capture probes targeting vertebrate viruses [21]. Human reads were removed from the output FASTQ files by mapping them to human reference genome GRCh38 [22] with Bowtie2 version 2.3.4 [23] before the datasets were uploaded to various data sharing platforms (see below).
Data sharing
The FASTQ datasets were and remain publicly available for user-friendly downloading at https://veb.lumc.nl/CliniMG (hosted by the dept. MM, LUMC, Leiden), and part of the datasets were additionally accessible via a COMPARE Data Hub at http://www.ebi.ac.uk/ena/pathogens (hosted by the European Bioinformatics Institute, EMBL-EBI) [24].
Bioinformatic pipelines
The datasets were analysed in a blinded fashion by the participants, with the (viral) metagenomic pipelines and classification tools (Figure 1 and Table 1) used at their diagnostic laboratories: Centrifuge [25], DAMIAN [26, 27], DIAMOND [28], DNASTAR [29], FEVIR [30], Genome Detective [31], Jovian [32], MetaMIC [33], MetaMix [34, 35], One Codex [36], RIEMS [37, 38], Taxonomer [39], and VirMet [40]. DAMIAN was run by two participants, each in combination with a different database (pipeline A and B), and one participant ran both Centrifuge and Genome Detective. Details of the algorithms are described in Table 1.
Table 1. Clinical use, classification and output characteristics of metagenomic pipelines analysed.
Pipeline no. (alphabetical order) | 1 | 2 | 3 and 3A | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Classification tool or pipeline | Centrifuge [25] v1.0.1-beta | DAMIAN [26, 27] v190628 | DIAMOND [28] v0.9.13.114 | DNASTAR [29] Lasergene v16 | FEVIR [30] V1 | Genome Detective [31] v1.110 | Jovian [32] V0.9.6 | MetaMIC [33] v2.1.1 | MetaMix [34] [35] v1.2 | One Codex [36] v1 | RIEMS [37, 38] v4.0 | Taxonomer [39] 2020-U | VirMet [40] v1.1.1 |
Clinical usage by participant | Patient care | Experimental | Experimental | Patient care | Patient care | Patient care a | Patient care | Patient care a | Patient care a | Experimental | Patient care | Experimental | Patient care |
In-house/commercially available | Open-source software | Open-source software | In-house | Commercial | In-house | Commercial | In-house | In-house | In-house | Commercial | In-house | Commercial | In-house |
Local/web-based | Local | Local | Local | Local (cloud optional) | Local | Web-based | Local | Local | Web-based (hosted on Bluebee) | Web-based | Local | Web-based | Local |
De novo assembly | N | Y | Y [45] | N (optional) | N | Y | Y | Y | Y | N | Y | N | N |
Alignment of NT/ AA | NT | NT, AA | AA | NT | NT | NT, AA | NT | NT | AA | NT | NT | NT, AA | NT |
Database used by participant viral/bacterial (version) | Viruses, bacteria, fungi, archaea; RefSeq (compressed index) V2019-04-04 | NCBI’s nt and nr v2019-05-17, PFAM 30.0 | Viruses; NCBI’s non-redundant protein database/pipeline 3B: NCBI’s nt V0.9.22 | Based on NCBI’s nt V2020-01-08 | Viruses; based on Virosaurus [46] v90v_2018_11 | Viruses; based on RefSeq (filtering: Swissprot Uniref 90) v2018-11-14 | NCBI’s nt, v2019-11-30 a.o. (compressed index) | Bacteria, viruses, fungi; v2.1.1 based on NCBI’s nt (complete) | Human, Environmental, bacteria, Viruses RefSeq protein v2017 | Bacteria, viruses, fungi, archaea, protozoa, One Codex DB v2019-5-1 | NCBI’s nt (complete) v2019-3-16 | Based on NCBI’s nt (v May 2019) | Based on NCBI’s nt (selection of viral full genomes without compression) v224 |
Paired reads as input option | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | N | N |
Trimming and QC tools | Trimmomatic | Trimmomatic | Trimmomatic [47] | Included in DNASTAR | Included in virusscan [48] 1.0 | Trimmomatic, fastqc, multiqc | HMN Trimmer | TrimGalore!, prinseq | Cutadapt | Roche 454 newbler | N | Seqtk, prinseq | |
Exclusion of human reads | N | Y | N/pipeline 3B:Y b | Y | N | Y | Y | Y | Y | Y | N | N | Y |
Output type | Script, Krona | Excel | Script, krona | User-friendly | Script | Web interface, interactive | Web interface, interactive | Web interface, interactive, PDF and excel | Web interface, interactive, PDF and excel | Web interface, interactive | |||
Visualization of genome coverage | N | N | N | Y | N | Y | N | N | N (CLC Genomics Workbench) | N | Y | N (free version)/Y (paid) | Y |
Computational time for analysis per sample (CPU/RAM) | ~10 min (24 CPUs, 0.3 GB RAM) | 90 min (56 CPUs, 125 GB RAM) | 90 min (36 CPUs, 23 GB RAM) | 3 min (4 CPUs, 64 GB RAM) | 1 min (176 CPUs, 384 GB RAM) | ~10 min (web-based) | 12 min (20 CPUs, U GB RAM) | 18 min (U CPUs, U GB RAM) | 60-180 min (web-based) | 35-40 min (web-based) | U (48 CPUs, 768 GB RAM) | 10 min (web-based) | 20 min (16 CPUs, 64 GB RAM) |
Cut-off for defining positive result used | ≥15 reads [20] | Contig length ≥400 bp | 1 read | 1 read | ≥300 nt coverage | ≥3 regions, distributed [49] | U | Above background: environmental sample | ≥3 regions >10 reads [50] [51] Probability score | 1 read | 1 read | 1 read | ≥3 reads, distributed, >100x than NC/other samples |
Confirmatory analysis required for clinical reporting | BLAST, PCR | PCR for clinical cases | BLAST, PCR for clinical cases | BLAST, PCR for clinical cases | BLAST | U | U | PCR not required (based on validation) [52–54] | BLAST, coverage (PCR not required based on validation) | Not required | BLAST, PCR | BLAST | U |
a Within the scope of accreditation
b Mapping of the trimmed reads to HG38 by Bowtie with “very-sensitive” option
AA; amino acid, NT; nucleotide, U; Undisclosed
Performance characteristics
Both qualitative and quantitative performance of the pipelines were analysed with real-time PCR results as gold standard. The following parameters available for all pipelines were considered: pathogen detection, taxonomic classification level and target read count. Additionally, horizontal genome coverage (if available), computational time, user-friendliness and output formats were considered. Since EAV and PhHV were added as internal controls and were not reported by the participants (due to default reporting criteria, or absence in the database), they were not included in the comparative analysis.
Results
Metagenomic pipeline characteristics
In total, 13 different metagenomic pipelines and classification tools were in use in the 13 participating diagnostic laboratories. Clinical use, classification and output characteristics of the pipelines and tools utilized are shown in Figure 1 and Table 1. The majority of the pipelines were developed or adapted at a local site, while four pipelines were commercially available and web-based: DNASTAR (Madison, WI, USA), Genome Detective (Emweb bv, Herent, Belgium), One Codex (San Francisco, USA), and Taxonomer (Utah, USA). DAMIAN and Centrifuge are publicly available as open-source software. Both classification tools and reference databases differed among participants (and were fixed for end-users of the commercially available pipelines); (adapted versions of) NCBI’s nucleotide and RefSeq databases were most commonly used to generate reference databases. Six of the 13 pipelines assembled sequence reads de novo, whereas the others classified unassembled reads. The majority of the pipelines classified reads based on nucleotide similarity (8/13), and a minority used amino acid similarity (2/13), or a combination of both (3/13 pipelines). Parameters used by the participants for defining a positive result were the number of virus reads, horizontal genome coverage (some of the participants), a cut-off based on posterior-probability scores of the species presence (MetaMix), and ROC curves. Output formats varied; the majority had a user-friendly output format: Excel, PDF or interactive web page. Examples of these user-friendly output formats are shown in Supplementary Figure S1.
Detection of PCR targeted viral pathogens; sensitivity
The qualitative and quantitative results of the pipeline benchmarking for viruses detected by RT-PCR are shown in Table 2 and Figure 2. Overall, higher abundance viral pathogens (Cq-value < 28) were detected by all metagenomic pipelines evaluated. In contrast, viral pathogens with an RT-PCR Cq-value of 28 or higher, including mixed virus infections, were only detected by 3/13 pipelines, namely DNASTAR, FEVIR, and MetaMix. Although participants analysed the same FASTQ files, read counts of the target viruses varied from one to several orders of magnitude across pipelines. Also, read counts (all datasets combined) achieved by participants did not correlate well with the viral load as measured by RT-PCR (R=-0.07, P-value 0.5). It must be noted, however, that wet lab procedures varied per set of samples, including protocols with and without viral enrichment, which potentially affected the viral read counts and thus the correlation with Cq-values. Overall sensitivity of the pipelines at sample level ranged from 77% (10/13) to 100% (13/13 samples, mixed infections counted as one) (Table 2 and Supplementary Table 2). At viral mNGS hit level, overall sensitivity ranged from 80% (12/15) to 100% (15/15 viral hits) (Supplementary Table 4). One of the participants reported normalized reads including the genome length, using the following formula: RPKM = (number of reads mapped to virus genome Y × 10^6) / (total number of reads × length of genome in kb). This formula was also used to normalize the reads of all study pipelines shown in Figure 2.
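The RPKM normalization quoted above can be written as a small function. This is a minimal sketch; the function and parameter names are illustrative and the numbers in the usage line are hypothetical, not values from the study.

```python
def rpkm(virus_reads: int, total_reads: int, genome_length_bp: int) -> float:
    """Reads per kilobase of genome per million sequenced reads.

    Implements: (virus_reads * 10^6) / (total_reads * genome length in kb).
    """
    if total_reads <= 0 or genome_length_bp <= 0:
        raise ValueError("total_reads and genome_length_bp must be positive")
    genome_length_kb = genome_length_bp / 1_000
    return virus_reads * 1_000_000 / (total_reads * genome_length_kb)


# Hypothetical example: 500 reads on a 10 kb genome in a 5 million read dataset.
example_rpkm = rpkm(virus_reads=500, total_reads=5_000_000, genome_length_bp=10_000)
print(example_rpkm)  # 10.0
```

Normalizing by genome length and dataset size makes read counts from viruses with different genome sizes, and from runs with different sequencing depth, more comparable than raw counts.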
Taxonomic level of classification
The taxonomic levels of classification and typing of pathogenic viruses by the metagenomic pipelines with the settings used and reported by the participants are shown in Figure 3 and Supplementary Table 3. The classification level is dependent on the database used, algorithm settings (classification of reads to the lowest common ancestor, LCA, in case of multiple hits), and the participant’s default reporting levels based on either in-house validation data or clinical relevancy. Species level classification was the most common level reported. Serotype and strain level were identified by tools that were combined with NCBI’s nt database without the LCA setting. DAMIAN was the only tool to report classification at the isolate level.
For the Adenovirus sample (#13), the virus types reported were not consistent between pipelines: human Adenovirus type 31 (DIAMOND, Jovian, DNASTAR, VirMet), type 12 (DAMIAN), and type 31 or 61 (MetaMIC), indicating that type classification was not always correct. Types 12 and 31 are both subgroup A Adenoviruses, whereas type 61 is a type 31 recombinant virus.
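The effect of an LCA setting on the reported taxonomic level can be illustrated with a short sketch. It assumes each database hit is represented by a root-to-leaf lineage; the lineages below are simplified and illustrative, and the function is not taken from any of the benchmarked pipelines.

```python
from typing import Optional, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> Optional[str]:
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    if not lineages:
        return None
    shared = None
    # zip stops at the shortest lineage, so only comparable ranks are checked.
    for taxa_at_rank in zip(*lineages):
        if all(taxon == taxa_at_rank[0] for taxon in taxa_at_rank):
            shared = taxa_at_rank[0]
        else:
            break
    return shared


# Simplified, illustrative lineages for a read matching two adenovirus types.
hit_type_31 = ("Viruses", "Adenoviridae", "Human mastadenovirus A", "Human adenovirus 31")
hit_type_12 = ("Viruses", "Adenoviridae", "Human mastadenovirus A", "Human adenovirus 12")

print(lowest_common_ancestor([hit_type_31, hit_type_12]))  # Human mastadenovirus A
```

With LCA enabled, a read that matches several types equally well is reported at the shared species level; without it, the tool reports whichever type-level hit it ranks first, which is one way inconsistent type calls can arise.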
Additional virus hits and positive predictive value
Additional viruses, either not tested for by RT-PCR or RT-PCR negative, were reported by 11 out of 13 pipelines in one or more samples (Supplementary Table 4). The following additional viruses were reported by multiple pipelines and were absent in the negative run control (dataset not available for the participants): human retrovirus RD114 (2-2102 reads, up to 28% genome coverage), feline leukemia virus (2-1406 reads), torque-teno virus (TTV) (18-66 reads, up to 7% genome coverage), polyomaviruses (5-41 reads, up to 37% genome coverage), Bovine viral diarrhea virus (BVDV) (6-220 reads, likely FBS contaminants), human metapneumovirus (HMPV) (15-21 reads, 9% genome coverage), human rhinovirus (HRV) (2-4 reads, up to 5% genome coverage), human parainfluenzavirus-4 (PIV-4) (2-6 reads) and Dengue virus (18-370 reads). RT-PCR data were available for some of the additional viruses detected (Supplementary Table 4). When considering viral mNGS hits with negative RT-PCR results, namely CoV-NL63 (1 read), PIV-4 (2-6 reads), HRV-C (2-4 reads), CoV-OC43 (5 reads) and INF-B (2 reads), the positive predictive value ranged from 71% to 100% (Figure 4). It must be noted that for these mNGS hits, no distinction could be made between assignments of sequences genuinely present in the dataset (e.g. through index hopping, which was suspected given the low number of reads), false negative PCR results due to primer/probe mismatches, and false positive assignments. When considering the mNGS findings without available RT-PCR results, retrovirus RD114, leukemia virus, TTV, and polyomavirus sequences may actually be present given their association with the host (integrated or commensal).
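The two summary metrics used in this benchmark follow the standard definitions, sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP). The sketch below reproduces the sample-level and hit-level sensitivity bounds quoted in the text; the PPV counts in the last line are hypothetical and serve only to show the calculation.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of PCR-positive targets that the pipeline reported."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no PCR-positive targets to evaluate")
    return true_positives / total


def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Fraction of reported hits that were confirmed by PCR."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no reported hits to evaluate")
    return true_positives / total


# Lower bounds reported in the text.
print(round(sensitivity(10, 3) * 100))  # 77 (10 of 13 samples)
print(round(sensitivity(12, 3) * 100))  # 80 (12 of 15 viral hits)

# Hypothetical counts: 5 PCR-confirmed hits and 2 PCR-negative hits.
print(round(positive_predictive_value(5, 2) * 100))  # 71
```

Specificity is absent from this sketch for the reason given in the Discussion: the number of true negative findings is not known for clinical datasets.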
Reporting criteria
Reporting criteria used by the participants are shown in Table 1: a threshold for number of reads, for genome coverage (number of nucleotides and proportion of the genome, or a certain number of genome regions covered), based on reference or in-house validation studies. A BLAST analysis of matching sequences was commonly used by the study participants to exclude false positive (or to confirm true positive) hits. Some participants indicated that for clinical samples outside of the current benchmark, they required a confirmatory PCR before reporting while others indicated that this was not needed based on experiences from their validation studies.
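A reporting rule that combines a read threshold with a requirement for reads distributed over several genome regions could look like the sketch below. The participants' exact definitions of a "region" are not given here, so the equal-width binning, the default thresholds and all names are assumptions made for illustration.

```python
from typing import Iterable


def is_reportable(
    read_starts: Iterable[int],
    genome_length: int,
    min_reads: int = 3,
    min_regions: int = 3,
    n_bins: int = 10,
) -> bool:
    """Apply a read-count cut-off plus a 'distributed over the genome' check.

    Assumption: a region is one of n_bins equal-width bins along the genome.
    """
    if genome_length <= 0 or n_bins <= 0:
        raise ValueError("genome_length and n_bins must be positive")
    # Keep only 0-based alignment start positions that fall on the genome.
    starts = [pos for pos in read_starts if 0 <= pos < genome_length]
    if len(starts) < min_reads:
        return False
    bin_size = genome_length / n_bins
    covered_bins = {min(int(pos / bin_size), n_bins - 1) for pos in starts}
    return len(covered_bins) >= min_regions


# Hypothetical alignments on a 10 kb genome.
print(is_reportable([100, 4000, 9000], 10_000))      # True: three separate regions
print(is_reportable([100, 150, 200, 250], 10_000))   # False: all reads in one region
print(is_reportable([100, 5000], 10_000))            # False: fewer than three reads
```

A single-read cut-off corresponds to `min_reads=1, min_regions=1`, which is why such pipelines typically lean on BLAST or PCR confirmation to reject false positive hits.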
Discussion
This study aimed to benchmark the combination of bioinformatic tools and databases currently in use in diagnostic virology laboratories from the ESCV ENNGS network. The data presented here support bioinformatic selection and optimization of software for the implementation of viral metagenomic sequencing for pathogen detection in clinical samples. To our knowledge, this is the first large-scale international benchmarking study using datasets from clinical samples and pipelines currently applied in a large series of clinical diagnostic laboratories.
The study showed that the pipelines of all the participating laboratories succeeded in detecting viral pathogens with relatively high viral loads (Cq-values <28), whereas lower-abundance pathogens and mixed infections were only detected by some of the pipelines, namely DNASTAR, FEVIR, and MetaMix. These results are in line with other reports [7]. With regard to mixed infections, the less abundant viruses were generally missed, possibly due to the low number of reads, or reporting considerations. For the missed CoV-HKU1 virus, potential primer cross-reactivity with CoV-NL63 viruses was excluded by in silico analysis. The databases used in the pipelines were mostly custom-made, based on either NCBI’s RefSeq [41] or nt database [42]. All of the participants used different classification tools, though no selection of laboratories using different tools was made in advance. Given the inclusion of different types of pipelines including commercially available ones with fixed databases, it was not feasible to compare the different tools with one standardised database at the local sites. Two of the three pipelines that reached 100% sensitivity included NCBI’s nt database but this was also seen using a pipeline with NCBI’s RefSeq database. Pipelines with NCBI’s nt database scored both low and maximum precision. The design did allow for comparison of the complete pipeline in use for clinical diagnostics, from QC to reporting algorithms including posterior probability scores. No clear differences in performance were observed between nucleotide-based and amino acid-based classification, or between de novo assembly-based algorithms and read-based classification: whereas amino acid-based classification may be more sensitive for detecting variants, two of the three pipelines with 100% sensitivity used nucleotide-based classification (DNASTAR, FEVIR).
High precision was reached by pipelines that used de novo assembly but this was not essential: 3/8 pipelines with 100% precision did not use de novo assembly (Centrifuge, Taxonomer, One Codex).
Reported read counts and genome coverage varied between pipelines, for read counts by up to several orders of magnitude, explaining in part the differences observed in limits of detection for samples with very low viral load. Differences in reporting of uniquely versus non-uniquely mapped sequence reads may contribute to this variation. Sensitivity and positive predictive value were measured, conveniently avoiding the proportion of true negative findings given the immense but unknown number of negative mNGS hits without RT-PCR data needed for specificity calculations. This aspect remains a limitation intrinsically linked to mNGS validations with clinical datasets. Datasets from negative matrix samples would have been of use for specificity calculations. Positive predictive value calculations were hampered by the intrinsic inability to distinguish between sequences actually present in the sample but undetected by RT-PCR (for instance because of primer mismatches), index-hopping reads, and contaminant sequences introduced during library preparation. This may partially be overcome by defining mNGS consensus results as an alternative gold standard; however, in diagnostic settings index-hopping reads, for example, should not be labelled positive despite being actually present in the dataset. A study design using synthetic data, for example spiked into real data to mimic real-life situations, would enable accurate estimation of the specificity and PPV whilst taking into consideration interfering real-life factors such as sequences with switched indices and ‘kitome’ sequences, present in every single dataset. Future studies could address specificity analysis using artificial datasets that take into account the index hopping phenomenon.
It is important to note that participants likely have optimized their interpretation algorithm, including cut-offs, for their specific workflow from library preparation to sequencing. A different wet lab procedure (sequencer with or without index hopping, preparation with or without probe enrichment) will require new validation and indexing of the determined cut-off values and probability values. Because this was a dry lab comparison exercise, the participants could not follow their routine wet lab workflow and confirmatory PCR steps, which may have affected the reporting of results. Therefore no conclusions can be drawn on the limit of detection of the full metagenomic workflows used in each specific laboratory. The two labs that distributed the datasets may have had an advantage in the analyses, since their cut-offs for defining a positive result have likely been adjusted to the specific wet lab procedure used. However, the scores from the pipelines in use in these labs were not higher for the specific datasets delivered. On the contrary, all pipelines detected all samples distributed by lab A, and other participants were more successful in detecting some of the most challenging samples distributed by lab B.
Genome coverage and depth were not always taken into account by the participating laboratories, but they can be effective parameters to distinguish between (PCR-)contaminants, often indicated by high depth at a small (PCR amplicon) region of the genome, and true positives [21, 55]. In five of the participating laboratories a cut-off of one single read was chosen for defining a positive mNGS result. While potentially at higher risk of reporting false positive results, the PPV of these pipelines ranged from 72% up to 100%, indicating that the effect of this cut-off depended on the overall steps of the analysis and reporting. ROC analysis was used to find the optimal balance between sensitivity and specificity [20].
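The distinction between coverage breadth and depth described above can be sketched as follows. The input is assumed to be a per-position depth list for one reference genome; the thresholds (at most 5% of the genome covered, mean depth of at least 100 over covered positions) are hypothetical values chosen for illustration and would need validation per workflow.

```python
from typing import Sequence, Tuple


def coverage_summary(depths: Sequence[int]) -> Tuple[float, float]:
    """Return (breadth, mean depth over covered positions) for one genome."""
    if len(depths) == 0:
        raise ValueError("depths must contain one value per genome position")
    covered = [d for d in depths if d > 0]
    breadth = len(covered) / len(depths)
    mean_covered_depth = sum(covered) / len(covered) if covered else 0.0
    return breadth, mean_covered_depth


def looks_like_amplicon_contaminant(
    depths: Sequence[int],
    max_breadth: float = 0.05,  # hypothetical threshold
    min_depth: float = 100.0,   # hypothetical threshold
) -> bool:
    """Flag high depth confined to a small part of the genome."""
    breadth, mean_covered_depth = coverage_summary(depths)
    return 0 < breadth <= max_breadth and mean_covered_depth >= min_depth


# Hypothetical 10 kb genomes: a 300 nt pile-up versus low, widespread coverage.
amplicon_like = [0] * 9_700 + [500] * 300
widespread = [2] * 6_000 + [0] * 4_000

print(looks_like_amplicon_contaminant(amplicon_like))  # True
print(looks_like_amplicon_contaminant(widespread))     # False
```

A hit flagged this way is not necessarily a contaminant; the flag only marks profiles that merit a closer look before reporting.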
Finally, our taxonomic results are in line with data available from other groups [43]: the pipelines performed well at species level but deeper level classification was subject to less reliable classification in some cases.
In conclusion, a wide variety of viral metagenomic pipelines with overall high sensitivity are currently used in the ESCV ENNGS participating clinical diagnostic laboratories. Detection of low-abundance viral pathogens or mixed infections remains a challenge, indicating the need for standardization and validation of metagenomic analysis for clinical diagnostic use [44]. The algorithm for defining positive results and rejecting false positive results is critical and should be evaluated individually for every workflow, which includes genome extraction, library preparation, sequencer and bioinformatic pipeline. Identification of deeper taxonomic levels is challenging, dependent on the individual types present in the reference database, and should be validated separately to prevent misidentification.
Supplementary Material
Acknowledgements
We thank the COMPARE study group (https://www.compare-europe.eu/) and the EMBL-EBI (https://www.ebi.ac.uk/) for the availability of the Data Hubs.
Funding
MH was supported by the Clinical Research Priority Program ‘Comprehensive Genomic Pathogen Detection’ of the University of Zurich.
FXL receives funding from Instituto de Salud Carlos III, Spain (Grant numbers PI18/01824 and PI18/01759 and CIBEResp).
Footnotes
Author contributions: Conceptualization: JV, EC. Methodology: JV, EC, FXL, NF, IS, BO, SB. Data analysis: JV, JuB, NF, IS, SM, JH, BO, AS, AB, CR, GG, EK, LB, CB, JK, SC, FL, DS, MB, DH, MH, VK, MZ, AL, AP. Visualization: JV, AL. First draft: JV. Reviewing and editing: all authors
Conflict of interest: none
Contributor Information
Jutte J.C. de Vries, Email: jjcdevries@lumc.nl.
Julianne R. Brown, Email: julianne.brown@gosh.nhs.uk.
Nicole Fischer, Email: nfischer@uke.de.
Igor A. Sidorov, Email: I.A.Sidorov@lumc.nl.
Sofia Morfopoulou, Email: sofia.morfopoulou.10@ucl.ac.uk.
Jiabin Huang, Email: j.huang@uke.de.
Bas B. Oude Munnink, Email: b.oudemunnink@erasmusmc.nl.
Arzu Sayiner, Email: arzu.sayiner@deu.edu.tr.
Alihan Bulgurcu, Email: alihanbulgurcu@gmail.com.
Christophe Rodriguez, Email: christophe.rodriguez@aphp.fr.
Guillaume Gricourt, Email: guillaume.gricourt@aphp.fr.
Els Keyaerts, Email: els.keyaerts@kuleuven.be.
Leen Beller, Email: leen.beller@kuleuven.be.
Claudia Bachofen, Email: claudia.bachofen@uzh.ch.
Jakub Kubacki, Email: jakub.kubacki@uzh.ch.
Samuel Cordey, Email: Samuel.Cordey@hcuge.ch.
Florian Laubscher, Email: Florian.Laubscher@hcuge.ch.
Dennis Schmitz, Email: Dennis.Schmitz@RIVM.nl.
Martin Beer, Email: martin.beer@fli.de.
Dirk Hoeper, Email: dirk.hoeper@fli.de.
Michael Huber, Email: huber.michael@virology.uzh.ch.
Verena Kufner, Email: kufner.verena@virology.uzh.ch.
Maryam Zaheri, Email: zaheri.maryam@virology.uzh.ch.
Aitana Lebrand, Email: aitana.lebrand@sib.swiss.
Anna Papa, Email: annap@auth.gr.
Sander van Boheemen, Email: s.vanboheemen@erasmusmc.nl.
Aloys C.M. Kroes, Email: A.C.M.Kroes@lumc.nl.
Judith Breuer, Email: breuej@gosh.nhs.uk.
F. Xavier Lopez-Labrador, Email: F.Xavier.Lopez@uv.es.
Eric C.J. Claas, Email: E.C.J.Claas@lumc.nl.
References
- 1. Brown JR, Bharucha T, Breuer J. Encephalitis diagnosis using metagenomics: application of next generation sequencing for undiagnosed cases. J Infect. 2018;76(3):225–240. doi: 10.1016/j.jinf.2017.12.014.
- 2. Wilson MR, et al. Clinical Metagenomic Sequencing for Diagnosis of Meningitis and Encephalitis. N Engl J Med. 2019;380(24):2327–2340. doi: 10.1056/NEJMoa1803396.
- 3. Jerome H, et al. Metagenomic next-generation sequencing aids the diagnosis of viral infections in febrile returning travellers. J Infect. 2019;79(4):383–388. doi: 10.1016/j.jinf.2019.08.003.
- 4. van Boheemen S, et al. Retrospective Validation of a Metagenomic Sequencing Protocol for Combined Detection of RNA and DNA Viruses Using Respiratory Samples from Pediatric Patients. J Mol Diagn. 2020;22(2):196–207. doi: 10.1016/j.jmoldx.2019.10.007.
- 5. Lewandowska DW, et al. Metagenomic sequencing complements routine diagnostics in identifying viral pathogens in lung transplant recipients with unknown etiology of respiratory infection. PLoS One. 2017;12(5):e0177340. doi: 10.1371/journal.pone.0177340.
- 6. Kufner V, et al. Two Years of Viral Metagenomics in a Tertiary Diagnostics Unit: Evaluation of the First 105 Cases. Genes (Basel). 2019;10(9). doi: 10.3390/genes10090661.
- 7. Junier T, et al. Viral Metagenomics in the Clinical Realm: Lessons Learned from a Swiss-Wide Ring Trial. Genes (Basel). 2019;10(9). doi: 10.3390/genes10090655.
- 8. Chen J, Huang J, Sun Y. TAR-VIR: a pipeline for TARgeted VIRal strain reconstruction from metagenomic data. BMC Bioinformatics. 2019;20(1):305. doi: 10.1186/s12859-019-2878-2.
- 9. Miller S, et al. Laboratory validation of a clinical metagenomic sequencing assay for pathogen detection in cerebrospinal fluid. Genome Res. 2019;29(5):831–842. doi: 10.1101/gr.238170.118.
- 10. Paez-Espino D, et al. Nontargeted virus sequence discovery pipeline and virus clustering for metagenomic data. Nat Protoc. 2017;12(8):1673–1682. doi: 10.1038/nprot.2017.063.
- 11. Li Y, et al. VIP: an integrated pipeline for metagenomics of virus identification and discovery. Sci Rep. 2016;6:23774. doi: 10.1038/srep23774.
- 12. Nooij S, et al. Overview of Virus Metagenomic Classification Methods and Their Biological Applications. Front Microbiol. 2018;9:749. doi: 10.3389/fmicb.2018.00749.
- 13. Brinkmann A, et al. Proficiency Testing of Virus Diagnostics Based on Bioinformatics Analysis of Simulated In Silico High-Throughput Sequencing Data Sets. J Clin Microbiol. 2019;57(8). doi: 10.1128/JCM.00466-19.
- 14. Lopez-Labrador FX, Brown JR, Fischer N, Harvala H, Van Boheemen S, Cinek O, Sayiner A, Vasehus Madsen T, Auvinen E, et al. Recommendations for the introduction of metagenomic high-throughput sequencing in clinical virology, part I: wet lab procedure. J Clin Virol. 2020 Dec. doi: 10.1016/j.jcv.2020.104691.
- 15. Kalpoe JS, et al. Validation of clinical application of cytomegalovirus plasma DNA load measurement and definition of treatment criteria by analysis of correlation to antigen detection. J Clin Microbiol. 2004;42(4):1498–1504. doi: 10.1128/JCM.42.4.1498-1504.2004.
- 16. Read SJ, Kurtz JB. Laboratory diagnosis of common viral infections of the central nervous system by using a single multiplex PCR screening assay. J Clin Microbiol. 1999;37(5):1352–1355. doi: 10.1128/jcm.37.5.1352-1355.1999.
- 17.Lankester AC, et al. Epstein-Barr virus (EBV)-DNA quantification in pediatric allogenic stem cell recipients: prediction of EBV-associated lymphoproliferative disease. Blood. 2002;99(7):2630–1. doi: 10.1182/blood.v99.7.2630. [DOI] [PubMed] [Google Scholar]
- 18.Loens K, et al. Performance of different mono-and multiplex nucleic acid amplification tests on a multipathogen external quality assessment panel. J Clin Microbiol. 2012;50(3):977–87. doi: 10.1128/JCM.00200-11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Morfopoulou S, et al. Deep sequencing reveals persistence of cell-associated mumps vaccine virus in chronic encephalitis. Acta Neuropathol. 2017;133(1):139–147. doi: 10.1007/s00401-016-1629-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.van Rijn AL, et al. The respiratory virome and exacerbations in patients with chronic obstructive pulmonary disease. PLoS One. 2019;14(10):e0223952. doi: 10.1371/journal.pone.0223952. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Carbo EC, et al. Improved diagnosis of viral encephalitis in adult and pediatric hematological patients using viral metagenomics. J Clin Virol. 2020;130:104566. doi: 10.1016/j.jcv.2020.104566. [DOI] [PubMed] [Google Scholar]
- 22. [Accessed July 2020]. https://www.ncbi.nlm.nih.gov/assembly/GCF000001405.26/
- 23.Langmead B, SLS Fast gapped-read alignment with Bowtie 2. Nat Methods and hdon. 2012 Apr;9(4):357–359. doi: 10.1038/nmeth.1923. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Amid C, et al. The COMPARE Data Hubs. Database-the Journal of Biological Databases and Curation. 2019:1–14. doi: 10.1093/database/baz136. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Kim D, et al. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016;26(12):1721–1729. doi: 10.1101/gr.210641.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Alawi M, et al. DAMIAN: an open source bioinformatics tool for fast, systematic and cohort based analysis of microorganisms in diagnostic samples. Sci Rep. 2019;9(1):16841. doi: 10.1038/s41598-019-52881-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. https://sourceforge.net/projects/damian-pd .
- 28.Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015;12(1):59–60. doi: 10.1038/nmeth.3176. [DOI] [PubMed] [Google Scholar]
- 29. https://www.dnastar.com/software/lasergene/
- 30.Fernandes JF, et al. Unbiased metagenomic next-generation sequencing of blood from hospitalized febrile children in Gabon. Emerg Microbes Infect. 2020;9(1):1242–1244. doi: 10.1080/22221751.2020.1772015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Vilsker M, et al. Genome Detective: an automated system for virus identification from high-throughput sequencing data. Bioinformatics. 2019;35(5):871–873. doi: 10.1093/bioinformatics/bty695. http://www.genomedetective.com . [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. https://github.com/DennisSchmitz/Jovian .
- 33.Rodriguez C, et al. Pathogen identification by shotgun metagenomics of patients with necrotizing soft-tissue infections. Br J Dermatol. 2019 doi: 10.1111/bjd.18611. [DOI] [PubMed] [Google Scholar]
- 34.Morfopoulou S, Plagnol V. Bayesian mixture analysis for metagenomic community profiling. Bioinformatics. 2015;31(18):2930–8. doi: 10.1093/bioinformatics/btv317. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. https://cran.r-project.org/web/packages/metaMix/index.html .
- 36.Minot SS. One Codex: a sensitive and accurate data platform for genomic microbial identification. bioRxiv. 2015 [Google Scholar]
- 37.Scheuch M, Hoper D, Beer M. RIEMS: a software pipeline for sensitive and comprehensive taxonomic classification of reads from metagenomics datasets. BMC Bioinformatics. 2015;16:69. doi: 10.1186/s12859-015-0503-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. https://github.com/EBI-COMMUNITY/fli-RIEMS .
- 39.Flygare S, et al. Taxonomer: an interactive metagenomics analysis portal for universal pathogen detection and host mRNA expression profiling. Genome Biol. 2016;17(1):111. doi: 10.1186/s13059-016-0969-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. https://github.com/medvir/VirMet and https://github.com/medvir/shiny-server/tree/master/NGS/VirMetRunAnalysis.
- 41.O’Leary NA, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):D733-45. doi: 10.1093/nar/gkv1189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Benson DA, et al. GenBank. Nucleic Acids Res. 2011;39(Database issue):D32-7. doi: 10.1093/nar/gkq1079. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Sczyrba A, et al. Critical Assessment of Metagenome Interpretation-a benchmark of metagenomics software. Nat Methods. 2017;14(11):1063–1071. doi: 10.1038/nmeth.4458. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Bharucha T, et al. STROBE-metagenomics: a STROBE extension statement to guide the reporting of metagenomics studies. Lancet Infect Dis. 2020;20(10):e251–e260. doi: 10.1016/S1473-3099(20)30199-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Nurk S, et al. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017;27(5):824–834. doi: 10.1101/gr.213959.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. https://viralzone.expasy.org/8676 .
- 47.Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114–20. doi: 10.1093/bioinformatics/btu170. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. https://github.com/sib-swiss/virusscan .
- 49.Carbo EC, B E, Karelioti E, Sidorov I, Feltkamp MCW, Von dem Borne PA, Verschuuren Jan JGM, Kroes ACM, Claas ECJ, De Vries JJC. Improved diagnosis of viral encephalitis in adults and pediatric hematological patients using viral metagenomics. bioRxiv. 2020 doi: 10.1016/j.jcv.2020.104566. [DOI] [PubMed] [Google Scholar]
- 50.Mongkolrattanothai K, Dien Bard J. The utility of direct specimen detection by Sanger sequencing in hospitalized pediatric patients. Diagn Microbiol Infect Dis. 2017;87(2):100–102. doi: 10.1016/j.diagmicrobio.2016.10.024. [DOI] [PubMed] [Google Scholar]
- 51.Kawada J, et al. Identification of Viruses in Cases of Pediatric Acute Encephalitis and Encephalopathy Using Next-Generation Sequencing. Sci Rep. 2016;6:33452. doi: 10.1038/srep33452. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Rodriguez C, et al. Pathogen identification by shotgun metagenomics of patients with necrotizing soft-tissue infections. Br J Dermatol. 2020;183(1):105–113. doi: 10.1111/bjd.18611. [DOI] [PubMed] [Google Scholar]
- 53.Rodriguez C, et al. Fatal Measles Inclusion-Body Encephalitis in Adult with Untreated AIDS, France. Emerg Infect Dis. 2020;26(9):2231–2234. doi: 10.3201/eid2609.200366. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Rodriguez C, et al. Fatal Encephalitis Caused by Cristoli Virus, an Emerging Orthobunyavirus, France. Emerg Infect Dis. 2020;26(6):1287–1290. doi: 10.3201/eid2606.191431. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.De Vries JJC, et al. Recommendations for the introduction of next-generation sequencing in clinical virology, part II: bioinformatic analysis and reporting. J Clin Virol. 2021:104812. doi: 10.1016/i.icv.2021.104812. [DOI] [PubMed] [Google Scholar]