Briefings in Bioinformatics. 2024 Nov 16;25(6):bbae597. doi: 10.1093/bib/bbae597

MetaAll: integrative bioinformatics workflow for analysing clinical metagenomic data

Martin Bosilj 1, Alen Suljič 2, Samo Zakotnik 3, Jan Slunečko 4, Rok Kogoj 5, Misa Korva 6,
PMCID: PMC11568877  PMID: 39550223

Abstract

Over the past decade, there have been many improvements in the field of metagenomics, including sequencing technologies, advances in bioinformatics and the development of reference databases, but a one-size-fits-all sequencing and bioinformatics pipeline does not yet seem achievable. In this study, we address the bioinformatics part of the analysis by combining three methods into a three-step workflow that increases the sensitivity and specificity of clinical metagenomics and improves pathogen detection. The individual tools are combined into MetaAll, a user-friendly workflow suitable for analysing short paired-end (PE) and long reads from metagenomics datasets. To demonstrate the applicability of the developed workflow, four complicated clinical cases with different disease presentations and multiple samples collected from different biological sites as well as the CAMI Clinical pathogen detection challenge dataset were used. MetaAll was able to identify putative pathogens in all but one case. In this case, however, traditional microbiological diagnostics were also unsuccessful. In addition, co-infection with Haemophilus influenzae and Human rhinovirus C54 was detected in case 1 and co-infection with SARS-CoV-2 and Influenza A virus (FluA) subtype H3N2 was detected in case 3. In case 2, in which conventional diagnostics could not find a pathogen, mNGS pointed to Klebsiella pneumoniae as the suspected pathogen. Finally, this study demonstrated the importance of combining read classification, contig validation and targeted reference mapping for more reliable detection of infectious agents in clinical metagenome samples.

Keywords: bioinformatics workflow, clinical metagenomics, pathogen detection, short PE reads, long reads

Introduction

Metagenomic next-generation sequencing (mNGS) is revolutionizing the understanding of the diversity and distribution of microbial populations in different environments [1]. Due to the indiscriminate generation of millions of nucleic acid sequences, mNGS has the potential to detect multiple microorganisms at once, including novel pathogens that may be present in a clinical sample [2]. Therefore, it should be suitable for the identification of all pathogens, including variants that differ from typical polymerase chain reaction (PCR) amplification targets, pathogens whose association with a particular clinical syndrome is unknown, and novel pathogens that cannot be detected by target-based methods [3–5]. Although multiplex PCR was developed to improve on the diagnostic shortcomings of singleplex PCR by enabling the simultaneous detection of multiple targets in a single reaction, it still requires prior knowledge of the pathogens of interest, which limits its application [6]. A growing number of publications show that mNGS is adequately sensitive and can detect various bacteria, fungi and viruses [7–9]. In particular, mNGS has proven invaluable in the diagnosis of unexplained pneumonia and diseases of unknown origin [10, 11]. Moreover, in two recent studies, it was the only method that successfully identified the causative pathogen [12, 13].

In this regard, the detection of RNA viruses represents a particular challenge. The emergence of new viral species due to antigenic shift and drift is a continuous process, and human-animal interactions, which are an important source of emerging pathogens, are increasing [14, 15]. Due to the diversity of viruses and their ability to rapidly evolve and adapt, the discovery and classification of viral sequences is challenging [16]. Viral metagenomics can help identify new and emerging viruses and understand the dynamics of viral infections in different environments [17–19]. The information gained from such studies can not only benefit patients but also contribute to the development of public health strategies for the prevention and treatment of viral diseases, e.g., to identify new targets for antiviral drugs or to develop vaccines [20–22].

The entire mNGS process involves challenges in both ‘wet-lab’ and ‘dry-lab’ procedures. The scope of the process, from sample collection to final reporting, is an indication of the complexity of the analysis. The intricate composition of clinical material, which may contain a high proportion of host cells and microorganisms from the natural flora, in contrast to the potentially low amount of pathogens, requires customized laboratory treatment of samples and isolation of nucleic acids [23]. Metagenomic sequencing has proven useful when using either short paired-end (PE) or long reads. Short PE reads provided by Illumina platforms have been shown to be suitable for the detection of clinical pathogens, etiologic agents of infectious diseases, and rare or emerging pathogens [24–26], but are limited by the need for large numbers of samples to achieve affordable costs and by a relatively long turnaround time. In contrast, Oxford Nanopore Technologies (ONT) enables rapid NGS library preparation and real-time sequencing and is therefore suitable for rapid detection of pathogens, which is critical in acute clinical conditions [6, 27–29], but is limited by the capacity and concentration requirements of nucleic acids.

In addition to these obstacles, metagenomic bioinformatics analysis presents its own challenges due to the large amount of data, selection of the right tools and databases, and appropriate use of available parameters, all of which affect the final result. Consequently, the selection of appropriate tools and databases for pathogen detection is crucial for reliable metagenomic analysis [30–33]. Nevertheless, when properly applied, bioinformatics contributes significantly to the outcome by facilitating the identification of known and novel pathogens, comparative analyses and functional annotations, and improving our understanding of microbial diversity, evolution and pathogenesis in different clinical and environmental contexts [3]. The great interest in the application of metagenomic sequencing in various research fields, including public health and diagnostics, has led to the development of numerous bioinformatics contig classification workflows such as DIAMOND + MEGAN and DAMIAN [33–37]. Despite this important contribution, the question remains whether we can trust a single method to reliably detect clinically relevant pathogens in the samples tested.

The read classification method allows us to obtain a more comprehensive picture of the microbial composition of the sample. However, accurate classification of short reads remains difficult as they may not provide enough information to distinguish between very similar genomes, leading to misclassification or ambiguity in the assignment of taxonomic labels [38]. The contig classification method provides longer genomic fragments, but carries the risk of generating chimeric contigs due to sample heterogeneity, and there will always be a proportion of reads that are not assembled [38, 39]. The reference mapping method allows us to look at the mapping statistics and obtain information about the coverage of genomic regions [40]. Its limitation is the detection of novel or poorly characterized microbial species, as the method requires a well-defined reference genome or database to function properly [41].

In this study, a three-step mNGS bioinformatics workflow suitable for the detection of microbial pathogens in various clinical samples (MetaAll) is presented and its applicability is investigated using real clinical metagenomics samples. It combines three methods: read classification, contig classification and reference genome mapping. The workflow was developed in order to improve the reliability of mNGS results and tested on different sequencing platforms with different clinical cases. The results demonstrate the importance of combining multiple bioinformatics methods for a more efficient and reliable clinical pathogen detection.

Methods

Design of the customized metagenomics workflow—MetaAll

For the analysis of the sequencing data, a customized workflow was developed using the Snakemake Workflow Management System (v7.22.0) and Singularity Containers (v3.9.0) [42, 43]. Figure 1 shows a brief illustration of the combined methods used to improve pathogen detection. For the detection of microbial pathogens, three approaches were combined in a three-step workflow: taxonomic classification of reads, taxonomic classification of contigs and read mapping for classification validation. The workflow contains a configuration file ‘config.yml’ in which the parameter values can be changed. More detailed descriptions of the individual parameters are contained in the same file. The statistical software R (version 4.3.1, R Foundation for Statistical Computing, Vienna, Austria) was used to visualize the results.

Figure 1. Workflow pipeline in the form of a diagram with the corresponding bioinformatics tools, organized according to the data of the various sequencing technologies.

Data pre-processing

Short PE reads were quality checked with FastQC (v0.11.9) [44] and MultiQC (v1.14) [45] and trimmed with BBDuk (v39.01) using the parameters ‘ktrim=r k=27 mink=5 qtrim=rl trimq=12 overwrite=t’ for k-mer and quality trimming and ‘minlen=60’ to remove reads shorter than 60 bp [46]. For host depletion, reads were mapped to the human genome assembly GRCh38 (hg38) using bowtie2 (v2.5.0) [47].
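The short-read pre-processing described above can be sketched as two command lines assembled in Python. This is a minimal illustration, not the workflow's own Snakemake rules: the file names, the adapter reference passed to `ref=`, the bowtie2 index name, the thread count and the use of `--un-conc-gz` to keep read pairs that do not align concordantly to the host are all assumptions. Only the BBDuk trimming parameters are taken from the text.

```python
import shlex


def bbduk_command(r1, r2, out1, out2, adapters):
    """Build a BBDuk call with the trimming parameters quoted in the text."""
    return [
        "bbduk.sh",
        f"in={r1}", f"in2={r2}",
        f"out={out1}", f"out2={out2}",
        f"ref={adapters}",      # assumption: adapter FASTA, not named in the text
        "ktrim=r", "k=27", "mink=5",
        "qtrim=rl", "trimq=12",
        "minlen=60",            # discard reads shorter than 60 bp
        "overwrite=t",
    ]


def bowtie2_host_depletion_command(index, r1, r2, unmapped, threads=8):
    """Build a bowtie2 call that writes non-host read pairs to `unmapped`."""
    return [
        "bowtie2",
        "-x", index,                 # assumption: prebuilt GRCh38 index prefix
        "-1", r1, "-2", r2,
        "--un-conc-gz", unmapped,    # '%' in the path becomes 1 or 2 per mate
        "-p", str(threads),
        "-S", "/dev/null",           # host alignments themselves are not kept
    ]


# Hypothetical file names, for illustration only.
trim = bbduk_command("s_R1.fastq.gz", "s_R2.fastq.gz",
                     "trim_R1.fastq.gz", "trim_R2.fastq.gz", "adapters.fa")
deplete = bowtie2_host_depletion_command("GRCh38", "trim_R1.fastq.gz",
                                         "trim_R2.fastq.gz",
                                         "nonhost_R%.fastq.gz")
print(shlex.join(trim))
print(shlex.join(deplete))
```

The commands are only printed here; running them would require BBTools and bowtie2 to be installed and the inputs to exist.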

For raw long reads, read quality was checked using NanoPlot (v1.41.0) and NanoComp (v1.20.0) [48]. Adapter sequences were removed with Porechop_ABI (v0.5.1) [49] using the default parameters. Reads that were too short or of poor quality were filtered out with NanoFilt (v2.8.0) using the parameters ‘-q 10 -l 100’ [48]. To lower the computational load, host removal (hg38) was performed using minimap2 (v2.24) [50].
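The long-read filter above (minimum average quality 10, minimum length 100) can be illustrated with a short sketch. It assumes that the average read quality is computed by averaging per-base error probabilities rather than the Phred scores themselves, which is a common convention for nanopore reads; the function names are hypothetical and the code is not NanoFilt's implementation.

```python
import math


def mean_phred(qualities):
    """Average Phred quality computed through per-base error probabilities."""
    if not qualities:
        return 0.0
    mean_error = sum(10 ** (-q / 10) for q in qualities) / len(qualities)
    return -10 * math.log10(mean_error)


def passes_filter(sequence, qualities, min_quality=10, min_length=100):
    """Keep a read only if it is long enough and its mean quality is sufficient."""
    return len(sequence) >= min_length and mean_phred(qualities) >= min_quality


# A read whose bases alternate between Q20 and Q0 has an arithmetic mean of
# Q10, but its error-probability-based mean is far lower, so it is rejected.
mixed = [20, 0] * 60
print(round(mean_phred(mixed), 2), passes_filter("A" * 120, mixed))
```

Averaging error probabilities penalizes reads with a few very poor stretches more strongly than a plain average of Phred scores would.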

Read classification method

For initial inspection of the filtered reads, KrakenUniq (v1.0.2) was applied to perform a taxonomic classification with the following parameters: ‘--preload-size 100G’ to load the database into memory in blocks of 100 GB, and ‘--check-names’ to ensure that each pair of reads has names that match each other. The implemented default database has an index size of 377 GB and was downloaded from https://benlangmead.github.io/aws-indexes/k2, dated 6/16/2022 [32]. KronaTools (v2.8.1) [51] and Pavian (v1.0) [52] were used to visualize and present the results. The same tools and databases were used for both short PE and long reads. Classification reliability was assessed according to the following criteria: clinical relevance of the pathogens detected, the number of classified reads and the number of unique k-mers. By taking the number of classified reads and the number of unique k-mers into account, KrakenUniq can often distinguish false-positive from true-positive matches. Following the results of the KrakenUniq developers, the species-level detection threshold was set at >15 classified reads and >1000 unique k-mers [32]. Classification results that did not meet these criteria were classified as false positives.
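The species-level detection rule can be written as a simple predicate. The sketch below is an illustration only: the function name and data layout are hypothetical, and the example counts are the short PE read values reported in the Results of this article.

```python
MIN_READS = 15           # threshold: more than 15 classified reads
MIN_UNIQUE_KMERS = 1000  # threshold: more than 1000 unique k-mers


def passes_detection_threshold(reads, unique_kmers):
    """Return True if a species-level hit meets both detection criteria."""
    return reads > MIN_READS and unique_kmers > MIN_UNIQUE_KMERS


# (sample, taxon, classified reads, unique k-mers) from the short PE results.
hits = [
    ("case 1 heart", "H. influenzae", 23, 79),
    ("case 2 blood", "K. pneumoniae", 4029, 345),
    ("case 3 NPs", "FluA", 4497, 3774),
    ("case 3 NPs", "SARS-CoV-2", 317, 11266),
]

for sample, taxon, reads, kmers in hits:
    verdict = "pass" if passes_detection_threshold(reads, kmers) else "below threshold"
    print(f"{sample}\t{taxon}\t{verdict}")
```

Requiring both criteria means that a taxon with many reads but few unique k-mers, as can happen when reads pile up on a short shared region, does not pass on read count alone.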

Contig classification method

The assembly of filtered short PE reads into contigs was performed with metaSPAdes (v3.15.4) [53]. After assembly, contigs shorter than 300 bp were removed with seqtk (v1.3, https://github.com/lh3/seqtk) using the minimum length setting ‘-L 300’. In order to achieve a more comprehensive classification, the method was carried out in three phases, each using a different approach. In the first classification phase, viralVerify (v1.1, https://github.com/ablab/viralVerify) was used to classify contigs as viral, non-viral, or uncertain using the hidden Markov model (HMM) database. In the second classification phase, contigs were classified with DIAMOND (v2.0.15) using the ‘diamond blastx’ command. The NCBI nr database [54] dated 5/3/2023 was used. MEGAN software (v6.22.2) with the MEGAN map database [33] and KronaTools (v2.8.1) were used to visualize the results [51]. The final classification phase consisted of uploading the contigs of interest to the MEGABLAST module of the BLASTn web tool [55]. To assess the quality of selected contigs, MetaQUAST (v5.2.0) [56] and samtools (v1.16.1) were used [40]. Filtered long reads were assembled into contigs with the Flye assembler (v2.9.1) [57] using the ‘meta’ flag, and contigs were subsequently polished with medaka (v1.7.2, https://github.com/nanoporetech/medaka). After assembly and polishing, contigs shorter than 300 bp were removed with seqtk (v1.3, https://github.com/lh3/seqtk) using the minimum length setting ‘-L 300’. The remaining contigs were subjected to the same classification steps as described for short PE reads.
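The contig length filter (contigs shorter than 300 bp are discarded) is simple enough to sketch directly. The code below is a pure-Python illustration of that one step with hypothetical function names; the workflow itself uses seqtk for this purpose.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_contigs(records, min_length=300):
    """Keep contigs whose length is at least `min_length` bases."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


# Two toy contigs: one just below and one exactly at the 300 bp cutoff.
fasta = [">contig_1", "A" * 299, ">contig_2", "ACGT" * 50, "ACGT" * 25]
kept = filter_contigs(read_fasta(fasta))
print([name for name, _ in kept])
```

Sequences split across several lines are joined before their length is measured, so the cutoff applies to the whole contig.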

Reference genome mapping method

To validate the presence of microbial pathogens based on the classification results, filtered reads were mapped to selected reference genomes. Short PE reads were mapped with bwa (v0.7.17) [58] and long reads with minimap2 (v2.24) [50]. Samtools (v1.16.1) was used to calculate the coverage percentage and depth in the downstream analysis. Reads with an average basecall quality (Phred score) < 10 and reads with an average mapping quality score < 10 were excluded [40]. The validity of classification results for each detected clinically relevant pathogen was visually assessed according to the distribution of the reads across the respective reference genome. A wider distribution of mapped reads across the reference genome indicated a more reliable detection than a narrower distribution.
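The coverage percentage and mean depth used for validation can be derived from per-position depths. The sketch below assumes tab-separated input in the three-column form produced by `samtools depth` (reference name, 1-based position, depth) for a single reference; the function name and returned keys are hypothetical, and the workflow's own calculation may differ in detail.

```python
def coverage_summary(depth_lines, reference_length):
    """Summarize breadth and depth of coverage for one reference sequence."""
    covered_bases = 0
    total_depth = 0
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        depth = int(fields[2])
        total_depth += depth
        if depth > 0:
            covered_bases += 1
    return {
        "covered_bases": covered_bases,
        "coverage_percent": 100 * covered_bases / reference_length,
        "mean_depth": total_depth / reference_length,
    }


# Toy example: a 5-base reference with three covered positions.
toy = ["ref\t1\t0", "ref\t2\t3", "ref\t3\t5", "ref\t4\t0", "ref\t5\t2"]
print(coverage_summary(toy, reference_length=5))
```

As a consistency check against Table 4, the 14,971 covered bases reported for SARS-CoV-2 (NC_045512.2, 29,903 bases) give 100 × 14,971 / 29,903 ≈ 50.07%, matching the reported coverage.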

Comparative analysis on CAMI clinical pathogen detection challenge dataset

To provide a comparative benchmark to real clinical samples we used the CAMI Clinical Pathogen Detection Challenge dataset [59]. The dataset consists of simulated metagenomic samples designed to mimic real clinical environments. It includes a variety of pathogen sequences and background microbiota, as well as the target causative pathogen Orthonairovirus haemorrhagiae (CCHFV), enabling robust evaluation of pathogen detection methods. We utilized both the ground truth data provided with the challenge and the synthetic metagenomic reads for our analysis.

Computational performance

The computational performance of the presented workflow was evaluated in terms of computational cost and runtime for both short PE reads and long reads. We tested hardware configurations ranging from the lower (8 threads, 16 GB RAM) to the upper (32 threads, 100 GB RAM) end of the spectrum of commonly available computing capabilities.

Clinical samples selection and preparation

For this study, 10 samples from four different cases with different clinical presentations were included. For each patient, a set of samples from different biological materials was available, including tissue biopsies of the heart and lungs, cerebrospinal fluid (CSF) and nasopharyngeal swabs (NPs) (Table 1). Total nucleic acids were isolated using the EZ1 Virus Mini Kit v2.0 (QIAGEN, Redwood City, United States) and EZ1 advanced XL (QIAGEN, Redwood City, United States). Sequence-Independent, Single-Primer-Amplification (SISPA) was used for viral enrichment [60] with primers Sol-PrimerA (5′-GTTTCCCAGTCACGATC-N9–3′) and Sol-PrimerB (5′-GTTCCCAGTCACGATC-3′) [61]. The concentration of double-stranded cDNA and originating native DNA was measured using the Qubit dsDNA High Sensitivity Assay Kit on Qubit 3.0 (Thermo Fisher Scientific).

Table 1. Clinical cases details regarding sample anatomical site origin, employed sequencing technology and conventional microbiological diagnostics results.

Clinical case | Biological sample | Sequencing technology | Conventional diagnostics results
case 1 | Nasopharyngeal swab; Heart - biopsy; Lungs - biopsy | Short PE reads; Long reads | rtRT-PCR: Rhino/Enterovirus (Ct 30.5); culture: Haemophilus influenzae +++ / Streptococcus pneumoniae ++; culture: H. influenzae ++ / Streptococcus pseudopneumoniae +; culture: H. influenzae ++ / S. pneumoniae ++
case 2 | Whole blood (EDTA); CSF | Short PE reads | No pathogen detected; No pathogen detected
case 3 | Nasopharyngeal swab | Short PE reads | rtRT-PCR: SARS-CoV-2 (Ct 19.8); rtRT-PCR: FluA – H3N2 (Ct 15.1)
case 4 | Nasopharyngeal swab | Short PE reads | rtRT-PCR: FluA – H3N2 (Ct 15.5)

Library preparation and sequencing

For cases 1 through 4, short PE read libraries were prepared using the Nextera XT library preparation kit (Illumina, San Diego, CA, United States) according to the manufacturer’s instructions. The Qubit dsDNA High Sensitivity Assay on a Qubit 3.0 (Thermo Fisher Scientific) was used to measure library concentration and the Agilent HS DNA kit on the Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, United States) was used to measure fragment size. Sequencing was performed on the NextSeq 550 (Illumina, San Diego, CA, United States) using the NextSeq 500/550 High Output Kit v2.0 (300 cycles) (Illumina, San Diego, CA, United States).

For case 1, long-read libraries were also prepared, according to the protocol of the Ligation Sequencing Kit for gDNA - Native Barcoding Kit 24 V14 (ONT, United Kingdom), aiming for a final concentration of 20 fmol. The concentration of libraries was measured with the Qubit dsDNA High Sensitivity Assay on a Qubit 3.0 (Thermo Fisher Scientific) and fragment length with the Agilent HS DNA Kit on the Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, United States). After checking for suitable pore activity, a MinION R10.4.1 Flow Cell (ONT, United Kingdom) with 1653 active pores was used. Sequencing was performed on a GridION instrument (ONT, United Kingdom) with MinKNOW software v22.12.5 (ONT, United Kingdom). Reads with a length of ≥200 bp and a quality score of ≥10 were retained.

Data and code availability

The sequencing data underlying this article will be shared on reasonable request to the corresponding author. Workflow and Singularity containers are available in the GitHub repository (https://github.com/NGS-bioinf/MetaAll).

Results

MetaAll analysis step 1—reads classification; maximized sensitivity

After collecting metagenomics data (Table 2), only the clinically relevant pathogens were subjected to detailed analysis. In case 1, most classified short PE reads were observed in the NPs sample, in which 1599 bacterial, 20 archaeal and 29 viral species were detected. A similar result was obtained from the lung sample, in which 1590 bacterial, 11 archaeal and 11 viral species were detected. In the heart sample, 724 bacterial, 20 archaeal and 29 viral species were detected. In case 2, 542 bacterial, 18 archaeal and 13 viral species were detected in the blood sample and 540 bacterial, nine archaeal and 20 viral species in the CSF sample. In case 3, 432 bacterial, 12 archaeal and eight viral species were observed and in case 4, 328 bacterial, seven archaeal and nine viral species were observed.

Table 2. Basic classification results for short PE reads (Sr) and long reads (Lr). The number of detected bacteria, archaea and viral species with absolute numbers and percentage of classified reads is shown.

Each cell is given as Sr / Lr. The first three columns (microbial detection) are numbers of species; the remaining columns (read classification) are numbers of reads with the percentage [%] in parentheses.

Sample | Bacteria species | Archaea species | Viral species | Classified reads | Host reads | Bacterial reads | Archaea reads | Viral reads
case 1–heart | 724 / 247 | 20 / 0 | 29 / 0 | 11,072,591 (98.9) / 33,894 (95.9) | 5,679,579 (84.92) / 8750 (35.98) | 678,862 (10.67) / 21,834 (71.39) | 4712 (0.07) / 0 | 105 (<0.01) / 0
case 1–lungs | 1590 / 457 | 11 / 0 | 11 / 0 | 10,186,650 (99.4) / 292,215 (99.9) | 3,002,837 (49.06) / 6909 (3.74) | 2,929,635 (49.36) / 272,178 (97.52) | 2759 (0.05) / 0 | 21 (<0.01) / 0
case 1–NPs | 1599 / 455 | 7 / 0 | 15 / 0 | 17,529,037 (99.5) / 1,531,674 (99.99) | 2,977,326 (37.26) / 232 (0.03) | 13,150,209 (81.54) / 1,469,310 (99.98) | 43 (<0.01) / 0 | 29 (<0.01) / 0
case 2–blood | 542 / NA | 18 / NA | 13 / NA | 6,989,891 (99.2) / NA | 2,150,375 (81.8) / NA | 287,604 (11.77) / NA | 5361 (0.22) / NA | 54 (<0.01) / NA
case 2–csf | 540 / NA | 9 / NA | 20 / NA | 4,767,841 (98.2) / NA | 2,207,612 (72.13) / NA | 163,288 (6.88) / NA | 1007 (0.04) / NA | 37 (<0.01) / NA
case 3–NPs | 432 / NA | 12 / NA | 8 / NA | 6,000,621 (98.9) / NA | 290,247 (10.79) / NA | 248,129 (45.34) / NA | 4074 (0.74) / NA | 4827 (0.88) / NA
case 4–NPs | 328 / NA | 7 / NA | 9 / NA | 4,622,828 (98.3) / NA | 223,429 (6.17) / NA | 116,992 (28.35) / NA | 987 (0.24) / NA | 71,301 (17.28) / NA

In more detail, Haemophilus influenzae, Klebsiella pneumoniae and Streptococcus pneumoniae were detected in all samples from case 1 (Fig. 2). For H. influenzae, 23 reads (0.004% of all classified reads) and 79 unique k-mers were classified in the heart sample, 699 763 reads (71.25%) and 151 613 unique k-mers in the lung sample and 1 284 340 reads (34.72%) and 640 853 unique k-mers in the NPs sample. For K. pneumoniae, 8819 reads (1.4%) and 502 unique k-mers were detected in the heart sample, 5040 reads (0.51%) and 3095 unique k-mers in the lung sample and 10 410 reads (0.28%) and 4289 unique k-mers in the NPs sample. S. pneumoniae was classified with 74 reads (0.01%) and 156 unique k-mers from the heart sample, 4303 reads (0.44%) and 738 unique k-mers from the lung sample and 352 reads (0.01%) and 318 unique k-mers from the NPs sample. No clinically relevant viral pathogens were detected. In case 2, K. pneumoniae was found in both samples: 4029 reads (1.54%) and 345 unique k-mers were classified in the blood sample and 3031 reads (2.1%) and 901 unique k-mers in the CSF. No viral pathogens associated with meningoencephalitis were detected. In case 3, 4497 reads (1.94%) and 3774 unique k-mers were classified as FluA, of which 4187 reads (86.74%) and 2860 unique k-mers were assigned to subtype H3N2. SARS-CoV-2 was detected with 317 reads (0.14%) and 11 266 unique k-mers. In case 4, 71 286 reads (39.41%) and 5023 unique k-mers were classified as FluA, of which 66 471 reads (93.23%) and 3456 unique k-mers were assigned to subtype H3N2.

Figure 2. Ratio between the number of classified reads and the unique k-mers for each case and read length (Sr – Short reads, Lr – Long reads): (A) presents the classification results from case 1, (B) presents the classification results from case 2, (C) presents the classification results from case 3 and (D) presents the classification results from case 4. (A–C) additional close-ups are provided as a secondary panel in areas with high density of data points (marked as grey area). The proportion of classified reads (%) is presented as a spot diameter. Detected species that do not reach the minimum detection limit are marked as false positives. Recognized clinically relevant pathogens of interest contain descriptive notations.

Using long reads, a total of 455 bacterial species were found in the NPs sample, 457 in the lung sample and 247 in the heart sample of case 1 (Table 2). Archaeal and viral species were not observed. As with short PE reads, H. influenzae, K. pneumoniae and S. pneumoniae were detected in all samples from case 1. For H. influenzae, 3388 reads (27.16%) and 1886 unique k-mers were classified in the heart sample, 138 982 reads (84.22%) and 39 785 unique k-mers in the lung sample and 271 937 reads (36.58%) and 7648 unique k-mers in the NPs sample. Nine reads (0.07%) and 1564 unique k-mers were classified as K. pneumoniae in the heart sample, 84 reads (0.05%) and 3343 unique k-mers in the lung sample and 559 reads (0.08%) and 5432 unique k-mers in the NPs sample. For S. pneumoniae, 71 reads (0.6%) and 301 unique k-mers were observed in the heart sample, 1308 reads (0.8%) and 738 unique k-mers in the lung sample and 92 reads (0.01%) and 200 unique k-mers in the NPs sample. The presence of pathogenic viruses was not observed.

MetaAll analysis step 2—contigs classification; maximized specificity

For de novo contig classification (Table 3), in case 1, 14 830 contigs assembled from short PE reads (49.2% of all contigs) were classified from the heart sample, of which 3779 (25.48%) were microbial contigs: 3717 (25.06%) bacterial, 47 (0.32%) archaeal and 15 (0.1%) viral. In the lung sample, 2252 (15.38%) microbial contigs were observed out of 14 645 (35.2%) classified contigs: 2188 (14.94%) bacterial, 41 (0.28%) archaeal and 23 (0.16%) viral. From the NPs sample, we obtained 13 179 (32.4%) classified contigs and 4373 (33.18%) microbial contigs: 4280 (32.48%) bacterial, 47 (0.36%) archaeal and 46 (0.35%) viral. In case 2, 5270 (45.5%) contigs were classified from the blood sample. Here, 931 (17.67%) microbial contigs were obtained: 909 (17.25%) bacterial, 15 (0.29%) archaeal and seven (0.13%) viral. From the CSF, 6337 (44.3%) contigs were obtained, including 1008 (15.91%) microbial contigs: 977 (15.42%) bacterial, 15 (0.24%) archaeal and 16 (0.25%) viral. In case 3, 755 (81.8%) contigs were obtained, of which 37 (4.9%) were microbial contigs: 25 (3.31%) bacterial and 12 (1.59%) viral. In case 4, out of 927 (73.2%) classified contigs, 47 (5.07%) microbial contigs were found: 34 (3.67%) bacterial and 13 (1.4%) viral.

Table 3. Basic classification results for de novo assembled contigs from short PE reads (Sr) and long reads (Lr). The number of detected bacteria, archaea and viral species with absolute numbers and percentage of classified reads is shown.

Each cell is given as Sr / Lr, as the number of contigs with the percentage in parentheses.

Sample | Number of contigs passed filtering | Classified contigs | Host contigs | Bacteria contigs | Archaea contigs | Viral contigs
case 1–heart | 14,830 (49.24) / 2 (100) | 12,387 (83.53) / 1 (50) | 1644 (13.27) / 0 | 2724 (21.99) / 1 (100) | 47 (0.38) / 0 | 15 (0.12) / 0
case 1–lungs | 14,645 (35.25) / 3 (100) | 12,393 (84.62) / 2 (66.67) | 2046 (16.51) / 0 | 1893 (15.28) / 2 (100) | 41 (0.33) / 0 | 23 (0.19) / 0
case 1–NPs | 13,179 (32.36) / 7 (100) | 11,481 (87.12) / 6 (85.71) | 1350 (11.76) / 0 | 3576 (31.15) / 6 (100) | 47 (0.41) / 0 | 46 (0.4) / 0
case 2–blood | 5270 (45.47) / NA | 4504 (85.47) / NA | 580 (12.88) / NA | 757 (16.81) / NA | 15 (0.33) / NA | 7 (0.16) / NA
case 2–csf | 6337 (44.29) / NA | 5387 (85.01) / NA | 791 (14.68) / NA | 783 (14.54) / NA | 15 (0.28) / NA | 16 (0.3) / NA
case 3–NPs | 755 (81.8) / NA | 617 (81.72) / NA | 75 (12.16) / NA | 21 (3.4) / NA | 0 / NA | 12 (1.95) / NA
case 4–NPs | 927 (73.17) / NA | 749 (80.8) / NA | 114 (15.22) / NA | 26 (3.47) / NA | 0 / NA | 13 (1.74) / NA

Microbiological identification results for case 1 revealed H. influenzae, K. pneumoniae and S. pneumoniae contigs in all samples. For H. influenzae, four contigs were obtained from the heart sample, 60 from the lung sample and 217 from the NPs. K. pneumoniae was detected with 58 contigs in the heart sample, 20 in the lung sample and 32 in the NPs. For S. pneumoniae, 43 contigs were classified in the heart sample, 19 in the lung sample and 22 in the NPs sample. Two contigs in the NPs sample were identified as Human rhinovirus C54. In case 2, K. pneumoniae was detected in both samples. Fifteen contigs were obtained from the blood sample and 10 from the CSF. The presence of other potential pathogens of atypical meningoencephalitis was not detected. In case 3, seven contigs were classified as FluA and four as SARS-CoV-2. In case 4, seven contigs of FluA were also found.

For case 1, two microbial contigs assembled from long reads were observed in the heart sample, one of which was classified as H. influenzae. The same situation was observed in the lung sample (Table 3). In the NPs sample, six contigs were identified as bacterial, one of which was classified as H. influenzae (Fig. 3).

Figure 3. Quality assessment of the classified contigs based on the ratio between the largest continuous alignment and the genome fraction covered (%) for each presented case: (A) case 1, (B) case 2, (C) case 3, and (D) case 4. The largest alignment length, shown on the y-axis, is the longest stretch of an assembled contig aligned without misassembly, according to the MetaQUAST results. (A–C) clinically relevant pathogens are distinguished from each other by graphic characteristics, while other non-pathogenic microorganisms are categorized under ‘other’.

MetaAll analysis step 3—reference genome mapping; validation of read and contig classification results

In case 1, using the short PE reads, we obtained the highest genome coverage for H. influenzae from the NPs sample (~89%) (Fig. 4). For the lung sample, despite lower genome coverage (~20%), we found a good distribution of reads over the entire genome length. For K. pneumoniae and S. pneumoniae, we obtained less than 3% genome coverage in all samples and a poor distribution of reads over the entire genome length. For human rhinovirus C54, we observed ~13% genome coverage in the NPs sample (Fig. 4). In case 2, we observed <0.5% genome coverage with a mean depth of <1 for K. pneumoniae in both samples. In case 3 (Fig. 4), we obtained reliable mapping statistics for both SARS-CoV-2 (50% genome coverage) and FluA H3N2 (≥92% coverage in all segments). In case 4 (Fig. 4), we acquired good mapping results for FluA H3N2 (≥98% coverage in all segments except segment 8: 89%). Detailed information on all mapped pathogens is presented in Table 4.

Figure 4. The results of the reference mapping of confirmed clinically relevant pathogens: (A) H. influenzae, (B) human rhinovirus C54, (C) SARS-CoV-2 and (D) FluA subtype H3N2. For each clinically relevant pathogen, the nucleotide positions are shown on the x-axis, while the y-axis shows the sequence depth for each nucleotide position.

Table 4.

Additional validation of the detected clinically relevant pathogens with read mapping. The following parameters were specified for the mapping results: number of reads, coverage bases, percent coverage, average depth of coverage, average quality of bases in the covered region and average quality of mapped reads.

Sample Pathogen
(NCBI acc. number)a
Number of reads Coverage bases Coverage [%] Mean depth Mean base quality Mean map quality
Sr Lr Sr Lr Sr Lr Sr Lr Sr Lr Sr Lr
case 1–heart H. influenzae
(NZ_CP007470.1)
8734 2021 17,549 11,957 0.95 0.65 <1 <1 30.8 30.1 42.2 42.0
K. pneumoniae
(NC_016845.1)
8290 17 4981 5832 0.09 0.11 <1 <1 29.6 30.9 40.2 17.8
S. pneumoniae
(NZ_CP020549.1)
46,172 9601 11,411 5085 0.53 0.24 1 1 32.9 30.8 51.9 46.6
case 1–lungs H. influenzae
(NZ_CP007470.1)
297,332 108,340 377,615 118,001 20.45 6.39 20 29 33.8 30.1 45.8 43.7
K. pneumoniae
(NC_016845.1)
12,246 253 23,236 18,755 0.44 0.35 <1 <1 32.3 29.5 40.3 21.8
S. pneumoniae
(NZ_CP020549.1)
374,534 105,593 30,399 7869 1.41 0.37 10 22 33.2 30.6 55.3 39.5
case 1–NPs H. influenzae
(NZ_CP007470.1)
431,551 56,027 1,635,109 43,180 88.56 2.34 28 13 33.3 30.6 49.5 40.0
K. pneumoniae
(NC_016845.1)
35,973 729 63,938 21,003 1.20 0.39 <1 <1 32.6 29.7 42.3 17.7
S. pneumoniae
(NZ_CP020549.1)
926,087 555,589 56,650 5014 2.64 0.23 28 110 33.0 30.8 57.0 41.7
Human rhinovirus C54
(PP187408.1)
26 0 936 0 13.27 0 <1 0 33.6 0 60.0 0
case 2–blood K. pneumoniae
(NC_016845.1)
4 NA 226 NA <0.01 NA <1 NA 24.9 NA 57.5 NA
case 2–csf K. pneumoniae
(NC_016845.1)
14,264 NA 14,325 NA 0.27 NA <1 NA 30.7 NA 42.4 NA
case 3–NPs SARS-CoV-2
(NC_045512.2)
731 NA 14,971 NA 50.07 NA 2 NA 32.6 NA 58.9 NA
FluA H3N2 segment 1
(NC_007373.1)
2200 NA 2296 NA 98.08 NA 118 NA 31.8 NA 60.0 NA
FluA H3N2 segment 2
(NC_007372.1)
1461 NA 2190 NA 93.55 NA 79 NA 31.8 NA 60.0 NA
FluA H3N2 segment 3
(NC_007371.1)
936 NA 2153 NA 96.42 NA 51 NA 32.0 NA 59.8 NA
FluA H3N2 segment 4
(NC_007366.1)
892 NA 1751 NA 99.38 NA 63 NA 31.1 NA 59.9 NA
FluA H3N2 segment 5
(NC_007369.1)
1855 NA 1535 NA 98.02 NA 148 NA 31.8 NA 60.0 NA
FluA H3N2 segment 6
(NC_007368.1)
1173 NA 1428 NA 97.34 NA 95 NA 31.7 NA 59.9 NA
FluA H3N2 segment 7
(NC_007367.1)
742 NA 982 NA 95.62 NA 91 NA 32.4 NA 60.0 NA
FluA H3N2 segment 8
(NC_007370.1)
68 NA 821 NA 92.25 NA 9 NA 32.3 NA 60.0 NA
case 4–NPs FluA H3N2 segment 1
(NC_007373.1)
32,983 NA 2331 NA 99.57 NA 1812 NA 31.7 NA 59.9 NA
FluA H3N2 segment 2
(NC_007372.1)
19,011 NA 2338 NA 99.87 NA 1036 NA 31.8 NA 59.9 NA
FluA H3N2 segment 3
(NC_007371.1)
12,974 NA 2218 NA 99.33 NA 748 NA 31.9 NA 60 NA
FluA H3N2 segment 4
(NC_007366.1)
9772 NA 1746 NA 99.09 NA 695 NA 31.4 NA 59.8 NA
FluA H3N2 segment 5
(NC_007369.1)
14,963 NA 1534 NA 97.96 NA 1225 NA 32 NA 59.9 NA
FluA H3N2 segment 6
(NC_007368.1)
17,380 NA 1451 NA 98.91 NA 1462 NA 31.6 NA 59.9 NA
FluA H3N2 segment 7
(NC_007367.1)
14,265 NA 1020 NA 99.32 NA 1784 NA 32.7 NA 60 NA
FluA H3N2 segment 8
(NC_007370.1)
582 NA 794 NA 89.21 NA 85 NA 32.3 NA 59.9 NA

a chosen and downloaded from NCBI RefSeq database based on classification results

For H. influenzae in case 1, long reads gave a lower genome coverage (~6% in the lung sample), but the reads were well distributed over the entire genome length (Fig. 4). For K. pneumoniae and S. pneumoniae, long reads gave poor mapping statistics, with <0.5% genome coverage and a poor distribution of reads over the entire genome length in all samples (Table 4).

CAMI clinical pathogen detection challenge dataset

Using the MetaAll approach, we were able to detect the causative pathogen CCHFV using read classification and reference mapping methods. The read classification method classified 22 reads and 296 unique k-mers as CCHFV. The method of mapping reads to the reference genome was positive for all three segments (Table 5). The contig classification method was negative. In addition to the causative pathogen, human immunodeficiency virus 1 (HIV-1) was also detected using all three methods: read classification (three reads and 63 unique k-mers), contig classification (402 bp long contig) and read mapping (Table 5).

Table 5.

Additional validation of the detected causative pathogen from CAMI dataset with read mapping. The following parameters were specified for the mapping results: number of reads, coverage bases, percent coverage, average depth of coverage, average quality of bases in the covered region and average quality of mapped reads.

| Pathogen (NCBI acc. number)a | Number of reads | Coverage bases | Coverage [%] | Mean depth | Mean base quality | Mean map quality |
|---|---|---|---|---|---|---|
| CCHFV segment L (NC_005301.3) | 76 | 2369 | 19.56 | <1 | 34.3 | 55.6 |
| CCHFV segment M (NC_005300.2) | 33 | 1031 | 19.21 | <1 | 33.9 | 57.9 |
| CCHFV segment S (NC_005302.1) | 16 | 305 | 18.24 | <1 | 32.7 | 59.6 |
| HIV-1 (NC_001802.1) | 50 | 1011 | 11.01 | <1 | 33.8 | 57.8 |

a chosen and downloaded from NCBI RefSeq database based on classification results
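The coverage and depth columns in Tables 4 and 5 can be related to per-position read depth. The sketch below is a minimal illustration in Python, not the workflow's own code. It assumes a three-column, tab-separated depth listing (reference, position, depth) with one line per reference position, similar in shape to the output of `samtools depth -a`. The function name `coverage_summary` and the toy input are hypothetical.

```python
from collections import defaultdict


def coverage_summary(depth_lines, min_depth=1):
    """Summarise covered bases, percent coverage and mean depth per reference.

    depth_lines: iterable of 'reference<TAB>position<TAB>depth' strings,
    one line per reference position (zero-depth positions included).
    """
    total = defaultdict(int)      # positions seen per reference
    covered = defaultdict(int)    # positions with depth >= min_depth
    depth_sum = defaultdict(int)  # summed depth per reference

    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        ref, _pos, depth = line.split("\t")
        depth = int(depth)
        total[ref] += 1
        depth_sum[ref] += depth
        if depth >= min_depth:
            covered[ref] += 1

    return {
        ref: {
            "covered_bases": covered[ref],
            "coverage_pct": 100.0 * covered[ref] / total[ref],
            "mean_depth": depth_sum[ref] / total[ref],
        }
        for ref in total
    }


# Toy input (made up): a 10-base reference with 4 covered positions.
toy_depths = [0, 0, 2, 3, 1, 0, 0, 0, 5, 0]
toy = [f"toy_ref\t{pos}\t{d}" for pos, d in enumerate(toy_depths, start=1)]
summary = coverage_summary(toy)
print(summary["toy_ref"])
```

As a cross-check against Table 5, 305 covered bases at 18.24% coverage imply a reference of about 1672 bases for CCHFV segment S. A mean depth below 1 simply means that fewer bases were sequenced than the reference is long.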

The average time and the range between the minimum and maximum time per sample were calculated for each setting used (Table 6). Seven samples containing short PE read data with read counts of 15–42 M (1.4–3.9 GB) and three samples containing long-read data with read counts of 2–5 M (1.8–3.9 GB) were used. For short PE reads, an increase in the hardware settings reduces the analysis time, whereas for long reads it has no drastic effect.

Table 6.

Evaluation of the clinical application of the MetaAll workflow using different hardware settings. For each setting, the average time to analyse the samples and the range between the minimum and maximum duration are given.

| Hardware configuration (number of threads / memory in GB) | Short PE reads, average time per sample (min time / max time) (HH:MM) | Long reads, average time per sample (min time / max time) (HH:MM) |
|---|---|---|
| 8 / 16 | 05:35 (02:41 / 09:28) | 02:43 (02:35 / 03:45) |
| 16 / 32 | 03:59 (02:12 / 06:35) | 02:30 (02:09 / 03:40) |
| 32 / 64 | 02:59 (01:33 / 05:04) | 02:48 (02:19 / 03:40) |
| 32 / 100 | 02:59 (01:26 / 04:15) | 02:34 (02:19 / 03:39) |

Discussion

In this study, we evaluated a novel three-step workflow (MetaAll) for the detection of DNA/RNA of potential pathogens in metagenomic samples, which was developed under the premise that a single method is not sufficient for accurate pathogen identification. Our results show that while many qualitative parameters need to be considered, certain parameters, such as genome coverage, read depth, unique k-mers and the number of classified and mapped reads, are crucial for analysis and should be considered in combination rather than individually. Several studies have already addressed the problem of setting thresholds for the number of detected reads and k-mers. However, no strategy has been described that can sufficiently remove false-positive detections while retaining true-positive detections with low abundance [32, 62–64]. The number of classified reads is influenced by a variety of factors, including the biological background of the sample, wet-lab procedures, sequencing technology, choice of reference database, classification tools and bioinformatic procedures. This suggests that thresholds should be used as a heuristic and not as an immutable boundary that perfectly separates positive and negative results. This can be seen in case 2, where Klebsiella pneumoniae does not exceed the defined threshold, and in the CAMI dataset, where neither CCHFV nor HIV-1 does. Nonetheless, some heuristic thresholds are useful, such as the one presented for KrakenUniq. As reported in that study, when detecting pathogens in human patients, a read count threshold of 10 and a unique k-mer count threshold of 1000 eliminated many background identifications while retaining all true positives, which were detected from as few as 15 reads [32]. This emphasizes the need for the results of metagenomic studies to be evaluated jointly by a consultant clinician, a microbiologist and a bioinformatician, rather than relying on thresholds alone.
The MetaAll approach aims to fulfill this idea by combining different tools to elucidate a broad range of sequencing data features and qualitatively determine the potential presence or absence of clinically important pathogens, rather than establishing a fixed pivot point without context. The effectiveness of the workflow was validated in four different cases, including the detection of potentially life-threatening, systemic pathogens (case 1), atypical pathogens causing meningoencephalitis (case 2), and potential co-infections and genotyping of viral RNA (case 3 and case 4). Our approach differs from previous studies [39, 41, 65–67] by providing a comprehensive solution that can be adapted to different sequencing read types, as evidenced by its application to real-world clinical metagenomics data. In three of the four cases, we successfully detected the presence of DNA/RNA pathogens using the combination of the proposed approaches. In case 1, we successfully detected H. influenzae, K. pneumoniae and S. pneumoniae by read classification and contig assembly from short PE reads. Target reference mapping also confirmed the presence of H. influenzae and Human rhinovirus C54 in an NP sample using short PE reads, while K. pneumoniae and S. pneumoniae were effectively excluded. By combining different methods in a three-step workflow, reliable detection of H. influenzae and Human rhinovirus C54 was finally achieved, with this result agreeing with the results of conventional microbiological diagnostics. Due to the low number of classified reads and the low genome coverage, K. pneumoniae and S. pneumoniae were excluded as causative pathogens [68–71]. In case 3, we were able to reliably detect co-infection with SARS-CoV-2 and FluA, demonstrating the value of mNGS as a non-targeted method for identifying various co-infections [72, 73]. 
Genotyping of the FluA reads, in which 86.74% of the reads were classified as the H3N2 genotype, and the high coverage achieved in reference genome mapping confirmed the presence of the H3N2 subtype. When mapping to the reference genome, a coverage of ≥92% was found for all eight segments, confirming the H3N2 subtype. Good results were also obtained in case 4, where 93.23% of the reads could be mapped to the H3N2 genotype. The subtype was confirmed using the reference mapping method, with a coverage of ≥98% in all segments except segment 8 (89%). The use of mNGS for the comprehensive analysis of pathogens has already proven useful, not only for the genotyping of FluA, but also for other pathogens [74]. Overall, reliable results were obtained in the classification of reads for both RNA viruses, even for SARS-CoV-2, where up to 11,266 unique k-mers were identified from 317 short PE reads. These results are consistent with previous studies that emphasize the utility of mNGS for direct pathogen detection and co-infection identification [75, 76].
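The read-count and unique k-mer heuristic discussed above can be sketched in a few lines of Python. This is an illustration only: the default cut-offs are the KrakenUniq values quoted above (10 reads, 1000 unique k-mers), which may differ from the threshold defined for MetaAll. The `Hit` class and `flag_hit` function are hypothetical names, and the read and k-mer counts are those reported in this article for the CAMI dataset and for SARS-CoV-2. The function returns a label for expert review, not a positive or negative call.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    taxon: str
    reads: int
    unique_kmers: int


def flag_hit(hit, min_reads=10, min_kmers=1000):
    """Return a review label for a classification hit (a heuristic, not a verdict)."""
    passes_reads = hit.reads >= min_reads
    passes_kmers = hit.unique_kmers >= min_kmers
    if passes_reads and passes_kmers:
        return "above both thresholds"
    if passes_reads or passes_kmers:
        return "above one threshold"
    return "below both thresholds"


# Counts taken from the Results and Discussion of this article.
hits = [
    Hit("CCHFV (CAMI dataset)", reads=22, unique_kmers=296),
    Hit("HIV-1 (CAMI dataset)", reads=3, unique_kmers=63),
    Hit("SARS-CoV-2 (short PE reads)", reads=317, unique_kmers=11266),
]
for hit in hits:
    print(f"{hit.taxon}: {flag_hit(hit)}")
```

With these cut-offs the causative CCHFV passes only the read-count threshold and HIV-1 passes neither, although both were supported by reference mapping. This is why such labels are better treated as a prompt for further validation than as a filter.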

The study also addressed how reliably taxonomic classification results can be interpreted when a microorganism that could cause the observed symptoms is represented by only a single read. This strategy aimed to minimize the risk of detection failure while accounting for the increased likelihood of false positives. Therefore, the read classification step using the KrakenUniq classifier was left at the default settings, which also take into account hits with only one read [32]. This method can be particularly advantageous for long reads, as a single long read can provide a large number of unique k-mers, which increases the significance of the results. For example, an apparently good classification result for K. pneumoniae was obtained in case 1, where 1564 unique k-mers were observed from only nine long reads. Nevertheless, K. pneumoniae in case 1 proved to be a false-positive hit, which highlights the challenge for metagenomic profilers to distinguish true positives from a high rate of false identifications, which in some cases can account for more than 90% of the total number of species identified at read level [77]. Similar results were obtained for S. pneumoniae. Such cases highlight the complexity of mNGS results and the need for careful interpretation, especially when the analysis can be affected by transient microbial DNA from the origin of the sample [78, 79].

At the contig level, we classified contigs in three phases. First, viralVerify, which uses HMMs, quickly splits contigs into viral, non-viral or uncertain categories at low computational cost and with intuitive output [80]. The viralVerify output then informs and extends the DIAMOND classification, which uses the BLASTX command [33], and finally the BLASTn MEGABLAST module is applied [55]. This order mitigates the bias of a single computational model and provides quick and comprehensive insight into assembled contigs. Significantly more information was obtained by sequencing with short PE reads. In addition, the high accuracy of Illumina short sequences improves the accuracy of metagenome-assembled genomes (MAGs). However, a limitation of the applicability of Illumina short PE reads may be the difficult assembly of exogenous elements, which leads to highly fragmented MAGs and high instrumental costs [37]. In this study, we observed numerous short contigs in all cases. Possibly as a consequence, the classification rate was low, with less than 50% of assembled contigs being classified, except in case 3 (81.8%) and case 4 (73.2%). The choice of contig filtering length is a balancing act between taxonomic resolution, computational efficiency, contig coverage and quality. The problem of short contig lengths in real metagenomic data has also been highlighted in previous studies [81, 82], but for metagenomic studies where the aim is to identify a broad range of organisms, shorter contigs may prove useful. In our study, we opted for a minimum filter length equal to the combined length of the R1 and R2 reads (300 bp), as we obtained 2 × 150 bp reads from the sequencer. 
Shorter lengths may result in a greater number of unclassified contigs, while higher length thresholds pose an increased risk of not detecting pathogens, especially RNA viruses. The potential data loss caused by setting higher contig lengths is compensated by the read classification step, where all reads remaining after the preprocessing step are classified regardless of their length. Nevertheless, the MetaAll workflow allows the user to set the minimum contig length to meet the requirements of each individual experiment. The value of retaining short contigs was demonstrated in case 1 and case 3 using the MEGABLAST tool. In case 1, two contigs were found for Human rhinovirus C54, with a 554 bp long contig having a bit score of 843, an E-value of 0.0 and an identity of 98.94%, while a 339 bp long contig had a bit score of 593, an E-value of 8e-165 and an identity of 98.23%. In case 3, reliable hits for SARS-CoV-2 were obtained from two contigs, each shorter than 400 bp. A 376 bp long contig had a bit score of 2832, an E-value of 0.0 and 99.61% identity, while a 320 bp long contig had a bit score of 592, an E-value of 3e-164 and 100% identity. In this study, we also used case 1 to investigate the potential of ONT long-read sequencing technology. Despite the small amount of data, we were able to confirm the presence of H. influenzae by classifying contigs assembled from long reads. Recently, the ONT community platform introduced the use of Flow Cell Light Shields to improve yield, especially for shorter reads obtained when sequencing on a MinION R10.4.1 Flow Cell. Light shields were not yet available for our experiment, so the question remains to what extent the amount of data can be increased to obtain more contigs and thus improve the contig classification method.
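The effect of the minimum contig length can be illustrated with a small FASTA filter. The Python sketch below is not the MetaAll implementation, and how the workflow exposes this setting is not shown here. The function names, the `min_length` parameter and the toy contigs are hypothetical, with two toy lengths chosen to match the 554 bp and 339 bp rhinovirus contigs mentioned above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records, min_length=300):
    """Keep contigs whose length is at least min_length bases."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


# Toy assembly: names and sequences are made up for illustration.
toy_fasta = [
    ">contig_1", "A" * 554,
    ">contig_2", "C" * 339,
    ">contig_3", "G" * 120,
]
kept_300 = filter_contigs(read_fasta(toy_fasta), min_length=300)
kept_400 = filter_contigs(read_fasta(toy_fasta), min_length=400)
print([(name, len(seq)) for name, seq in kept_300])
print([(name, len(seq)) for name, seq in kept_400])
```

Raising the cut-off from 300 bp to 400 bp drops the 339 bp contig in this toy example, which mirrors the risk described above of losing short but informative viral contigs.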

Finally, our workflow utilized target reference mapping to confirm or refute the classification results and confirm co-infections and genotypes in different cases. With this step of the proposed approach, we were able to reliably confirm co-infection with H. influenzae and Human rhinovirus C54 in case 1. In case 2, the presence of K. pneumoniae was refuted due to the low mapping statistics. In both case 3 and case 4, the FluA subtype H3N2 was successfully genotyped using the reference mapping approach. In our workflow, all reads were used for target reference mapping, as this preserves reads that might otherwise be discarded due to incorrect taxonomic classification. A possible alternative approach to pathogen detection would be to map only classified reads [35] or to map reads to comprehensive viral datasets [41]. However, such an approach emphasizes the importance of considering all analysis components when interpreting the results.

The workflow presented here consists of well-known detection tools that have been tested in previous studies [30, 32, 33]. In addition, its applicability was tested using the CAMI dataset from the Clinical pathogen detection challenge and real clinical samples in which the causative pathogens were known in all but one case. The results illustrate the idea that metagenomic analysis and interpretation must go beyond the relatively narrow scope of established tools and take into account the holistic information obtained from individual tools. KrakenUniq extends the utility of Kraken by providing a unique k-mer count using the HyperLogLog algorithm and is the only metagenomics classifier that provides k-mer coverage information [32]. This information can be invaluable in determining the quality of the classification, especially in situations where the presence of low-abundance pathogens needs to be distinguished from false positives. With other tools such as Centrifuge or Kraken2, we get information on the number of reads, but no information on the distribution of reads across the genome [83, 84]. While MetaPhlAn is a very accurate tool for profiling the composition of microbial communities (bacteria, archaea and eukaryotes) from metagenomic sequencing data, it still has difficulties in classifying viruses [85]. In this context, MetaAll represents an approach that maximizes sensitivity (in the clinical cases presented and the CAMI dataset, MetaAll had a sensitivity of 100%) while attempting to retain specificity through method triangulation, qualitative assessment and expert input. Sensitivity and specificity in metagenomic studies can be influenced by several factors, including sample type, amount of host DNA, sequencing platform, number of reads generated, reference database chosen and data analysis tools. In addition, cost and limited expertise can also be a challenge. 
MetaAll’s approach provides a reliable solution that effectively addresses these issues.
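Why a unique k-mer count carries information that a read count does not can be shown with a toy example. The Python sketch below counts distinct k-mers exactly with a set. It is not KrakenUniq's implementation, which estimates the count with HyperLogLog and matches k-mers against a reference database. The function name, the random toy genome and the choice of k = 8 are assumptions made for illustration.

```python
import random


def unique_kmers(reads, k):
    """Return the set of distinct k-mers found across all reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


# Toy genome of 200 random bases (seeded so the example is repeatable).
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(200))

# Five reads stacked on the same 40-base locus versus five reads spread
# along the genome. Both sets contain the same number of reads.
stacked = [genome[0:40]] * 5
spread = [genome[start:start + 40] for start in range(0, 200, 40)]

k = 8
n_stacked = len(unique_kmers(stacked, k))
n_spread = len(unique_kmers(spread, k))
print(f"stacked reads: {len(stacked)} reads, {n_stacked} unique k-mers")
print(f"spread reads:  {len(spread)} reads, {n_spread} unique k-mers")
```

Both read sets would contribute the same read count to a classifier report, but the stacked reads add no k-mers beyond those of a single read. Reads piled onto one locus are a pattern often associated with artefacts or low-complexity matches, whereas reads distributed across a genome yield many more distinct k-mers.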

To test the reliability of detection based on known data, the CAMI dataset was used to detect the causative pathogen, CCHFV [59]. In addition to the causative pathogen, HIV-1 was also successfully detected. The dataset nevertheless illustrates the complexity of interpreting such results: the read classification result was below the threshold, the contig classification method was negative for the causative pathogen, and the mapping method confirmed the detection made by read classification.
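The way the three methods complement each other on the CAMI dataset can be summarised in a small evidence record. The Python sketch below is illustrative bookkeeping and not a decision rule used by MetaAll. The `Evidence` class is hypothetical, and the true/false values restate the CAMI results reported in this article (CCHFV: read classification and mapping positive, contig classification negative; HIV-1: all three positive).

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    taxon: str
    read_classification: bool
    contig_classification: bool
    reference_mapping: bool

    def supporting_methods(self):
        """Return the names of the methods that reported this taxon."""
        checks = {
            "read classification": self.read_classification,
            "contig classification": self.contig_classification,
            "reference mapping": self.reference_mapping,
        }
        return [name for name, positive in checks.items() if positive]


cami = [
    Evidence("CCHFV", True, False, True),
    Evidence("HIV-1", True, True, True),
]
for ev in cami:
    methods = ev.supporting_methods()
    print(f"{ev.taxon}: {len(methods)}/3 methods ({', '.join(methods)})")
```

The causative pathogen is supported by fewer methods than HIV-1 here, so a simple count of supporting methods cannot replace the joint clinical and microbiological interpretation argued for in this Discussion.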

An important strength of the presented workflow is that it can run with lower hardware settings (8 threads and 16 GB of memory), which makes it suitable for most available computational configurations. Although lower settings increase the computation time for analysing short PE reads, they do not have a drastic effect when analysing long reads.

While the proposed MetaAll approach is useful and comprehensive, it also has some limitations. In case 1, Human rhinovirus C54 was not detected by the read classification method. During the investigation, it was found that the KrakenUniq standard database does not contain Human rhinovirus C54 (accession number KP282614.1). This problem could be solved by integrating a microbial nt or viral database [32, 86]. In case 2, no pathogen causing atypical meningoencephalitis could be reliably detected. Although some hits for K. pneumoniae were obtained with both classification methods, validation with read mapping showed that this bacterium was not present in the sample. In addition to the number of pathogen reads, abundance estimation could also be used for detection validation [87], but there is a possibility of misinterpretation, as highly abundant species do not always represent the true species and false positives are not necessarily limited to those with low abundance [77]. Therefore, the probability of a complete separation of false-positive and true-positive pathogens is still low when using only the mNGS method. This suggests that additional testing is required to confirm the presence of pathogens, with mNGS potentially complementing conventional methods. A comparative study of immunocompromised sepsis patients, reported in a literature review, found that compared with conventional culture methods, mNGS failed to detect additional pathogens in 12% of cases and gave false or insignificant results in 9% of cases [78].

In summary, we have introduced a novel three-step workflow (MetaAll) for the detection of DNA/RNA of clinically relevant pathogens with improved reliability. Our results highlight the complexity of interpreting mNGS results and show that a single bioinformatics method is neither always optimal nor always sufficient. When making decisions, it is crucial to consider all components of the analysis to minimize the likelihood of false-positive or false-negative results. The mNGS method in conjunction with MetaAll greatly enhances our ability to diagnose pathogen infections, explore microbial diversity and effectively respond to emerging pathogens with a higher degree of robustness and accuracy while maintaining adequate sensitivity. As sequencing technologies evolve, metagenomic sequencing will increasingly be used as a valuable tool in clinical diagnostics and will continue to drive the development of such tools.

Key Points

  • A novel three-step workflow was developed for the detection of DNA/RNA of potential pathogens in metagenomic samples.

  • The importance of combining read classification, contig validation and targeted reference mapping for more reliable detection of infectious agents in clinical metagenome samples was presented.

  • The demonstration has shown that it is not always sufficient to use a single method to detect pathogens and that a variety of components must be taken into account during identification.

Acknowledgements

The authors would like to thank Doroteja Vlaj for excellent mNGS wet-laboratory work.

Issue Section: Problem Solving Protocol

Contributor Information

Martin Bosilj, Institute of Microbiology and Immunology, Faculty of Medicine, University of Ljubljana, Zaloška cesta 4, 1000 Ljubljana, Slovenia.

Alen Suljič, Institute of Microbiology and Immunology, Faculty of Medicine, University of Ljubljana, Zaloška cesta 4, 1000 Ljubljana, Slovenia.

Samo Zakotnik, Institute of Microbiology and Immunology, Faculty of Medicine, University of Ljubljana, Zaloška cesta 4, 1000 Ljubljana, Slovenia.

Jan Slunečko, Institute of Microbiology and Immunology, Faculty of Medicine, University of Ljubljana, Zaloška cesta 4, 1000 Ljubljana, Slovenia.

Rok Kogoj, Institute of Microbiology and Immunology, Faculty of Medicine, University of Ljubljana, Zaloška cesta 4, 1000 Ljubljana, Slovenia.

Misa Korva, Institute of Microbiology and Immunology, Faculty of Medicine, University of Ljubljana, Zaloška cesta 4, 1000 Ljubljana, Slovenia.

 

Conflict of interest: None declared.

Funding

This work was supported by the Institute of Microbiology and Immunology, Faculty of Medicine, University of Ljubljana and Slovenian Research and Innovation Agency (grants P3–0083, J3–2515, J3–50101). The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript or in the decision to publish the results.

References

  • 1. Ye SH, Siddle KJ, Park DJ. et al. Benchmarking metagenomics tools for taxonomic classification. Cell 2019;178:779–94. 10.1016/j.cell.2019.07.010.
  • 2. John G, Sahajpal NS, Mondal AK. et al. Next-generation sequencing (NGS) in COVID-19: A tool for SARS-CoV-2 diagnosis, monitoring new strains and phylodynamic modeling in molecular epidemiology. Curr Issues Mol Biol 2021;43:845–67. 10.3390/cimb43020061.
  • 3. Vries JJC, Brown JR, Couto N. et al. Recommendations for the introduction of metagenomic next-generation sequencing in clinical virology, part II: Bioinformatic analysis and reporting. J Clin Virol 2021;138:104812.
  • 4. Carbo EC, Sidorov IA, Zevenhoven-Dobbe JC. et al. Coronavirus discovery by metagenomic sequencing: A tool for pandemic preparedness. J Clin Virol 2020;131:104594.
  • 5. Zhou P, Yang X-L, Wang X-G. et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 2020;579:270–3. 10.1038/s41586-020-2012-7.
  • 6. Gu W, Deng X, Lee M. et al. Rapid pathogen detection by metagenomic next-generation sequencing of infected body fluids. Nat Med 2021;27:115–24. 10.1038/s41591-020-1105-z.
  • 7. Forbes JD, Knox NC, Ronholm J. et al. Metagenomics: The next culture-independent game changer. Front Microbiol 2017;8:1069.
  • 8. Miao Q, Ma Y, Wang Q. et al. Microbiological diagnostic performance of metagenomic next-generation sequencing when applied to clinical practice. Clin Infect Dis 2018;67:S231–40. 10.1093/cid/ciy693.
  • 9. Zhang H-C, Ai J-W, Cui P. et al. Incremental value of metagenomic next generation sequencing for the diagnosis of suspected focal infection in adults. J Infect 2019;79:419–25. 10.1016/j.jinf.2019.08.012.
  • 10. Diao Z, Han D, Zhang R. et al. Metagenomics next-generation sequencing tests take the stage in the diagnosis of lower respiratory tract infections. Journal of Advanced Research 2022;38:201–12. 10.1016/j.jare.2021.09.012.
  • 11. Ramesh A, Nakielny S, Hsu J. et al. Metagenomic next-generation sequencing of samples from pediatric febrile illness in Tororo, Uganda. PloS One 2019;14:e0218318.
  • 12. Zhou H, Larkin PMK, Zhao D. et al. Clinical impact of metagenomic next-generation sequencing of bronchoalveolar lavage in the diagnosis and Management of Pneumonia: A multicenter prospective observational study. J Mol Diagn 2021;23:1259–68. 10.1016/j.jmoldx.2021.06.007.
  • 13. Guo W, Cui X, Wang Q. et al. Clinical evaluation of metagenomic next-generation sequencing for detecting pathogens in bronchoalveolar lavage fluid collected from children with community-acquired pneumonia. Front Med 2022;9:952636.
  • 14. Chaitanya KV. Structure and Organization of Virus Genomes. Genome and Genomics: From Archaea to Eukaryotes. Springer, 2019; 1–30. 10.1007/978-981-15-0702-1_1.
  • 15. Mohsin H, Asif A, Fatima M. et al. Potential role of viral metagenomics as a surveillance tool for the early detection of emerging novel pathogens. Arch Microbiol 2021;203:865–72. 10.1007/s00203-020-02105-5.
  • 16. Raju RS, Al Nahid A, Chondrow Dev P. et al. VirusTaxo: Taxonomic classification of viruses from the genome sequence using k-mer enrichment. Genomics 2022;114:110414.
  • 17. Delwart EL. Viral metagenomics. Rev Med Virol 2007;17:115–31.
  • 18. Alavandi SV, Poornima M. Viral metagenomics: A tool for virus discovery and diversity in aquaculture. Indian J Virol 2012;23:88–98. 10.1007/s13337-012-0075-2.
  • 19. Slavov SN. Viral metagenomics for identification of emerging viruses in transfusion medicine. Viruses 2022;14:2448.
  • 20. Bidzhieva B, Zagorodnyaya T, Karagiannis K. et al. Deep sequencing approach for genetic stability evaluation of influenza a viruses. J Virol Methods 2014;199:68–75. 10.1016/j.jviromet.2013.12.018.
  • 21. Hall RJ, Draper JL, Nielsen FGG. et al. Beyond research: A primer for considerations on using viral metagenomics in the field and clinic. Front Microbiol 2015;6:224.
  • 22. Dutilh BE. Metagenomic ventures into outer sequence space. Bacteriophage 2014;4:e979664.
  • 23. Lewandowska DW, Zagordi O, Geissberger F-D. et al. Optimization and validation of sample preparation for metagenomic sequencing of viruses in clinical samples. Microbiome 2017;5:94.
  • 24. Wylie KM, Wylie TN, Buller R. et al. Detection of viruses in clinical samples by use of metagenomic sequencing and targeted sequence capture. J Clin Microbiol 2018;56:e01123-18.
  • 25. Hilton SK, Castro-Nallar E, Pérez-Losada M. et al. Metataxonomic and metagenomic approaches vs. culture-based techniques for clinical pathology. Front Microbiol 2016;7:484.
  • 26. Somasekar S, Lee D, Rule J. et al. Viral surveillance in serum samples from patients with acute liver failure by metagenomic next-generation sequencing. Clin Infect Dis 2017;65:1477–85. 10.1093/cid/cix596.
  • 27. Zhang J, Gao L, Zhu C. et al. Clinical value of metagenomic next-generation sequencing by Illumina and nanopore for the detection of pathogens in bronchoalveolar lavage fluid in suspected community-acquired pneumonia patients. Front Cell Infect Microbiol 2022;12:1021320.
  • 28. Greninger AL, Naccache SN, Federman S. et al. Rapid metagenomic identification of viral pathogens in clinical samples by real-time nanopore sequencing analysis. Genome Med 2015;7:99.
  • 29. Lee H-J, Cho I-S, Jeong R-D. Nanopore metagenomics sequencing for rapid diagnosis and characterization of lily viruses. Plant Pathol J 2022;38:503–12. 10.5423/PPJ.OA.06.2022.0084.
  • 30. Vries JJC, Brown JR, Fischer N. et al. Benchmark of thirteen bioinformatic pipelines for metagenomic virus diagnostics using datasets from clinical samples. J Clin Virol 2021;141:104908.
  • 31. Junier T, Huber M, Schmutz S. et al. Viral metagenomics in the clinical realm: Lessons learned from a Swiss-wide ring trial. Genes 2019;10:655.
  • 32. Breitwieser FP, Baker DN, Salzberg SL. KrakenUniq: Confident and fast metagenomics classification using unique k-mer counts. Genome Biol 2018;19:198. 10.1186/s13059-018-1568-0.
  • 33. Bağcı C, Patz S, Huson DH. DIAMOND+MEGAN: Fast and easy taxonomic and functional analysis of short and long microbiome sequences. Current Protocols 2021;1:e59.
  • 34. Miller RR, Montoya V, Gardy JL. et al. Metagenomics for pathogen detection in public health. Genome Med 2013;5:81.
  • 35. Lu J, Rincon N, Wood DE. et al. Metagenome analysis using the kraken software suite. Nat Protoc 2022;17:2815–39. 10.1038/s41596-022-00738-y.
  • 36. Alawi M, Burkhardt L, Indenbirken D. et al. DAMIAN: An open source bioinformatics tool for fast, systematic and cohort based analysis of microorganisms in diagnostic samples. Sci Rep 2019;9:16841.
  • 37. Xia Y, Li X, Wu Z. et al. Strategies and tools in illumina and nanopore-integrated metagenomic analysis of microbiome data. iMeta 2023;2:e72.
  • 38. Rodríguez-Brazzarola P, Pérez-Wohlfeil E, Díaz-del-Pino S. et al. Analyzing the differences between reads and contigs when performing a taxonomic assignment comparison in metagenomics. Bioinformatics and Biomedical Engineering 2018;10813:450–60.
  • 39. Tamames J, Cobo-Simón M, Puente-Sánchez F. Assessing the performance of different approaches for functional and taxonomic annotation of metagenomes. BMC Genomics 2019;20:960.
  • 40. Danecek P, Bonfield JK, Liddle J. et al. Twelve years of SAMtools and BCFtools. Gigascience 2021;10:giab008.
  • 41. Kim K, Park K, Lee S. et al. VirPipe: An easy-to-use and customizable pipeline for detecting viral genomes from nanopore sequencing. Bioinformatics 2023;39:btad293.
  • 42. Mölder F, Jablonski KP, Letcher B. et al. Sustainable data analysis with Snakemake. F1000Res 2021;10:33.
  • 43. Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PloS One 2017;12:e0177459.
  • 44. Andrews S. FastQC: A Quality Control Tool for High Throughput Sequence Data, 2010. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc.
  • 45. Ewels P, Magnusson M, Lundin S. et al. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics 2016;32:3047–8. 10.1093/bioinformatics/btw354.
  • 46. Bushnell B. BBMap: A Fast, Accurate, Splice-Aware Aligner, 2014. Available online at: https://www.osti.gov/biblio/1241166.
  • 47. Langmead B, Salzberg SL. Fast gapped-read alignment with bowtie 2. Nat Methods 2012;9:357–9. 10.1038/nmeth.1923.
  • 48. De Coster W, D’Hert S, Schultz DT. et al. NanoPack: Visualizing and processing long-read sequencing data. Bioinformatics 2018;34:2666–9. 10.1093/bioinformatics/bty149.
  • 49. Bonenfant Q, Noé L, Touzet H. Porechop_ABI: Discovering Unknown Adapters in ONT Sequencing Reads for Downstream Trimming, 2022. 10.1101/2022.07.07.499093.
  • 50. Li H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 2021;37:4572–4. 10.1093/bioinformatics/btab705.
  • 51. Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a web browser. BMC Bioinformatics 2011;12:385.
  • 52. Breitwieser FP, Salzberg SL. Pavian: interactive analysis of metagenomics data for microbiomics and pathogen identification. Bioinformatics 2016;36:1303–4. 10.1093/bioinformatics/btz715. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Nurk S, Meleshko D, Korobeynikov A. et al. metaSPAdes: A new versatile metagenomic assembler. Genome Res 2017;27:824–34. 10.1101/gr.213959.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Sayers EW, Bolton EE, Brister JR. et al. Database resources of the national center for biotechnology information. Nucleic Acids Res 2022;50:D20–6. 10.1093/nar/gkab1112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Morgulis A, Coulouris G, Raytselis Y. et al. Database indexing for production MegaBLAST searches. Bioinformatics 2008;24:1757–64. 10.1093/bioinformatics/btn322. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Mikheenko A, Saveliev V, Gurevich A. MetaQUAST: Evaluation of metagenome assemblies. Bioinformatics 2016;32:1088–90. 10.1093/bioinformatics/btv697. [DOI] [PubMed] [Google Scholar]
  • 57. Kolmogorov M, Bickhart DM, Behsaz B. et al. metaFlye: Scalable long-read metagenome assembly using repeat graphs. Nat Methods 2020;17:1103–10. 10.1038/s41592-020-00971-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. Li H, Durbin R. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics 2009;25:1754–60. 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Meyer F, Fritz A, Deng Z-L. et al. Critical assessment of metagenome interpretation: The second round of challenges. Nat Methods 2022;19:429–40. 10.1038/s41592-022-01431-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Chrzastek K, Lee D-H, Smith D. et al. Use of sequence-independent, single-primer-amplification (SISPA) for rapid detection, identification, and characterization of avian RNA viruses. Virology 2017;509:159–66. 10.1016/j.virol.2017.06.019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Moore SC, Penrice-Randal R, Alruwaili M. et al. Amplicon-based detection and sequencing of SARS-CoV-2 in nasopharyngeal swabs from patients with COVID-19 and identification of deletions in the viral genome that encode proteins involved in interferon antagonism. Viruses 2020;12:1164. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62. Marić J, Križanović K, Riondet S. et al. Comparative analysis of metagenomic classifiers for long-read sequencing datasets. BMC Bioinformatics 2024;25:15. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. Smith RH, Glendinning L, Walker AW. et al. Investigating the impact of database choice on the accuracy of metagenomic read classification for the rumen microbiome. Animal Microbiome 2022;4:57. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Portik DM, Brown CT, Pierce-Ward NT. Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets. BMC Bioinformatics 2022;23:541. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65. Tran Q, Phan V. Assembling reads improves taxonomic classification of species. Genes (Basel) 2020;11:946. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Stamouli S, Beber ME, Normark T. et al. nf-core/taxprofiler: highly parallelised and flexible pipeline for metagenomic taxonomic classification and profiling. 2023; 2023.10.20.563221. 10.1101/2023.10.20.563221. [DOI]
  • 67. Rosenboom I, Scheithauer T, Friedrich FC. et al. Wochenende—modular and flexible alignment-based shotgun metagenome analysis. BMC Genomics 2022;23:748. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68. Chrisman B, He C, Jung J-Y. et al. The human “contaminome”: Bacterial, viral, and computational contamination in whole genome sequences from 1000 families. Sci Rep 2022;12:9863. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69. Sangiovanni M, Granata I, Thind AS. et al. From trash to treasure: Detecting unexpected contamination in unmapped NGS data. BMC Bioinformatics 2019;20:168. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70. Ashokan A, Papanicolas LE, Leong LEX. et al. Case report: Identification of intra-laboratory blood culture contamination with Staphylococcus aureus by whole genome sequencing. Diagn Microbiol Infect Dis 2019;94:331–3. 10.1016/j.diagmicrobio.2019.02.016. [DOI] [PubMed] [Google Scholar]
  • 71. Strong MJ, Xu G, Morici L. et al. Microbial contamination in next generation sequencing: Implications for sequence-based analysis of clinical samples. PLoS Pathog 2014;10:e1004437. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72. Liang X, Wang Q, Liu J. et al. Coinfection of SARS-CoV-2 and influenza a (H3N2) detected in bronchoalveolar lavage fluid of a patient with long COVID using metagenomic next−generation sequencing: A case report. Front Cell Infect Microbiol 2023;13:1224794. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. Chen Y, Fan L-C, Chai Y-H. et al. Advantages and challenges of metagenomic sequencing for the diagnosis of pulmonary infectious diseases. Clin Respir J 2022;16:646. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74. Chen L, Liu W, Zhang Q. et al. RNA based mNGS approach identifies a novel human coronavirus from two individual pneumonia cases in 2019 Wuhan outbreak. Emerging Microbes & Infections 2020;9:313–9. 10.1080/22221751.2020.1725399. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75. Morsli M, Kerharo Q, Delerce J. et al. Haemophilus influenzae meningitis direct diagnosis by metagenomic next-generation sequencing: A case report. Pathogens 2021;10:461. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76. Lamprecht P, Fischer N, Huang J. et al. Changes in the composition of the upper respiratory tract microbial community in granulomatosis with polyangiitis. J Autoimmun 2019;97:29–39. 10.1016/j.jaut.2018.10.005. [DOI] [PubMed] [Google Scholar]
  • 77. Sun Z, Liu J, Zhang M. et al. Removal of false positives in metagenomics-based taxonomy profiling via targeting type IIB restriction sites. Nat Commun 2023;14:5321. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78. Li X, Liang S, Zhang D. et al. The clinical application of metagenomic next-generation sequencing in sepsis of immunocompromised patients. Frontiers in cellular and infection. Microbiology 2023;13:1170687. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79. Hogan CA, Yang S, Garner OB. et al. Clinical impact of metagenomic next-generation sequencing of plasma cell-free DNA for the diagnosis of infectious diseases: A multicenter retrospective cohort study. Clin Infect Dis 2021;72:239–45. 10.1093/cid/ciaa035. [DOI] [PubMed] [Google Scholar]
  • 80. Antipov D, Raiko M, Lapidus A. et al. MetaviralSPAdes: Assembly of viruses from metagenomic data. Bioinformatics 2020;36:4126–9. 10.1093/bioinformatics/btaa490. [DOI] [PubMed] [Google Scholar]
  • 81. Naccache SN, Federman S, Veeraraghavan N. et al. A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples. Genome Res 2014;24:1180–92. 10.1101/gr.171934.113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82. Deng X, Naccache SN, Ng T. et al. An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data. Nucleic Acids Res 2015;43:e46. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83. Kim D, Song L, Breitwieser FP. et al. Centrifuge: Rapid and sensitive classification of metagenomic sequences. Genome Res 2016;26:1721–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with kraken 2. Genome Biol 2019;20:257. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85. Zolfo M, Silverj A, Blanco-Míguez A. et al. Discovering and exploring the hidden diversity of human gut viruses using highly enriched virome samples. 2024. 10.1101/2024.02.19.580813. [DOI]
  • 86. Lewandowska DW, Zagordi O, Zbinden A. et al. Unbiased metagenomic sequencing complements specific routine diagnostic methods and increases chances to detect rare viral strains. Diagn Microbiol Infect Dis 2015;83:133–8. 10.1016/j.diagmicrobio.2015.06.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87. Charalampous T, Alcolea-Medina A, Snell LB. et al. Evaluating the potential for respiratory metagenomics to improve treatment of secondary infection and detection of nosocomial transmission on expanded COVID-19 intensive care units. Genome Med 2021;13:182. [DOI] [PMC free article] [PubMed] [Google Scholar]

Data Availability Statement

The sequencing data underlying this article will be shared on reasonable request to the corresponding author. Workflow and Singularity containers are available in the GitHub repository (https://github.com/NGS-bioinf/MetaAll).


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
