Abstract
The Professional Committee of Microbiology of the National Pharmacopoeia Commission organized the drafting of the Technical Guidelines for Microbial Whole Genome Sequencing (WGS), aiming to standardize the workflow and technical indicators of microbial WGS and ensure the accuracy of sequencing and identification. On the basis of the Guidelines, we developed an integrated microbial identification and source tracking (MIST) system, which can meet the needs of microbial identification and contamination investigation in food and drug quality control. MIST integrates three analysis pipelines: 16S/18S/internal transcribed spacer amplicon‐based microbial identification, WGS‐based microbial identification, and single‐nucleotide polymorphism‐based microbial source tracking. MIST can analyze sequence data in a variety of formats, such as Fasta, base call file, and FASTQ. It can be connected to a high‐throughput sequencing instrument to acquire sequencing data directly. We also developed a publicly accessible web server for MIST (http://syj.i-sanger.cn).
Microbial identification is of great value for clinical, epidemiological, food, and pharmaceutical research [1]. Traditionally, microbes have been identified based on their morphological, physical, and biochemical properties [2]. However, many prokaryotic microbes are difficult to culture [3] and thus cannot be detected by these traditional methods. These unculturable microbes are a potential source of novel metabolites and are essential components of natural metabolic networks [4]. Moreover, traditional methods also fail to detect novel culturable microbes and have problems in detecting unusual microbes that have not been comprehensively evaluated [5]. High‐throughput sequencing (HTS) technology has made sequence‐based genomics one of the routine and promising methods for microbial identification [6]. HTS‐based methods can be subdivided into two categories: amplicon sequencing [7], which amplifies conserved sequences in microbes (e.g., 16S ribosomal RNA [rRNA] genes for bacteria and 18S ribosomal DNA [rDNA]/internal transcribed spacer [ITS] regions for fungi), and whole genome sequencing (WGS) [8], which sequences the whole genome of a microbe after isolation. 16S rDNA‐based amplicon sequencing is an efficient method to investigate all bacteria in a sample because this region is the conventional marker for prokaryotic identification [9]. The community has accumulated a large number of well‐characterized 16S rDNA sequences in large databases, such as the Ribosomal Database Project [10] and SILVA [11]. A major limitation of amplicon sequencing is its lack of discrimination among closely related species [12]. WGS‐based bacterial identification provides higher discriminatory power and allows bacterial identification at the species or even strain level. It also provides a powerful way to investigate functional genes, such as antibiotic resistance genes (ARGs) [13, 14] and virulence factor genes (VFGs) [15].
Furthermore, multilocus sequence typing (MLST) [16] and single‐nucleotide polymorphism (SNP) analysis [17, 18, 19] enable source tracking of genetically closely related bacteria isolated from different sources. Such analysis enables WGS‐based applications in multiple fields, such as forensic investigations, strain identification, and outbreak tracking [20].
Currently, several web services and tools are available for microbial identification, for example, BacWGSTdb [21], ImageGP [22], Bacterial Analysis Pipeline (CGE) (https://cge.cbs.dtu.dk/services/cge/) [23], QIIME 2 [24], EasyAmplicon [25], GCType (GCM Type Strain Sequencing project), and rANOMALY [26]. Each has its own strengths and limitations. For example, BacWGSTdb offers MLST‐based and whole‐genome‐based bacterial genotyping but only accepts assembled genome files as input. CGE provides various tools for genome‐based phenotyping, phylogeny, and annotation of ARGs and VFGs. However, users must upload their FASTQ data to each of these tools separately because there is no integrated backend. Furthermore, all web‐based tools require a fast and stable internet connection to upload raw sequence files, which can be hundreds to thousands of megabytes in size [8]. As next‐generation sequencing (NGS) technology develops, downstream bioinformatics analysis remains challenging, and more software and systems need to be developed [27, 28].
Here, we present a system for the classification and identification of microbes. It implements pipelines for both amplicon sequencing data, which enable efficient profiling of unculturable microbes, and WGS data, which enable accurate genotyping of cultured microbes. The system also implements pipelines for MLST, SNP‐based source tracking, and ARG or VFG annotation from WGS data.
RESULTS
Overview of the system
The system consists of three pipelines: (1) amplicon‐based microbial identification using 16S rDNA, 18S rDNA, or ITS sequences, (2) WGS‐based microbial identification, and (3) SNP‐based source tracking. To initiate an analysis, users only need to upload sequencing files in base call file or FASTQ format generated by an Illumina sequencer, or Fasta‐formatted sequence files (such as assembled genomes or 16S sequences), to the server. Then, users create a task by selecting a pipeline and setting the corresponding parameters. Finally, the sequencing data and parameters are submitted to the server and trigger the analytic pipelines (Figure 1A). The system provides mainstream reference databases for microbial identification and functional annotation (Figure 1B). A data management system monitors the processing tasks and manages the database, including input and output files (Figure 1D). Users can view the task results in the online interactive analysis report interface and download the results for further use (Figure 1C).
Figure 1.
Overview of the MIST system. (A) Five data analysis steps implemented in MIST. (B) Integrated databases included in MIST. (C) Examples of data visualization. (D) Home page of the MIST web server. CAZy, carbohydrate‐active enzymes; MIST, microbial identification and source tracking; PE, paired end; rDNA, ribosomal DNA; SNP, single‐nucleotide polymorphism; VFDB, virulence factor database; WGS, whole genome sequencing.
Pipeline 1: 16S rDNA/18S rDNA/ITS amplicon‐based microbial identification
This pipeline can be used to identify microbes, cultured or uncultured, using 16S/18S rDNA and ITS regions. The pipeline contains “Quality Control,” “Primer Removal,” “Denoising,” “Annotation,” and “Evaluation” functional components. In short, Fastp v0.23.4 [29] was used to perform quality control and clean the paired‐end (PE) FASTQ reads by trimming and filtering reads based on their quality and length. The reads were truncated at any site receiving an average quality score of <20 over a 50 bp sliding window, and truncated reads shorter than 50 bp were discarded; reads containing ambiguous characters were also discarded. The resulting PE reads were merged on the server, followed by primer removal with an in‐house Python script, duplicate removal with vsearch v2.22.1 [30], and denoising with deblur v1.1.1 [31]. This procedure generates a set of amplicon sequence variants (ASVs), each of which was treated as a taxonomic unit. Each ASV was then aligned to the reference database using BLASTn v2.11.0 [32], and its taxonomic classification was assigned from the best‐hit match in that database. A phylogenetic tree was constructed by the maximum likelihood (ML) method. The workflow is illustrated in Figure 2A.
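The read‐filtering rule described above can be illustrated with a minimal Python sketch. It is not the Fastp implementation: the function name `truncate_read`, the choice to cut at the start of the first failing window, and the decision to test for ambiguous bases before truncation are assumptions made only for illustration.

```python
from typing import Optional, Sequence


def truncate_read(
    seq: str,
    quals: Sequence[int],
    window: int = 50,       # sliding-window width in bp (value from the text)
    min_mean_q: float = 20,  # windows with a mean quality below this fail
    min_len: int = 50,       # truncated reads shorter than this are discarded
) -> Optional[str]:
    """Return the truncated read, or None if the read should be discarded.

    Hypothetical helper: `quals` holds one Phred score per base of `seq`.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")

    # Assumption: any ambiguous base (N) discards the whole read.
    if "N" in seq.upper():
        return None

    cut = len(seq)
    # Slide the window from the 5' end; cut at the first window whose
    # mean quality drops below the threshold.
    for start in range(0, len(seq) - window + 1):
        mean_q = sum(quals[start:start + window]) / window
        if mean_q < min_mean_q:
            cut = start
            break

    trimmed = seq[:cut]
    return trimmed if len(trimmed) >= min_len else None
```

A read whose quality stays at or above the threshold is returned unchanged, whereas a read whose 3' end degrades is cut and then kept only if at least 50 bp remain.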
Figure 2.
Workflow of the three pipelines. (A) Amplicon‐based microbial identification, such as 16S rDNA/18S rDNA/ITS genes. (B) SNP‐based source tracking. (C) WGS‐based microbial identification. ANI, Average Nucleotide Identity; ARGs, antibiotic resistance genes; rDNA, ribosomal DNA; rRNA, ribosomal RNA; SNP, single‐nucleotide polymorphism; VFGs, virulence factor genes; WGS, whole genome sequencing.
We selected dozens of bacterial species from two different habitats, the human gut and marine environments, and generated corresponding simulated sequencing data based on the V3–V4, V4, and V4–V5 regions of the 16S rRNA gene. The performance of the amplicon identification program was tested on these simulated data. All the bacteria were identified correctly at the genus level (Table S1).
Pipeline 2: WGS‐based microbial identification
WGS has been increasingly used in basic research and clinical diagnostics. In our system, we used housekeeping genes and Average Nucleotide Identity (ANI) to identify microbial species and infer their phylogenetic relationships with other strains. The pipeline contains six modules: Quality control, Assembly, Gene prediction, ANI calculation, Annotation, and MLST.
Fastp was used for quality control and cleaning of the PE FASTQ reads. In the assembly step, SPAdes v3.11 [33] was used to assemble the genome; for contaminated samples, metaSPAdes v3.10 [34] was used instead. BUSCO v5.1 [35] was used to evaluate the completeness and contamination of the genomes.
We used Prodigal to predict the open reading frames and then translated them into protein sequences. HMMER v3.1b [36] was used to find the 31 single‐copy housekeeping genes (for the gene list, see “Amplicon database and genome reference database”) in the genome. The CARD v3.1.3 and carbohydrate‐active enzymes (CAZy) (updated 202001) [37] databases were used separately to identify possible ARGs and CAZymes, with an e‐value threshold of <1e−5. The virulence factor database (VFDB) 2022 was used to identify potential virulence factors for the identified pathogen strain.
The strategy for bacterial identification was as follows:
- 1. The sequences of the single‐copy housekeeping genes were extracted from the predicted genes after an HMM search against the profiles of the 31 single‐copy housekeeping genes.
- 2. Each housekeeping gene was searched with BLAST against the database of the 31 single‐copy housekeeping genes, keeping the top 200 BLAST results for each gene with e‐value < 1e−5 and with the same score and identity. For each species in the database, we then counted the number of housekeeping genes whose BLAST results included that species and ranked the species by this number. By default, the pipeline filtered out species supported by fewer than 15 housekeeping genes, but this value can be modified by users. This strategy can therefore identify not only individually cultured microbes but also contaminated samples.
- 3. The ANI value was calculated between the genome of the sample and each genome of the species selected by the above method, and only the maximum ANI value of each species was reported. For species that contained too many strains, we chose up to 1000 strains for ANI calculation.
- 4. Barrnap v0.9 (https://github.com/tseemann/barrnap) was used to predict 16S rDNA. The phylogenetic trees of 16S rDNA and housekeeping genes were built using IQ‐TREE v1.6.12 [38].
Further, if the species identified were included in the PubMLST database (http://pubmlst.org) [39], the molecular typing of the sample was analyzed automatically. The workflow is illustrated in Figure 2C.
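The species‐ranking and maximum‐ANI steps of the strategy above can be sketched in Python as follows. This is an illustrative reading of the description, not the MIST code: the function names, the input structures (a mapping from gene name to the species seen in that gene's retained BLAST hits, and a list of species/strain/ANI tuples), and the tie‐breaking by species name are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rank_species(
    hits_per_gene: Dict[str, Iterable[str]],
    min_genes: int = 15,  # default cutoff stated in the text
) -> List[Tuple[str, int]]:
    """Rank species by the number of housekeeping genes that hit them.

    `hits_per_gene` (hypothetical structure) maps each housekeeping gene to
    the species names found among its retained BLAST hits.
    """
    supporting_genes = defaultdict(set)
    for gene, species_list in hits_per_gene.items():
        for species in set(species_list):
            supporting_genes[species].add(gene)

    # Keep species supported by at least `min_genes` housekeeping genes.
    kept = [
        (species, len(genes))
        for species, genes in supporting_genes.items()
        if len(genes) >= min_genes
    ]
    # Highest support first; ties broken alphabetically (an assumption).
    return sorted(kept, key=lambda item: (-item[1], item[0]))


def best_ani_per_species(
    ani_records: Iterable[Tuple[str, str, float]],
) -> Dict[str, Tuple[str, float]]:
    """Return, for each species, the strain with the maximum ANI value.

    `ani_records` holds (species, strain, ani) tuples computed elsewhere.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for species, strain, ani in ani_records:
        if species not in best or ani > best[species][1]:
            best[species] = (strain, ani)
    return best
```

Because several species can pass the gene‐count cutoff at once, a sample containing more than one organism yields more than one candidate, which is how a cutoff of this kind can flag contaminated samples.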
This workflow was applied to a sample downloaded from the National Center for Biotechnology Information Sequence Read Archive database under accession number SRR12560292. The sample contained 1,418,820 reads, which were assembled into 46 scaffolds with a total assembly length of 2.76 Mbp. The hits of the 31 single‐copy housekeeping genes mapped predominantly to Staphylococcus aureus, and S. aureus S3 was the most closely related strain in the database. The MLST type was ST22, and a total of 142 genes associated with resistance to various antibiotics were identified in CARD, along with 462 virulence factors (Figure 3).
Figure 3.
An example of microbial identification by WGS. (A) The ANI and coverage of strains. (B) A phylogenetic tree of 16S rDNA genes of the target strain and the related strains. (C) The statistics of virulence factor genes. (D) The statistics of CAZyme genes. NGS raw data with SRA accession number SRR12560292 was used. ANI, Average Nucleotide Identity; CAZyme, carbohydrate‐active enzyme; NGS, next‐generation sequencing; rDNA, ribosomal DNA; SRA, Sequence Read Archive; WGS, whole genome sequencing.
The genomes of 560 ATCC standard strains were downloaded to test the accuracy of our identification procedure. The identification results of only five genomes were inconsistent with their assigned names, so 555 of the 560 genomes (99.1%) were identified consistently. Careful analysis showed that three of the discrepancies were caused by naming errors of the reference species in the database (the GTDB database has corrected their names based on WGS). For the other two, the nomenclature of the representative strains is disputed. In all cases, our identifications came from the highest‐scoring genomes in the database (Tables S2–S4).
Pipeline 3: SNP source tracking
In practice, in addition to microbial species identification, we also need to analyze the evolutionary relationship between different isolates of a certain species. For example, in a pharmaceutical factory environment, we can determine the source of strain contamination by analyzing the evolutionary distance between isolates.
Two modes for microbial source tracking by SNP phylogeny are integrated into the system, implemented through EToKi v1.2 [40] and kSNP v3.0 [18], respectively. In the EToKi mode, SNPs are called by comparing genomes to a reference genome, and the derived consensus sequence file is used to create an ML phylogeny. kSNP is a program for SNP identification and phylogenetic analysis that requires neither genome alignment nor a reference genome, which makes it more useful when the microorganisms concerned are unculturable or have a large intraspecies evolutionary distance. A phylogenetic tree view is provided in both modes. The workflow is illustrated in Figure 2B.
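The idea of comparing isolates by evolutionary distance can be illustrated with a small Python sketch that counts pairwise SNP differences between equal‐length aligned consensus sequences. This is not how either of the integrated tools builds its phylogeny; the function names, the toy isolate labels, and the rule of skipping positions where either base is not A, C, G, or T are assumptions for illustration only.

```python
from itertools import combinations
from typing import Dict, Tuple


def snp_distance(a: str, b: str) -> int:
    """Count positions where two aligned sequences carry different bases.

    Positions where either sequence has a gap or an ambiguous base are
    skipped (an assumption; other conventions exist).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x in valid and y in valid and x != y
    )


def pairwise_snp_distances(
    alignment: Dict[str, str],
) -> Dict[Tuple[str, str], int]:
    """Return the SNP distance for every pair of isolates in the alignment."""
    return {
        (n1, n2): snp_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(sorted(alignment), 2)
    }


# Toy alignment with hypothetical isolate names, not real data.
toy_alignment = {
    "isolate_A": "ACGTACGT",
    "isolate_B": "ACGTACGA",
    "isolate_C": "ACNTTCGA",
}
distances = pairwise_snp_distances(toy_alignment)
```

In a contamination investigation, isolates separated by very few SNPs are candidates for sharing a source, although any threshold for that decision depends on the species and is not specified here.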
Amplicon database and genome reference database
SILVA v138 and UNITE v8.0 [41] are integrated as the source of the amplicon reference database used in the microbial identification by 16S rDNA/18S rDNA/ITS pipeline and the microbial community diversity analysis pipeline. Details of the reference database are described in Table 1.
Table 1.
Details of the reference database for amplicon sequencing‐based pipelines.
Taxonomic group | Conserved region | Source | Number of sequences | Number of species |
---|---|---|---|---|
Bacteria and Archaea | 16S rDNA | SILVA 138 | 452,421 | 82,905 |
Eukaryote | 18S rDNA | SILVA 138 | 58,562 | 35,575 |
Fungi | ITS region | UNITE 8.0 | 887,397 | 33,032 |
Abbreviation: rDNA, ribosomal DNA.
In addition, we built a housekeeping gene database covering 223,491 bacterial RefSeq [42] genomes for fast and accurate microbial identification in the WGS workflow. Genes whose name or product matched one of the 31 single‐copy housekeeping genes (dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, and tsf) were extracted from each genome to construct the full database, which contains 6,855,279 amino acid sequences in total. This database of the 31 single‐copy housekeeping genes was used to identify probable species in the WGS pipeline.
CONCLUSION
WGS, amplicon sequencing, and metagenomic sequencing are increasingly used in research to produce complex environmental sequence data sets, which have paved the way for cultivation‐independent assessment and exploitation of the genetic content of entire communities of organisms [4, 42, 43, 44]. WGS‐ and amplicon‐based microbial species identification pipelines are therefore urgently needed in the field of food safety and drug control. Here, we provide a system that analyzes WGS and amplicon sequences for microbial identification, MLST typing, and SNP source tracking. One important potential use of the WGS microbial identification pipeline is to identify contaminated sequences or metagenome samples. It also has great value in speeding up pathogen detection in clinical laboratories, where existing identification and taxonomy methods may be unreliable with contaminated samples.
AUTHOR CONTRIBUTIONS
Meicheng Yang, Feng Qin, and Yi Ren conceived the system and idea. Linmeng Liu and Hao Gao implemented the MIST main code. Chang Han and Dan Zhang designed the graphical user interface. Minghui Song, Chang Han, and Linmeng Liu wrote the manuscript. Yi Ren, Chang Han, Qiongqiong Li, and Yiling Fan were responsible for editing and revising the manuscript. All authors contributed to the development of MIST.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflict of interest.
Supporting information
Supporting information.
ACKNOWLEDGMENTS
We are grateful to Zhuo Yang for the graphical user interface development. This work was supported by the grants from the Science and Technology Commission of Shanghai Municipality (22142201600 and 20DZ2293600), the Open Fund Project of NMPA Key Laboratory for Testing Technology of Pharmaceutical Microbiology (2021‐WSW‐01), and the Standard Improvement Project of Chinese Pharmacopoeia Commission (2022Y21 and 2023Y36).
Song, Minghui, Han Chang, Liu Linmeng, Li Qiongqiong, Fan Yiling, Gao Hao, Zhang Dan, Ren Yi, Qin Feng, and Yang Meicheng. 2023. “MIST: A Microbial Identification and Source Tracking System for Next‐Generation Sequencing Data.” iMeta 2: e146. 10.1002/imt2.146
Minghui Song, Chang Han, and Linmeng Liu contributed equally.
Contributor Information
Chang Han, Email: goodluckhc@163.com.
Yi Ren, Email: upforpunkin@gmail.com.
Feng Qin, Email: sifdcqinf@163.com.
Meicheng Yang, Email: yangmeicheng@vip.sina.com.
DATA AVAILABILITY STATEMENT
Supplementary materials (tables, scripts, graphical abstracts, slides, videos, Chinese translated version, and update materials) may be found in the online DOI or iMeta Science http://www.imeta.science/.
REFERENCES
- 1. Rosselló‐Móra, Ramon, and Amann Rudolf. 2015. “Past and Future Species Definitions for Bacteria and Archaea.” Systematic and Applied Microbiology 38: 209–216. 10.1016/j.syapm.2015.02.001
- 2. Barco, R. A., Garrity G. M., Scott J. J., Amend J. P., Nealson K. H., and Emerson D. 2020. “A Genus Definition for Bacteria and Archaea Based on a Standard Genome Relatedness Index.” mBio 11: e02475‐19. 10.1128/mbio.02475-19
- 3. Sleator, Roy D., Shortall C., and Hill C. 2008. “Metagenomics.” Letters in Applied Microbiology 47: 361–366. 10.1111/j.1472-765X.2008.02444.x
- 4. Simon, Carola, and Daniel Rolf. 2011. “Metagenomic Analyses: Past and Future Trends.” Applied and Environmental Microbiology 77: 1153–1161. 10.1128/AEM.02345-10
- 5. Drancourt, M., Berger P., and Raoult D. 2004. “Systematic 16S rRNA Gene Sequencing of Atypical Clinical Isolates Identified 27 New Bacterial Species Associated With Humans.” Journal of Clinical Microbiology 42: 2197–2202. 10.1128/jcm.42.5.2197-2202.2004
- 6. Chun, Jongsik, Oren Aharon, Ventosa Antonio, Christensen Henrik, Arahal David Ruiz, da Costa Milton S., Rooney Alejandro P., et al. 2018. “Proposed Minimal Standards for the Use of Genome Data for the Taxonomy of Prokaryotes.” International Journal of Systematic and Evolutionary Microbiology 68: 461–466. 10.1099/ijsem.0.002516
- 7. Yarza, Pablo, Yilmaz Pelin, Pruesse Elmar, Glöckner Frank Oliver, Ludwig Wolfgang, Schleifer Karl‐Heinz, Whitman William B., et al. 2014. “Uniting the Classification of Cultured and Uncultured Bacteria and Archaea Using 16S rRNA Gene Sequences.” Nature Reviews Microbiology 12: 635–645. 10.1038/nrmicro3330
- 8. Xavier, Basil B., Mysara Mohamed, Bolzan Mattia, Ribeiro‐Gonçalves Bruno, Alako Blaise T. F., Harrison Peter, Lammens Christine, et al. 2020. “BacPipe: A Rapid, User‐Friendly Whole‐Genome Sequencing Pipeline for Clinical Diagnostic Bacteriology.” iScience 23: 100769. 10.1016/j.isci.2019.100769
- 9. Woo, Patrick C. Y., Ng Kenneth H. L., Lau Susanna K. P., Yip Kam‐tong, Fung Ami M. Y., Leung Kit‐wah, Tam Dorothy M. W., Que Tak‐lun, and Yuen Kwok‐yung. 2003. “Usefulness of the Microseq 500 16S Ribosomal DNA‐Based Bacterial Identification System for Identification of Clinically Significant Bacterial Isolates With Ambiguous Biochemical Profiles.” Journal of Clinical Microbiology 41: 1996–2001. 10.1128/jcm.41.5.1996-2001.2003
- 10. Cole, James R., Wang Qiong, Fish Jordan A., Chai Benli, McGarrell Donna M., Sun Yanni, Brown C. Titus, Porras‐Alfaro Andrea, Kuske Cheryl R., and Tiedje James M. 2014. “Ribosomal Database Project: Data and Tools for High Throughput rRNA Analysis.” Nucleic Acids Research 42: D633–D642. 10.1093/nar/gkt1244
- 11. Quast, Christian, Pruesse Elmar, Yilmaz Pelin, Gerken Jan, Schweer Timmy, Yarza Pablo, Peplies Jörg, and Glöckner Frank Oliver. 2012. “The SILVA Ribosomal RNA Gene Database Project: Improved Data Processing and Web‐Based Tools.” Nucleic Acids Research 41: D590–D596. 10.1093/nar/gks1219
- 12. Petti, Cathy A. 2007. “Detection and Identification of Microorganisms by Gene Amplification and Sequencing.” Clinical Infectious Diseases 44: 1108–1114. 10.1086/512818
- 13. Alcock, Brian P., Raphenya Amogelang R., Lau Tammy T. Y., Tsang Kara K., Bouchard Mégane, Edalatmand Arman, Huynh William, et al. 2019. “CARD 2020: Antibiotic Resistome Surveillance With the Comprehensive Antibiotic Resistance Database.” Nucleic Acids Research 48: D517–D525. 10.1093/nar/gkz935
- 14. Hunt, Martin, Mather Alison E., Sánchez‐Busó Leonor, Page Andrew J., Parkhill Julian, Keane Jacqueline A., and Harris Simon R. 2017. “ARIBA: Rapid Antimicrobial Resistance Genotyping Directly from Sequencing Reads.” Microbial Genomics 3: e000131. 10.1099/mgen.0.000131
- 15. Chen, Lihong, Zheng Dandan, Liu Bo, Yang Jian, and Jin Qi. 2015. “VFDB 2016: Hierarchical and Refined Dataset for Big Data analysis—10 Years on.” Nucleic Acids Research 44: D694–D697. 10.1093/nar/gkv1239
- 16. Jolley, Keith A., Bray James E., and Maiden Martin C. J. 2018. “Open‐Access Bacterial Population Genomics: BIGSdb Software, the PubMLST.org Website and Their Applications.” Wellcome Open Research 3: 124. 10.12688/wellcomeopenres.14826.1
- 17. Jagadeesan, Balamurugan, Baert Leen, Wiedmann Martin, and Orsi Renato H. 2019. “Comparative Analysis of Tools and Approaches for Source Tracking Listeria Monocytogenes in a Food Facility Using Whole‐Genome Sequence Data.” Frontiers in Microbiology 10: 947. 10.3389/fmicb.2019.00947
- 18. Gardner, Shea N., Slezak Tom, and Hall Barry G. 2015. “kSNP3.0: SNP Detection and Phylogenetic Analysis of Genomes Without Genome Alignment or Reference Genome.” Bioinformatics 31: 2877–2878. 10.1093/bioinformatics/btv271
- 19. Altmann, André, Weber Peter, Bader Daniel, Preuß Michael, Binder Elisabeth B., and Müller‐Myhsok Bertram. 2012. “A Beginners Guide to SNP Calling from High‐Throughput DNA‐sequencing Data.” Human Genetics 131: 1541–1554. 10.1007/s00439-012-1213-z
- 20. Gardner, Shea N., and Hall Barry G. 2013. “When Whole‐Genome Alignments Just Won't Work: kSNP v2 Software for Alignment‐Free SNP Discovery and Phylogenetics of Hundreds of Microbial Genomes.” PLoS ONE 8: e81760. 10.1371/journal.pone.0081760
- 21. Ruan, Zhi, and Feng Ye. 2016. “BacWGSTdb, a Database for Genotyping and Source Tracking Bacterial Pathogens.” Nucleic Acids Research 44: D682–D687. 10.1093/nar/gkv1004
- 22. Chen, Tong, Liu Yong‐Xin, and Huang Luqi. 2022. “ImageGP: An Easy‐to‐Use Data Visualization Web Server for Scientific Researchers.” iMeta 1: e5. 10.1002/imt2.5
- 23. Thomsen, Martin Christen Frølund, Ahrenfeldt Johanne, Cisneros Jose Luis Bellod, Jurtz Vanessa, Larsen Mette Voldby, Hasman Henrik, Aarestrup Frank Møller, and Lund Ole. 2016. “A Bacterial Analysis Platform: An Integrated System for Analysing Bacterial Whole Genome Sequencing Data for Clinical Diagnostics and Surveillance.” PLoS ONE 11: e0157718. 10.1371/journal.pone.0157718
- 24. Bolyen, Evan, Rideout Jai Ram, Dillon Matthew R., Bokulich Nicholas A., Abnet Christian C., Al‐Ghalith Gabriel A., Alexander Harriet, et al. 2019. “Reproducible, Interactive, Scalable and Extensible Microbiome Data Science Using QIIME 2.” Nature Biotechnology 37: 852–857. 10.1038/s41587-019-0209-9
- 25. Liu, Yong‐Xin, Chen Lei, Ma Tengfei, Li Xiaofang, Zheng Maosheng, Zhou Xin, Chen Liang, et al. 2023. “EasyAmplicon: An Easy‐to‐Use, Open‐Source, Reproducible, and Community‐Based Pipeline for Amplicon Data Analysis in Microbiome Research.” iMeta 2: e83. 10.1002/imt2.83
- 26. Theil, Sebastien, and Rifa Etienne. 2021. “rANOMALY: Amplicon Workflow for Microbial Community Analysis.” F1000Research 10: 7. 10.12688/f1000research.27268.1
- 27. Deurenberg, Ruud H., Bathoorn Erik, Chlebowicz Monika A., Couto Natacha, Ferdous Mithila, García‐Cobos Silvia, Kooistra‐Smid Anna M. D., et al. 2017. “Application of Next Generation Sequencing in Clinical Microbiology and Infection Prevention.” Journal of Biotechnology 243: 16–24. 10.1016/j.jbiotec.2016.12.022
- 28. Muir, Paul, Li Shantao, Lou Shaoke, Wang Daifeng, Spakowicz Daniel J., Salichos Leonidas, Zhang Jing, et al. 2016. “The Real Cost of Sequencing: Scaling Computation to Keep Pace With Data Generation.” Genome Biology 17: 53. 10.1186/s13059-016-0917-0
- 29. Chen, Shifu, Zhou Yanqing, Chen Yaru, and Gu Jia. 2018. “fastp: An Ultra‐Fast All‐in‐One FASTQ Preprocessor.” Bioinformatics 34: i884–i890. 10.1093/bioinformatics/bty560
- 30. Rognes, Torbjørn, Flouri Tomáš, Nichols Ben, Quince Christopher, and Mahé Frédéric. 2016. “VSEARCH: A Versatile Open Source Tool for Metagenomics.” PeerJ 4: e2584. 10.7717/peerj.2584
- 31. Nearing, Jacob T., Douglas Gavin M., Comeau André M, and Langille Morgan G. I. 2018. “Denoising the Denoisers: An Independent Evaluation of Microbiome Sequence Error‐Correction Approaches.” PeerJ 6: e5364. 10.7717/peerj.5364
- 32. McGinnis, Scott, and Madden Thomas L. 2004. “BLAST: At the Core of a Powerful and Diverse Set of Sequence Analysis Tools.” Nucleic Acids Research 32: W20–W25. 10.1093/nar/gkh435
- 33. Prjibelski, Andrey, Antipov Dmitry, Meleshko Dmitry, Lapidus Alla, and Korobeynikov Anton. 2020. “Using SPAdes De Novo Assembler.” Current Protocols in Bioinformatics 70: e102. 10.1002/cpbi.102
- 34. Nurk, Sergey, Meleshko Dmitry, Korobeynikov Anton, and Pevzner Pavel A. 2017. “metaSPAdes: A New Versatile Metagenomic Assembler.” Genome Research 27: 824–834. 10.1101/gr.213959.116
- 35. Seppey, Mathieu, Manni Mosè, and Zdobnov Evgeny M. 2019. “BUSCO: Assessing Genome Assembly and Annotation Completeness.” Methods in Molecular Biology 1962: 227–245. 10.1007/978-1-4939-9173-0_14
- 36. Potter, Simon C., Luciani Aurélien, Eddy Sean R., Park Youngmi, Lopez Rodrigo, and Finn Robert D. 2018. “HMMER Web Server: 2018 Update.” Nucleic Acids Research 46: W200–W204. 10.1093/nar/gky448
- 37. Lombard, Vincent, Golaconda Ramulu Hemalatha, Drula Elodie, Coutinho Pedro M., and Henrissat Bernard. 2014. “The Carbohydrate‐Active Enzymes Database (CAZy) in 2013.” Nucleic Acids Research 42: D490–D495. 10.1093/nar/gkt1178
- 38. Nguyen, Lam‐Tung, Schmidt Heiko A., von Haeseler Arndt, and Minh Bui Quang. 2015. “IQ‐TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum‐Likelihood Phylogenies.” Molecular Biology and Evolution 32: 268–274. 10.1093/molbev/msu300
- 39. Margos, Gabriele, Binder Katrin, Dzaferovic Eldina, Hizo‐Teufel Cecilia, Sing Andreas, Wildner Manfred, Fingerle Volker, and Jolley Keith A. 2015. “PubMLST.org—The New Home for the Borrelia MLSA Database.” Ticks and Tick‐Borne Diseases 6: 869–871. 10.1016/j.ttbdis.2015.06.007
- 40. Zhou, Zhemin, Alikhan Nabil‐Fareed, Mohamed Khaled, Fan Yulei, and Achtman Mark. 2020. “The EnteroBase User's Guide, With Case Studies on Salmonella Transmissions, Yersinia pestis Phylogeny, and Escherichia Core Genomic Diversity.” Genome Research 30: 138–152. 10.1101/gr.251678.119
- 41. Nilsson, Rolf Henrik, Larsson Karl‐Henrik, Taylor Andy F. S., Bengtsson‐Palme Johan, Jeppesen Thomas S., Schigel Dmitry, Kennedy Peter, et al. 2019. “The UNITE Database for Molecular Identification of Fungi: Handling Dark Taxa and Parallel Taxonomic Classifications.” Nucleic Acids Research 47: D259–D264. 10.1093/nar/gky1022
- 42. Tatusova, Tatiana, Ciufo Stacy, Federhen Scott, Fedorov Boris, McVeigh Richard, O'Neill Kathleen, Tolstoy Igor, and Zaslavsky Leonid. 2015. “Update on RefSeq Microbial Genomes Resources.” Nucleic Acids Research 43: D599–D605. 10.1093/nar/gku1062
- 43. Tagini, F., and Greub G. 2017. “Bacterial Genome Sequencing in Clinical Microbiology: A Pathogen‐Oriented Review.” European Journal of Clinical Microbiology & Infectious Diseases 36: 2007–2020. 10.1007/s10096-017-3024-6
- 44. Thomas, Torsten, Gilbert Jack, and Meyer Folker. 2012. “Metagenomics—A Guide from Sampling to Data Analysis.” Microbial Informatics and Experimentation 2: 3. 10.1186/2042-5783-2-3