Abstract
Over half of community-acquired pneumonia cases are caused by a few dozen bacterial species, and accurate identification of these pathogens is essential for effective treatment. In this study, we developed a reliable diagnostic method using 16S ribosomal RNA (16S rRNA) sequencing that takes into account intra-species variation, the need to differentiate Streptococcus pneumoniae from oral α-hemolytic streptococci, and applicability to the battlefield hypothesis, which helps distinguish true pathogens from commensal organisms that are not causative. We designed specific primers and a BLAST wrapper program, Cheryblast + ob, to classify 37 pneumonia-causing bacteria and 4 α-hemolytic streptococci. In simulation experiments involving a total of 20,309 copies of the 16S rRNA from 41 species of bacteria deposited in GenBank, the algorithm achieved a sensitivity greater than 0.996 and a specificity of 1.000. It was robust against sequencing errors and successfully distinguished S. pneumoniae from closely related species. In an experiment using next-generation sequencing on artificial mixtures containing genomic DNA from 10 bacterial species and human DNA at varying two-fold ratios, the species with the highest copy number was correctly identified in 8 out of 11 samples, and the top two species by copy number were identified in all 11 samples. This high-performance method offers a promising tool for accurate pneumonia diagnosis and could also be applied to other infections in which a limited number of bacterial species must be reliably identified.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-025-14841-z.
Keywords: Pneumonia, 16S ribosomal RNA, Next-generation sequencing, Streptococcus pneumoniae, Battlefield hypothesis
Subject terms: Infectious-disease diagnostics, Infectious diseases, Respiratory tract diseases
Introduction
Community-acquired pneumonia (hereafter referred to as pneumonia) is one of the leading causes of death worldwide. Clinically, it is often characterized by (1) acute symptoms such as fever, cough, sputum, and chest pain, and (2) the appearance of infiltrative shadows in imaging studies1,2. After diagnosis, the search for the causative pathogens is typically performed using airway secretions, such as sputum. Although a few dozen bacteria are responsible for over half of pneumonia cases, pathogen identification remains challenging, with the causative agent remaining unknown in about one-third of patients3.
The causative pathogens of pneumonia can be broadly classified into commensal and non-commensal organisms. M. tuberculosis and M. pneumoniae are non-commensal organisms; therefore, their detection in respiratory secretions strongly indicates causality. In contrast, the major causative agents of pneumonia, such as S. pneumoniae, S. aureus, H. influenzae, P. aeruginosa, and M. catarrhalis, are commensal organisms. When such organisms are detected in airway secretions, it becomes essential to determine whether they are true pathogens or merely colonizers. This distinction is a major challenge in pneumonia diagnostics. To address this issue, we previously proposed the “battlefield hypothesis”4–6, which states that the ratio of bacteria to white blood cells in respiratory secretions reflects bacterial dominance at the site of inflammation. A higher ratio suggests a greater likelihood that the bacteria are causative pathogens. This hypothesis provides a means to assess the clinical relevance of detected bacteria.
The 16S ribosomal RNA (16S rRNA) sequence is a powerful tool for bacterial identification. Due to its high conservation among prokaryotes, a small set of PCR primers can amplify 16S rRNA from a broad spectrum of bacterial species. Although 16S rRNA sequencing may be too time-consuming for immediate treatment decisions, it is invaluable for epidemiological investigations of pneumonia. Furthermore, if no known pathogen is identified via 16S rRNA analysis, this may point to the involvement of atypical bacteria or viral pathogens.
However, several important factors must be considered when using 16S rRNA to identify pneumonia pathogens: (1) Even within a single bacterial species, 16S rRNA sequences can exhibit intra-species variation, which the identification algorithm must accommodate. (2) The 16S rRNA sequence of S. pneumoniae, the most common cause of pneumonia, closely resembles that of α-hemolytic streptococci, which are common oral commensals7. Therefore, accurate differentiation between these sequences is critical.
The aim of this study was to develop PCR primers and an analysis program that address these challenges: (1) robust species-level identification that considers intra-species variation, (2) precise discrimination between S. pneumoniae and α-hemolytic streptococci, and (3) compatibility with the battlefield hypothesis to assess the clinical relevance of detected organisms.
Methods
Installation of BLAST, clustal omega and ruby
BLAST (blastn and makeblastdb version 2.8.1+) was downloaded from the NCBI website https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html and stored in the /blast_directory/bin directory created on the local computer. Clustal Omega was downloaded from the Clustal Omega website http://www.clustal.org/omega/#Download. Ruby was downloaded from the Ruby website https://www.ruby-lang.org/en/downloads/. Each was installed as instructed on its website.
16S rRNA nucleotide sequences
The 16S rRNA nucleotide sequences were obtained from the nucleotide database in GenBank. In GenBank, many 16S rRNA entries have only partial sequences registered. To obtain full-length sequences, we first searched GenBank using queries that included the bacterial name and the word “complete” in the title, and “16S” anywhere in the entry. We then extracted the 16S rRNA regions from GenBank records in which the FEATURE field was annotated as rRNA with the product labeled “16S ribosomal RNA.” These steps were repeated until up to the first 1000 sequences had been retrieved for each species. For M. gordonae, which had only a small number of entries, we used sequences derived from whole genome sequencing data along with all GenBank rRNA entries for M. gordonae that were longer than 1400 bp. In summary, we obtained a total of 20,309 copies of the 16S rRNA from 41 species of bacteria.
16S rRNA consensus sequence
The 16S rRNA sequences that belong to a bacterial species were aligned using Clustal Omega to obtain the consensus sequence for that species. Subsequently, the consensus sequences for all bacterial species were aligned using Clustal Omega. We designed PCR primers based on the aligned sequences.
Match rate between PCR primers and 16S rRNA sequences
The match between either of the PCR primers and the 16S rRNA sequence was measured using the Levenshtein distance. A Levenshtein distance of 0 indicates a perfect match, 1 indicates a single mismatch, and 2 or more indicates a mismatch of 2 or more bases.
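The primer match measurement can be illustrated with a minimal sketch. It assumes that the primer is compared against every window of the same length in the 16S rRNA sequence and that the smallest Levenshtein distance is reported; the function names and the toy sequences are hypothetical and are not taken from Cheryblast + ob or from Supplementary Table S4.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def best_primer_distance(primer: str, sequence: str) -> int:
    """Smallest distance between the primer and any same-length window.

    Assumption: scanning fixed-length windows is a simplification; it is
    exact for substitutions and only approximate for indels.
    """
    k = len(primer)
    if len(sequence) < k:
        return levenshtein(primer, sequence)
    return min(levenshtein(primer, sequence[i:i + k])
               for i in range(len(sequence) - k + 1))


def match_category(distance: int) -> str:
    """Map a distance to the three categories used in the text."""
    if distance == 0:
        return "perfect match"
    if distance == 1:
        return "single mismatch"
    return "2 or more mismatches"


if __name__ == "__main__":
    toy_primer = "ACGT"          # hypothetical primer, for illustration only
    toy_target = "TTACCTTT"      # hypothetical target with one mismatch
    d = best_primer_distance(toy_primer, toy_target)
    print(d, match_category(d))
```

For real primers containing degenerate bases, the per-base comparison would need to treat IUPAC ambiguity codes as matches; that is outside this sketch.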
Local database of pneumonia-causing bacteria
The sequences flanked by PCR primers from the consensus sequences of each bacterial species were collected and used to create a local database of pneumonia-causing bacteria. In addition, we also registered the consensus sequences of four α-hemolytic streptococci (S. mitis, S. oralis, S. salivarius, and S. sanguinis) to differentiate S. pneumoniae from them, and the sequence of the human HSD11B2 gene for the battlefield hypothesis. The specificity of the S. pneumoniae-specific nucleotide sequence was further confirmed in six additional species of α-hemolytic streptococci (S. anginosus, S. constellatus, S. gordonii, S. intermedius, S. parasanguinis, and S. vestibularis).
Bacterial and human genomic DNA
The genomic DNA of Escherichia coli (12713G), Pseudomonas aeruginosa (106052G), and Klebsiella pneumoniae subsp. pneumoniae (14940G) were purchased from the Biological Resource Center, NITE (Chiba, Japan). The genomic DNA of Staphylococcus aureus subsp. aureus Rosenbach ATCC 700699D-5, Streptococcus pneumoniae (Klein) Chester ATCC BAA-255D-5, Moraxella catarrhalis (Frosch and Kolle) Bovre ATCC 25240D-5, Haemophilus influenzae (Lehmann and Neumann) Winslow et al. ATCC 51907D, Proteus mirabilis Hauser ATCC 12453D, Acinetobacter baumannii ATCC BAA-1605D-5, and Mycobacterium tuberculosis (Zopf) Lehmann and Neumann ATCC 27294D-2, were purchased from the American Type Culture Collection (Rockville, MD, USA). Human placental genomic DNA was purchased from Promega Corp. (Madison, WI, USA).
Nucleotide sequencing by a next-generation sequencer
Two sets of primer mixes that simultaneously amplify 16S rRNAs and the human HSD11B2 gene were prepared. The test solution was divided into two portions, PCR-amplified with both primer mixes, and then combined. Each reaction contained a primer mix (250 nM each), 1× PrimeSTAR GXL Buffer (Takara Bio Inc., Kyoto, Japan), 200 nM dNTPs, and 0.625 U PrimeSTAR GXL DNA Polymerase in a final volume of 25 µL. The cycling parameters were 94 °C for 120 s, followed by 30 cycles of 94 °C for 10 s, 55 °C for 15 s, and 68 °C for 35 s. The amplified products were purified using the Agencourt AMPure XP system (Beckman Coulter, CA, USA), and the nucleotide sequences were determined as described in a previous report (Inoue, Shiihara et al. 2017) using the MiSeq Reagent Kit V3 (Illumina). In that report, a second PCR was performed with primers that included an adapter sequence for loading onto the MiSeq, followed by a third PCR with indexed primers to distinguish multiple samples.
Cheryblast + ob analysis
Cheryblast + ob (check sequence query with BLAST+ for bacteria; deposited at 10.5281/zenodo.14759736) is a wrapper program for BLAST that we developed, incorporating a novel algorithm of our own design. It uses BLAST for the nucleotide homology analysis. The FASTQ files output from the MiSeq were analysed using Cheryblast + ob. Each read was classified as the 16S rRNA of one of the 37 pneumonia-causing bacteria, the 16S rRNA of one of the 4 α-hemolytic streptococci, the human HSD11B2 gene, or a sequence that did not match any of these categories.
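The read-by-read tallying described here can be sketched as follows. This is not the Cheryblast + ob implementation (which wraps BLAST); it only shows how four-line FASTQ records might be read and counted per category, with the classifier passed in as a placeholder function. The names `read_fastq`, `tally_reads`, and `toy_classify` are hypothetical.

```python
import io
from collections import Counter
from typing import Callable, Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # skip stray blank lines
            continue
        sequence = handle.readline().strip()
        handle.readline()       # '+' separator line
        handle.readline()       # quality string, not used in this sketch
        yield header.strip().lstrip("@"), sequence


def tally_reads(handle: TextIO, classify: Callable[[str], str]) -> Counter:
    """Count reads per category returned by the supplied classifier."""
    return Counter(classify(seq) for _, seq in read_fastq(handle))


def toy_classify(sequence: str) -> str:
    """Placeholder classifier; a real one would consult BLAST bit scores."""
    return "category A" if sequence.startswith("AC") else "unmatched"


if __name__ == "__main__":
    fastq_text = "@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nACCC\n+\nIIII\n"
    print(tally_reads(io.StringIO(fastq_text), toy_classify))
```

Passing the classifier as a parameter keeps the file handling separate from the classification rule, so the same tally could be reused with any scoring scheme.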
Effect of sequencing errors
To investigate the impact of sequencing errors on the classification of 16S rRNA, we introduced mutations at rates ranging from 0 bp/500 bp to 5 bp/500 bp into the 16S rRNA sequences downloaded from GenBank. We then identified bacterial species using Cheryblast + ob. This process was repeated 100 times to calculate the average sensitivity and specificity for the identification of bacterial species.
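One way to simulate such errors is sketched below. The text does not state how the mutations were distributed, so this sketch assumes independent random base substitutions at a per-base probability equal to the stated rate divided by 500; insertions and deletions are not modelled. The function name is hypothetical.

```python
import random


def introduce_substitutions(sequence: str, errors_per_500bp: float,
                            rng: random.Random) -> str:
    """Return a copy of the sequence with random base substitutions.

    Assumption: each base is substituted independently with probability
    errors_per_500bp / 500, always by a different base.
    """
    per_base = errors_per_500bp / 500.0
    mutated = []
    for base in sequence:
        if rng.random() < per_base:
            mutated.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)


if __name__ == "__main__":
    rng = random.Random(2025)    # fixed seed so the run is repeatable
    original = "ACGT" * 125      # toy 500-base sequence
    for rate in (0.0, 0.5, 5.0):
        mutated = introduce_substitutions(original, rate, rng)
        changed = sum(a != b for a, b in zip(original, mutated))
        print(rate, changed)
```

Repeating the mutation and classification many times, as described above, averages out the randomness of any single run.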
A simulated reaction using DNA mixtures mimicking airway secretions
A two-fold dilution series of genomic DNA was created, starting with 65,000 copies of DNA derived from A. baumannii, E. coli, H. influenzae, K. pneumoniae, M. tuberculosis, M. catarrhalis, P. aeruginosa, P. mirabilis, S. aureus, S. pneumoniae, or human DNA. We set 65,000 copies as the target and used 32,500 copies, 16,250 copies, 8,125 copies, 4,063 copies, and 2,032 copies as competitors. Samples were prepared by arbitrarily selecting and mixing one target and one competitor of 32,500 copies, one competitor of 16,250 copies, and so on. These samples were then amplified using the PCR primers set in this study, sequenced with MiSeq, and analysed with Cheryblast + ob.
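The copy numbers in this series and the composition of one mixture can be reproduced with a short sketch. It assumes that non-integer copy numbers are rounded up, which matches the values 4,063 and 2,032 given above, and it picks competitors at random as a stand-in for the arbitrary selection described; the function names are hypothetical.

```python
import math
import random
from typing import Dict, List


def twofold_series(start_copies: int, steps: int) -> List[int]:
    """Copy numbers for a two-fold dilution series, rounded up."""
    return [math.ceil(start_copies / 2 ** k) for k in range(steps + 1)]


def design_mixture(genomes: List[str], target: str,
                   rng: random.Random) -> Dict[str, int]:
    """Assign the target the top level and one competitor to each lower level."""
    levels = twofold_series(65000, 5)
    others = [g for g in genomes if g != target]
    competitors = rng.sample(others, len(levels) - 1)
    return dict(zip([target] + competitors, levels))


if __name__ == "__main__":
    genomes = ["A. baumannii", "E. coli", "H. influenzae", "K. pneumoniae",
               "M. tuberculosis", "M. catarrhalis", "P. aeruginosa",
               "P. mirabilis", "S. aureus", "S. pneumoniae", "human"]
    print(twofold_series(65000, 5))
    print(design_mixture(genomes, "E. coli", random.Random(0)))
```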
Results
Bacteria
A literature search was conducted to identify 37 representative bacterial species known to cause pneumonia (Supplementary Table S1)1. Additionally, four representative oral α-hemolytic streptococci were selected to distinguish them from S. pneumoniae (Supplementary Table S1).
Analysis flowchart
The overall analysis process is illustrated in Fig. 1. DNA extracted from airway secretions is used to amplify 16S rRNAs and the human HSD11B2 gene through multiplex PCR, followed by nucleotide sequencing using the next-generation sequencer, MiSeq. The determined nucleotide sequences are matched against a local database containing the 16S rRNA sequences of 41 bacteria and the human HSD11B2 gene sequence (Supplementary Table S2) using the NCBI BLAST program. Because 16S rRNA sequences are highly similar among species within the Streptococcus and Mycobacterium genera, PCR amplicons are initially classified as belonging to the Streptococcus genus, the Mycobacterium genus, other individual bacterial species, the human HSD11B2 gene, or as unmatched sequences. Streptococcus and Mycobacterium are then further classified at the species level based on the presence of species-specific nucleotide sequences.
Fig. 1.
Analysis flowchart. A flowchart from the DNA extraction to the application of the battlefield hypothesis. Each bacterium is differentiated using the bit score from NCBI BLAST. Streptococci and Mycobacteria are further differentiated using species-specific sequences. In this study, we optimized our program using 16S rRNA sequences downloaded from the NCBI website and simulated the post-DNA extraction steps using bacterial DNA purchased from ATCC or commercially available human genomic DNA.
Design of PCR primers
16S rRNA sequences from 41 bacterial species, including 4 types of α-hemolytic streptococci, were obtained from GenBank, totaling 20,309 sequences. For each species, 1,000 sequences were retrieved, or the maximum available number if fewer were present. The consensus sequence of the 16S rRNA was determined for each bacterial species, and the consensus sequences of all bacterial species were then aligned (Fig. 2; Supplementary Table S3). Regions with high proportions of identical nucleotides were identified as potential primer sites. Regions I and II, where the amplicon length was within the sequencing capability of MiSeq (less than 600 bp), were selected as the amplicon candidates (Fig. 2). The nucleotide sequence between the primers needed to exhibit sufficient variability to distinguish each bacterium. Further analysis revealed that Region II did not provide adequate differentiation for some bacteria (data not shown); Region I was therefore selected, and the final PCR primers were designed for it (Supplementary Table S4).
Fig. 2.
16S rRNA homology. The fraction of the most frequent nucleotide at each position after aligning the consensus sequences of 41 bacterial species’ 16S rRNA using Clustal Omega. In regions where one nucleotide is conserved across all bacteria, the fraction of the most frequent nucleotide is 1.0. If there are no gaps and the four nucleotides A, G, C, and T are present in equal proportions, the fraction of the most frequent nucleotide is 0.25. In positions where gaps occur in many bacteria, the fraction of the most frequent nucleotide becomes less than 0.25. Below the scale, the commonly used V1–V2 and V3–V4 regions are indicated, along with the positions of representative primers that amplify each region8.
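The per-position measure described in this caption can be sketched as follows. The sketch assumes that gaps are written as "-" in the alignment and that they count in the denominator but not as a nucleotide, which is what allows the fraction to fall below 0.25; the function name and the toy alignment are hypothetical.

```python
from collections import Counter
from typing import List


def most_frequent_nucleotide_fraction(aligned: List[str]) -> List[float]:
    """Fraction of sequences carrying the most frequent nucleotide per column.

    Assumption: '-' marks a gap; gaps are counted in the denominator
    (number of sequences) but never as the most frequent nucleotide.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    n = len(aligned)
    fractions = []
    for column in zip(*aligned):
        counts = Counter(base for base in column if base != "-")
        fractions.append(max(counts.values(), default=0) / n)
    return fractions


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGA", "AC-C", "AT-G"]  # hypothetical alignment
    print(most_frequent_nucleotide_fraction(toy_alignment))
```

Columns with values close to 1.0 over a primer-length stretch are the kind of region the text treats as candidate primer sites.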
Out of the 20,309 copies obtained from GenBank, 19,957 copies had sequences long enough to allow a search for matches with the designed primers. The match rate between the designed primers and the 16S rRNA was 0.995 for a perfect match and 0.998 when including a single base mismatch. We concluded that the designed primers are suitable for amplifying the entire target sequence.
Subsequently, the segments corresponding to the Region I amplicon from the consensus sequences of 16S rRNA for each bacterial species, along with the nucleotide sequence of the HSD11B2 gene flanked by its primers, were collected and a local database for BLAST query was created (Supplementary Table S2).
Differentiation algorithm
We extracted one bacterial consensus sequence saved in the local database and compared it using BLAST with 20,309 copies of 16S rRNA, which were obtained from 41 bacterial species, including 4 α-hemolytic streptococci, all downloaded from GenBank (Supplementary Table S1). We then analysed the distribution of the bit scores. The result obtained using the E. coli consensus sequence is shown schematically in Fig. 3A. The bit score is highest when the consensus is compared to E. coli's own 16S rRNA, but due to the variation in 16S rRNA within the E. coli species, the graph exhibits a tail on the left side. The second peak is formed by the 16S rRNA of K. pneumoniae, which is highly homologous to that of E. coli; its maximum bit score is 780. This indicates that if the bit score between an unknown 16S rRNA and the E. coli consensus sequence is 781 or higher, then it can be concluded that the 16S rRNA belongs to E. coli. We repeated the same steps for each consensus sequence in the local database, determining the bit score threshold for identifying each bacterium.
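The threshold rule in this paragraph can be written as a minimal sketch. It assumes that bit scores are available as whole numbers, as in the 780 and 781 example, and that the threshold is one above the best score achieved by any other species. The function names are hypothetical, and every score in the demonstration other than 780 is invented for illustration.

```python
from typing import Dict, List


def bit_score_threshold(scores_by_species: Dict[str, List[int]],
                        target: str) -> int:
    """Lowest bit score that no non-target species reaches."""
    best_other = max(score
                     for species, scores in scores_by_species.items()
                     if species != target
                     for score in scores)
    return best_other + 1


def fraction_at_or_above(scores: List[int], threshold: int) -> float:
    """Share of the target's own sequences that would still be accepted."""
    return sum(score >= threshold for score in scores) / len(scores)


if __name__ == "__main__":
    # Bit scores against the E. coli consensus; only 780 comes from the text.
    toy_scores = {
        "E. coli": [900, 895, 850, 770],
        "K. pneumoniae": [780, 760],
        "S. aureus": [300],
    }
    threshold = bit_score_threshold(toy_scores, "E. coli")
    print(threshold, fraction_at_or_above(toy_scores["E. coli"], threshold))
```

The second function shows the trade-off behind the left-hand tail in Fig. 3A: target sequences whose score falls below the threshold are the ones that reduce sensitivity.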
Fig. 3.
Schematic diagram of the distribution of bit scores. (A) When 16S rRNA sequences from various bacteria are used as queries and the 16S rRNA consensus sequence of E. coli is used as the subject in a BLAST comparison. (B) When S. pneumoniae is used as the subject.
The 16S rRNA sequences of S. pneumoniae and oral α-hemolytic streptococci are highly homologous, so it is difficult to distinguish between them using bit scores (Fig. 3B). Therefore, we adopted a strategy in which we first separate the four species of α-hemolytic streptococci together with S. pneumoniae as a group from other bacterial species, and then differentiate S. pneumoniae from α-hemolytic streptococci using a species-specific nucleotide sequence. The bacterium with the highest bit score outside the group was S. pyogenes, with a maximum bit score of 632. Therefore, when the bit score is 633 or higher, the 16S rRNA sequence is from either S. pneumoniae or α-hemolytic streptococci. Further differentiation between S. pneumoniae and the others was then performed using an S. pneumoniae-specific sequence (Table 1). The sequence is not present in the four species analysed here, nor in six other frequently isolated species of oral α-hemolytic streptococci (as explained in Table 1), suggesting that S. pneumoniae can be identified with high specificity.
Table 1.
Species-specific sequence.
Streptococcus genus | |||||
---|---|---|---|---|---|
Species-specific sequence | TGCACTTGCA | ||||
Position in Fig. 2# | 233–242 | ||||
S. pneumoniae (n = 1000) | 1000 | ||||
α-hemolytic streptococci (n = 482) * | 0 | ||||
Mycobacterium genus | |||||
Species-specific sequence | CCTCTTCGGA | TGTCCTGTGGT and GGTGATGG | CGGGGGTACT and ATAGGACCTT | GACACTCGAG | AAAGCGCTTT |
Position in Fig. 2# | 75–100 | 206–229 and 284–291 | 97–106 and 188–197 | 101–110 | 232–241 |
M. avium (n = 84) | 84 | 0 | 0 | 0 | 0 |
M. gordonae (n = 62) | 0 | 62 | 0 | 0 | 0 |
M. intracellulare (n = 78) | 0 | 0 | 78 | 0 | 0 |
M. kansasii (n = 8) | 0 | 0 | 0 | 8 | 0 |
M. tuberculosis (n = 634) | 0 | 0 | 0 | 0 | 634 |
*α-hemolytic streptococci include S. mitis, S. oralis, S. salivarius, and S. sanguinis. Other α-hemolytic streptococci, namely S. anginosus (n = 146), S. constellatus (n = 60), S. gordonii (n = 75), S. intermedius (n = 63), S. parasanguinis (n = 32), and S. vestibularis (n = 7), were additionally investigated, and “TGCACTTGCA” was found to be specific to S. pneumoniae.
#The locations of species-specific sequences were indicated using nucleotide positions in Fig. 2; all are located within the V1–V2 region.
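The second classification stage can be sketched using the sequences listed in Table 1. The sketch assumes that reads are in the same orientation as the table, that the Streptococcus bit score is measured against the S. pneumoniae consensus, and that a marker may be found by a simple substring search anywhere in the read; the actual program may instead check the positions given in the table. Function and constant names are hypothetical.

```python
from typing import Optional

STREP_GROUP_MIN_BIT_SCORE = 633   # one above the S. pyogenes maximum of 632
PNEUMONIAE_MARKER = "TGCACTTGCA"

# Species-specific sequences from Table 1; two entries mean both are required.
MYCOBACTERIUM_MARKERS = {
    "M. avium": ("CCTCTTCGGA",),
    "M. gordonae": ("TGTCCTGTGGT", "GGTGATGG"),
    "M. intracellulare": ("CGGGGGTACT", "ATAGGACCTT"),
    "M. kansasii": ("GACACTCGAG",),
    "M. tuberculosis": ("AAAGCGCTTT",),
}


def classify_streptococcus(read: str, bit_score: int) -> Optional[str]:
    """Split the Streptococcus group into S. pneumoniae and the rest."""
    if bit_score < STREP_GROUP_MIN_BIT_SCORE:
        return None  # not in the S. pneumoniae / alpha-hemolytic group
    if PNEUMONIAE_MARKER in read:
        return "S. pneumoniae"
    return "alpha-hemolytic streptococci"


def classify_mycobacterium(read: str) -> Optional[str]:
    """Return the single species whose markers are all present, else None."""
    hits = [species for species, markers in MYCOBACTERIUM_MARKERS.items()
            if all(marker in read for marker in markers)]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    print(classify_streptococcus("AAAATGCACTTGCACCCC", 700))
    print(classify_mycobacterium("TTTGACACTCGAGTTT"))
```

Because the marker is only ten bases long, a single sequencing error inside it moves an S. pneumoniae read into the α-hemolytic group, which is consistent with the sensitivity loss reported for S. pneumoniae under simulated errors.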
Using this method, we were able to distinguish 37 species of bacteria plus α-hemolytic streptococci with a sensitivity of > 0.996 and a specificity of 1.000 (Fig. 4).
Fig. 4.
The impact of sequencing errors on 16S rRNA classification. (A) Sensitivity. (B) Specificity.
Impact of reaction errors
Both PCR and next-generation sequencing introduce errors. Errors in PCR can be almost eliminated by using a high-fidelity DNA polymerase. However, after loading on the MiSeq, an error rate of approximately 0.5 bases per 500 bp amplicon needs to be expected9. We introduced mutations in the range of 0.0–5.0 bases per 500 bases into the 20,309 copies of the 16S rRNA sequence to examine how sensitivity and specificity are affected (Fig. 4). In S. pneumoniae, a decrease in sensitivity and specificity was observed due to misidentification with α-hemolytic streptococci. In actual identification of the causative bacteria, numerous 16S rRNA molecules are sequenced. Even if a few 16S rRNA sequences of S. pneumoniae are classified as α-hemolytic streptococci, this would not affect the interpretation that S. pneumoniae is the causative agent of pneumonia.
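Sensitivity and specificity for a given species can be computed from the known and assigned labels of each sequence. The text does not spell out the calculation, so the sketch below assumes the standard one-versus-rest definitions; the function name and the toy labels are hypothetical.

```python
from typing import List, Tuple


def sensitivity_specificity(true_labels: List[str],
                            predicted_labels: List[str],
                            species: str) -> Tuple[float, float]:
    """One-versus-rest sensitivity and specificity for a single species."""
    tp = fn = fp = tn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == species:
            if predicted == species:
                tp += 1   # correctly assigned to the species
            else:
                fn += 1   # missed sequence of the species
        else:
            if predicted == species:
                fp += 1   # other sequence wrongly assigned to the species
            else:
                tn += 1   # other sequence correctly kept out
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    truth = ["S. pneumoniae", "S. pneumoniae", "S. pneumoniae", "S. mitis", "S. mitis"]
    calls = ["S. pneumoniae", "S. pneumoniae", "S. mitis", "S. mitis", "S. mitis"]
    print(sensitivity_specificity(truth, calls, "S. pneumoniae"))
```

In the toy example, one S. pneumoniae sequence is assigned to S. mitis, which lowers the sensitivity for S. pneumoniae and the specificity for S. mitis at the same time.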
A simulated reaction using DNA mixtures mimicking airway secretions
The functionality of the PCR primers and algorithm developed in this study was verified by conducting a simulated reaction (Fig. 5). A two-fold dilution series was prepared starting with 65,000 copies of DNA from A. baumannii, E. coli, H. influenzae, K. pneumoniae, M. tuberculosis, M. catarrhalis, P. aeruginosa, P. mirabilis, S. aureus, S. pneumoniae, or human DNA. A solution containing 65,000 copies of DNA from one genome was chosen as the target and mixed with various competitors at different concentrations (32,500 copies, 16,250 copies, and lower). In this setting, the targets represent pneumonia-causing pathogens, and the competitors represent commensal bacteria from the oral cavity and pharynx, simulating a condition in which respiratory secretions contain both simultaneously. In cases where multiple causative pathogens are present at comparable levels, such as in mixed infections, we expect that our method will be capable of detecting more than one pathogen. The PCR primers designed in this study, along with MiSeq sequencing and the Cheryblast + ob algorithm, effectively identified the target DNA even in the presence of competing sequences. Furthermore, the HSD11B2 gene, which is relevant to the battlefield hypothesis, was clearly differentiated from bacterial 16S rRNA. These results confirm that the PCR primers and algorithm developed in this study functioned effectively across different DNA mixtures.
Fig. 5.
A simulated reaction using DNA mixtures mimicking airway secretions. The genomic DNA from 65,000 copies of the target species and genomic DNA from 32,500 copies, 16,250 copies, 8,125 copies, 4,063 copies, and 2,032 copies of other species (competitors) were mixed, amplified, sequenced, and classified using the method described in this report. The graph shows the proportion of the next-generation sequencer reads attributed to the target and each competitor. We consider that, in the cases of K. pneumoniae, M. tuberculosis, and S. aureus, the comparable number of reads between the target and competitor is likely due to a slight difference in PCR amplification efficiency, with the competitor being amplified more efficiently. We uploaded the FASTQ files obtained from the simulated reactions to 10.5281/zenodo.14759736.
Discussion
In this study, we developed primer sets and a computer program to identify 37 major causative agents of pneumonia using 16S rRNA sequences determined by next-generation sequencing. The program is characterized by the following features: (1) it accommodates 16S rRNA variations within the same species, (2) it accurately differentiates between S. pneumoniae and oral α-hemolytic streptococci, and (3) it is ready for the application of the battlefield hypothesis. The program accurately identified the 37 major causative agents of pneumonia, achieving a sensitivity greater than 0.996 and a specificity of 1.000, based on the analysis of 20,309 copies of 16S rRNA sequences from the 37 pathogens that are frequent causes of pneumonia and 4 types of α-hemolytic streptococci. Additionally, the program was robust against sequencing errors. In simulated experiments using the genomes of 10 major causative agents of pneumonia and the human genome, the program accurately detected the target genomes.
Although the 16S rRNA sequence is highly conserved, it is not possible to create PCR primers that are common to all bacteria. Mixing multiple primers, however, reduces the likelihood of achieving a perfect match with the target 16S rRNA, creating uncertainty about whether the target will be efficiently amplified. Nevertheless, the primer mix used in this study performed well and presents a legitimate candidate system.
A BLAST search in the GenBank database using a 16S rRNA sequence often detects many bacteria that have high bit scores but are unlikely to be the causative pathogens, making interpretation difficult. The Cheryblast + ob algorithm addresses this issue by first preparing a list of bacteria likely to cause pneumonia and then selecting the most probable candidates from the list using the 16S rRNA sequence. Classification initially relies on global similarity, which is less affected by sequencing errors from next-generation sequencers, followed by an additional local sequence search if needed. This approach is particularly effective when the list consists of several dozen bacterial species, as is the case for pneumonia-causing pathogens. If no bacteria that explain the clinical findings are identified, rare or unknown pathogens should be considered.
The V3–V4 hypervariable region of the 16S rRNA gene is widely used for taxonomic identification. However, we used Region I, which includes the entire V1–V2 region and part of the V3–V4 region. This choice was based on the sequencing capabilities of the MiSeq next-generation sequencer used in this study, which can determine nucleotide sequences up to 600 bp in length. Region I was designed to leverage this read length to enable species-level identification of the 37 bacterial species analysed in this study. The white portion within the blue-colored barcodes shown in Fig. 2 represents regions of low sequence similarity among the bacteria targeted in this study. Region I includes low-similarity segments from both the V1–V2 and V3–V4 regions. Furthermore, as shown in Table 1, the species-specific sequences present in the V1–V2 region enabled accurate discrimination of bacteria at the species level. Therefore, Region I can be considered to combine the strengths of both the V1–V2 and V3–V4 regions. As the number of bacterial species increases, further investigation will be needed to determine how well Cheryblast + ob can continue to distinguish among them. Nonetheless, since Region I combines the advantages of both V1–V2 and V3–V4 regions, it is likely that, when used in conjunction with species-specific sequences, Cheryblast + ob will be capable of accurately differentiating a wide variety of bacterial combinations.
As pneumonia progresses or the causative bacteria are removed by antibiotic treatment, various bacteria may begin to appear in airway secretions due to secondary infections10. Evaluating the contribution of these bacteria to the clinical presentation is extremely difficult, and no standardized method for this assessment has been established. Therefore, this study focuses solely on investigating the causative bacteria at the onset of pneumonia.
In the present study, we only demonstrated analyses using simulated reactions with DNA mixtures mimicking airway secretions. Whether the algorithm developed in this study will function effectively in real clinical specimens, which often contain various contaminants, remains to be clarified in future research. In our previous studies, we extracted nucleic acids and conducted PCR analyses from numerous clinical specimens of patients with community-acquired pneumonia and otitis media6,7,11,12. Since both 16S rRNA and human DNA were simultaneously and successfully amplified by PCR from the nucleic acids extracted in those studies, we believe that nucleic acid extraction should be performed following the methods described in these papers. Once nucleic acids are extracted and PCR is completed, the subsequent analysis can be carried out according to the methods described in this paper. Based on the results of the simulated reactions, favorable outcomes can be expected for many clinical specimens.
Conclusions
In this study, we established PCR primers and a classification algorithm to accurately identify the causative bacteria of pneumonia from the airway secretions of patients with community-acquired pneumonia. Furthermore, we confirmed that the method works well through simulation experiments. Applying the battlefield hypothesis in clinical settings will require a large clinical trial, which will be addressed in a separate study. This study presents a promising method for accurately classifying several dozen bacterial species.
Limitations
This study has the following limitations:
Dependence on a limited list of bacteria
The algorithm was developed and validated using a predefined list of causative bacterial species. While this approach likely captures most clinically relevant pathogens responsible for community-acquired pneumonia, it may fail to detect rare or emerging organisms not included in the reference set.
Lack of clinical validation
Although the method demonstrated promising results in simulation and in silico analyses, it has not yet been validated in clinical settings. Prospective studies using actual patient samples are needed to evaluate its diagnostic performance in real-world practice.
Uncertainty regarding applicability to more complex data
This study focused on relatively controlled datasets. The algorithm’s robustness and accuracy when applied to more complex and heterogeneous clinical data, such as those involving rare opportunistic pathogens, remain to be determined.
Acknowledgements
We would like to express our gratitude to Akemi Okumura and Miyuki Ohtsuka for their administrative support in this research.
Abbreviations
- 16S rRNA: 16S ribosomal RNA
- NGS: Next-generation sequencing
Author contributions
F.D.K. and K.H. contributed to the conception and design of this study. F.D.K., D.A., M.S., M.H., and Y.I. were involved in data curation, formal analysis, investigation, and methodology. F.D.K. and K.H. were responsible for project administration, resource management, software development, and supervision. F.D.K. and K.H. drafted the manuscript, while all authors contributed to its review and editing.
Data availability
The software is deposited at https://github.com/hagiwark1957/Cheryblast_plus_ob. The datasets generated and/or analysed during the current study are available in the Zenodo repository, https://doi.org/10.5281/zenodo.14759736, and in the Sequence Read Archive (SRA) repository under accession number PRJNA1299243.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. File, T. M. Community-acquired pneumonia. Lancet 362, 1991–2001 (2003).
- 2. American Thoracic Society; Infectious Diseases Society of America. Guidelines for the management of adults with hospital-acquired, ventilator-associated, and healthcare-associated pneumonia. Am. J. Respir. Crit. Care Med. 171, 388–416 (2005).
- 3. Levison, M. E. Pneumonia, including necrotizing pulmonary infections (lung abscess). In: Harrison’s Principles of Internal Medicine, 15th Edition. McGraw-Hill, 1457–1464 (2001).
- 4. Hirama, T. et al. Prediction of the pathogens that are the cause of pneumonia by the battlefield hypothesis. PLoS One 6, e24474 (2011).
- 5. Hirama, T. et al. HIRA-TAN: a real-time PCR-based system for the rapid identification of causative agents in pneumonia. Respir. Med. 108, 395–404 (2014).
- 6. Kurniawan, F. D., Alia, D., Priyanto, H., Mahdani, W. & Hagiwara, K. HIRA-TAN detects pathogens of pneumonia with a progressive course despite antibiotic treatment. Respir. Investig. 57, 337–344 (2019).
- 7. Shoar, S. & Musher, D. M. Etiology of community-acquired pneumonia in adults: a systematic review. Pneumonia (Nathan) 12, 11 (2020).
- 8. Lee, H. B., Jeong, D. H., Cho, B. C. & Park, J. S. Comparative analyses of eight primer sets commonly used to target the bacterial 16S rRNA gene for marine metabarcoding-based studies. Front. Mar. Sci. 10, 1199116. 10.3389/fmars.2023.1199116 (2023).
- 9. Schirmer, M. et al. Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform. Nucleic Acids Res. 43, e37 (2015).
- 10. Musher, D. M., Abers, M. S. & Bartlett, J. G. Evolving understanding of the causes of pneumonia in adults, with special attention to the role of pneumococcus. Clin. Infect. Dis. 65, 1736–1744 (2017).
- 11. Hirama, T. et al. PCR-based rapid identification system using bridged nucleic acids for detection of clarithromycin-resistant Mycobacterium avium-M. intracellulare complex isolates. J. Clin. Microbiol. 54, 699–704 (2016).
- 12. Alia, D., Kurniawan, F. D., Ridwan, A., Mahdani, W. & Hagiwara, K. Validating quantitative polymerase chain reaction assay for the molecular diagnosis of chronic suppurative otitis media. Open Access Maced. J. Med. Sci. 8. 10.3889/oamjms.2020.3886 (2020).