Abstract
Background
Differential etiologies of pediatric acute febrile respiratory illness pose challenges for all populations globally, but especially in malaria-endemic settings because the pathogens responsible overlap in clinical presentation and frequently occur together. Rapid identification of bacterial pneumonia with high-quality diagnostic tools would enable appropriate, point-of-care antibiotic treatment. Current diagnostics are insufficient, and the discovery and development of new tools is needed. We report a unique biomarker signature identified in blood samples to accomplish this.
Methods
Blood samples from 195 pediatric Mozambican patients with clinical pneumonia were analyzed with an aptamer-based, high-dynamic-range, quantitative assay (~1200 proteins). We identified new biomarkers using a training set of samples from patients with established bacterial, viral, or malarial pneumonia. Proteins with significantly variable abundance across etiologies (false discovery rate <0.01) formed the basis for predictive diagnostic models derived from machine learning techniques (Random Forest, Elastic Net). Validation on a dedicated test set of samples was performed.
Results
Protein abundances differed significantly between bacterial and viral infections (219 proteins) and between bacterial infections and mixed (viral and malaria) infections (151 proteins). Predictive models achieved >90% sensitivity and >80% specificity, regardless of number of pathogen classes. Bacterial pneumonia was strongly associated with neutrophil markers, in particular markers of degranulation, including HP, LCN2, LTF, MPO, MMP8, PGLYRP1, RETN, SERPINA1, S100A9, and SLPI.
Conclusions
Blood protein signatures highly associated with neutrophil biology reliably differentiated bacterial pneumonia from other causes. With appropriate technology, these markers could provide the basis for a rapid diagnostic for field-based triage for antibiotic treatment of pediatric pneumonia.
Keywords: malaria, pediatric, pneumonia, biomarker, diagnostic
Identification of new biomarkers to distinguish between viral and bacterial pneumonia is needed. These markers could provide the basis for a rapid diagnostic for field-based triage for antibiotic treatment of pediatric pneumonia.
Pediatric febrile respiratory illness is a leading cause of mortality and morbidity globally. Identifying the etiology, whether bacterial [1], viral, or (less commonly) malarial [2, 3], is crucially important but difficult because the clinical presentations are similar. The critical need globally is to identify bacterial infections [3] so that they can be treated appropriately, reducing mortality [4, 5]. Current diagnostic tests impede rapid bacterial diagnosis: laborious microbiological culture and molecular testing methods, where available, often lack the sensitivity to detect bacterial pathogens [6], as do radiological evaluations (chest X-ray or ultrasound), which are similarly limited in availability. Malaria or viral infections commonly occur together with secondary bacterial coinfections, further complicating a specific, treatable diagnosis [7, 8].
Host cellular responses to bacterial, viral, and malaria infections are distinct, being chiefly neutrophilic, lymphocytic, or monocytic, respectively, and represent prime targets as diagnostic indicators. To date, these approaches are not sufficiently reliable [9–13], based on a recommended benchmark [14] of thresholds for sensitivity (desirable, ≥95%; acceptable, ≥90%) and specificity (desirable, ≥90%; acceptable, ≥80%). We hypothesized that the distinctive cellular host responses could be detected at the protein level. We tested this hypothesis based on the differential expression of proteins in pediatric febrile respiratory illness blood specimens from southern Mozambique, where malaria is endemic. Febrile respiratory illness cases were classified by available gold standards, using highly specific case definitions, to 1 of 3 underlying causes (bacteria, viruses, or malaria) or to a combination (“mixed infections”). Proteins were assayed with SOMAScan technology (Somalogic, Boulder, CO), an array-based modified aptamer platform covering a range of biological pathways including inflammation, signal transduction, and immune processes. This quantitative assay measures approximately 1200 proteins simultaneously, offers a high dynamic range, and has modest sample requirements (150 μL plasma) [15].
The resulting protein expression data were used to create machine learning–based models for distinguishing bacterial from viral or malaria infections. The same data, along with data from our prior RNA- and protein-based studies [9, 13], provided the basis for pathway analyses, to help confirm the underlying biology of the host response.
METHODS
Study Design
The study recruited 2 groups of children (<10 years of age) at the Manhiça District Hospital in Mozambique as follows: (1) children with febrile respiratory illness admitted to the hospital fulfilling the “clinical pneumonia” criteria (as defined by the World Health Organization [WHO]) and (2) afebrile and asymptomatic healthy community controls used to establish a baseline. Febrile respiratory illness cases were assigned by all available gold-standard tests to 1 of 3 underlying causes—bacteria, viruses, or malaria—or to a combination (“mixed infections”).
Study Population and Sample Classification Procedure
Children with fever at admission (>37.5°C axillary temperature) or prior 24-hour history of fever meeting the WHO case definition for clinical pneumonia (increased respiratory rate and cough or difficulty breathing) [16] were selected for the study. Informed consent was obtained from parents/guardians. All children underwent anteroposterior chest radiography; images were independently interpreted following the WHO-recommended guidelines for pneumonia diagnosis by 2 experienced clinicians [17].
Patients were classified as having clinical pneumonia associated with bacterial, malaria, or viral infection using the criteria described in Valim et al [13], with minor modifications. In brief, patients were classified as bacterial pneumonia when pathogenic bacteria were isolated (or detected through reverse transcription–polymerase chain reaction [RT-PCR]) from blood or pleural exudate, and after confirming the absence of malarial infection. Viral pneumonia required the detection in the nasopharyngeal aspirate (NPA) of a viral respiratory pathogen, no isolated bacteria in the blood culture or RT-PCR, no “endpoint pneumonia” in the chest X-ray, and negative malaria microscopy. Finally, a malaria case required a positive malaria smear microscopy (according to predetermined parasitemia thresholds in relation to age [18]), normal chest X-ray, and no detectable bacterial infection. We analyzed our case definitions against ALMANACH criteria (Supplementary Material, Supplementary Methods and Supplementary Table 7).
To address the known insensitivity of blood culture for bacterial pneumonia, cases were also assigned a bacterial etiology if the NPA was negative for virus but the patient had leukocytosis and a dense radiographic consolidation (endpoint pneumonia) based on consensus of 2 independent experts. Since NPAs are often positive on RT-PCR for potential viral respiratory pathogens even in clinically well children, the detection of a virus in the NPA did not alter the class assignments for confirmed bacterial or malarial cases. See Supplementary Figure 1 for a comprehensive flowchart for patient classification.
In addition, patient samples with mixed infections were also included in the study (for details see Supplementary Table 3). “Virus and probable bacterial secondary coinfection” samples were virus positive, culture- and PCR-negative for bacteria but with leukocytosis and radiographic endpoint pneumonia, suggestive of a secondary bacterial infection.
SOMAScan Protein Assay
The SOMAScan assay uses SOMAmers (Slow Off-rate Modified Aptamers) to capture proteins and translates binding events into signals measured in relative fluorescence units, which are directly proportional to target protein abundance in the sample, calculated by a standard curve generated for each protein–SOMAmer pair. The dynamic range is enhanced by 3 serial dilutions, with the least concentrated dilution used to quantify the most abundant proteins (approximate micromolar concentration in the original sample), and the most concentrated used for the least abundant proteins (femtomolar to picomolar concentration) [15]. Samples were assayed in 2 batches (15 samples replicated to verify consistency); the SOMAScan assays used in the first set of 167 samples quantified 1129 proteins and the SOMAScan assay used in the second set of 49 samples quantified 1279 proteins. In the 2 batches, 96.4% (161/167) and 100% (49/49) of samples passed Somalogic normalization acceptance criteria.
We use Somalogic protein marker labels throughout this article (Supplementary Data File 1 provides full protein names).
Protein Marker Selection and Predictive Model Building
Selection was based on statistical significance of differences in marker abundance between the bacteria versus virus (BvV) and bacteria versus malaria or virus (BvVM) comparisons. Classifiers discriminated (1) BvV and (2) BvVM using the 219 and 151 statistically significant (false discovery rate [FDR] <0.01) markers, respectively, and their corresponding surrogates. Using optimal subsets of N-protein markers (n = 5, 10, 15, 25, 50, 100) identified using genetic algorithms, 2-class Random Forest (RF) and Elastic Net (EN) models were constructed, achieving predictive results with high sensitivity and specificity with a small subset of markers (see Figure 1) (details in the “Data Analysis Pipeline” in the Supplementary Appendix and Supplementary Figure 3).
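The FDR <0.01 filter described above can be illustrated with a short sketch. The text does not name the FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption; the function names, marker names, and P values are hypothetical and are not study data.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_markers(names, p_values, fdr=0.01):
    """Keep markers whose adjusted P value is below the FDR threshold."""
    adjusted = benjamini_hochberg(p_values)
    return [name for name, q in zip(names, adjusted) if q < fdr]


# Hypothetical P values for three placeholder markers (not study data).
example = select_markers(["marker_a", "marker_b", "marker_c"],
                         [0.001, 0.5, 0.002])
print(example)
```

In this toy input, the first and third markers pass the threshold and the second does not.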
Figure 1.
Data analysis workflow. A total of 210 samples passed QC on the SOMAScan assay to quantify 1107 proteins. The 171 single etiology samples were classified as malaria, virus, or bacteria and included 12 repeats that were randomly split between the training and validation datasets; the 4 repeats that ended up in the validation dataset were excluded from downstream analysis. The remaining 39 samples consisted of 16 healthy community controls and 23 samples with mixed etiology. Single etiology samples were divided into a training set of 120 and a validation set of 47 samples. The training data were used for identifying differentially expressed markers between bacteria and virus, or bacteria and malaria or virus samples. Genetic algorithms were used to select the best 5, 10, 15, 25, 50, and 100 markers. Classifiers for BvV and BvVM were trained using RF and EN algorithms. Models were tuned using cross-validation, and final model performance was assessed using the validation data. In order to contend with the situation where a marker is unavailable (eg, due to difficulty in measuring the marker in a clinical setting), we determined a set of surrogate markers for each differential marker using information correlation, a criterion based on mutual information. We then assessed model performance when 10% or 20% of differential markers were substituted with their corresponding surrogates. Abbreviations: BvV, bacteria vs virus; BvVM, bacterial vs malaria or virus; EN, Elastic Net; QC, quality control; RF, Random Forest.
Biological Processes and Pathways
To better understand the biological significance of the differentially expressed proteins, differential markers were used as input to the Metascape Gene Annotation and Analysis Resource (http://metascape.org) to query multiple ontology resources including KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway, Gene Ontology (GO) Biological Processes, Reactome Gene Sets, Canonical Pathways, and CORUM. Both 3-way (bacteria vs malaria vs virus) and binary (BvV) comparisons were explored (see Supplementary Appendix for details).
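Term enrichment of this kind is commonly scored with a hypergeometric (over-representation) test. The sketch below illustrates that test only; it is not a description of Metascape internals, and the function name and the counts in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(background, annotated, selected, hits):
    """P(X >= hits) when drawing `selected` markers from `background`,
    of which `annotated` carry the term of interest."""
    total = comb(background, selected)
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(background - annotated, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 assayed proteins, 4 annotated to a term,
# 3 differential markers, 2 of which carry the annotation.
p = enrichment_p_value(background=10, annotated=4, selected=3, hits=2)
print(round(p, 3))
```

A small P value would indicate that the term is carried by more differential markers than expected by chance, given the assayed background.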
Comparative Marker Analysis Between Technologies
To assess whether markers identified as indicating bacterial infection were consistent across technology platforms, extensive comparisons were made between this and 2 previous marker studies of the same patients (RNA-sequencing and multiplex bead-based protein immunoassays); both studied different but overlapping samples within the same study population (see details in the Supplementary Appendix).
RESULTS
Patient Characteristics
Between July 2010 and November 2014, 576 patients were recruited as inpatients, along with 117 community controls. A total of 195 patients under 10 years of age with acute febrile respiratory illness met the stringent inclusion criteria and were included in this analysis. To identify differentially expressed proteins between underlying etiologies, patients were characterized as having bacterial (69 patients), malaria (42 patients), viral (48 patients), or mixed (23 patients) infections (for details see Supplementary Table 3). Thirteen healthy subjects were included as controls. The classification scheme was similar to that previously described (see Supplementary Figure 1 for a patient classification flowchart) [9, 13]. No significant differences in age, sex, weight, height, nutritional status, or duration of hospital admission were observed between bacterial, viral, and malaria sample sets (see Table 1 and Supplementary Table 1 for patient demographic and disease characteristics). Case-fatality rates were high (6%) for the bacterial group, whereas none of the malaria or viral cases died. Malnutrition was highly prevalent in all 3 groups, and human immunodeficiency virus (HIV) prevalence was also high, although significantly higher in the bacterial group. Bacterial cases had the highest leukocyte count and respiratory rates. Malaria cases were the most anemic, had the highest mean axillary temperature, and had the lowest respiratory rates. Viral cases had the lowest leukocyte count, had the lowest mean axillary temperature, and were the least anemic. Neutrophil levels were significantly higher for the bacterial etiology, but the overlap between etiologies was too great for this to serve as a classifier.
Table 1.
Patient Demographic and Disease Characteristics at Admission
| | Bacteria and PCR Bacteria: Values | n | Malaria: Values | n | Virus: Values | n | P a |
|---|---|---|---|---|---|---|---|
| Features on admission (signs, symptoms, and laboratory results) | |||||||
| Age, mean (SD), m | 29.7 (29.4) | 69 | 26.3 (23.3) | 42 | 19.4 (19.9) | 48 | .12 |
| Female sex, n (%) | 24 (45) | 69 | 28 (67) | 42 | 23 (48) | 48 | .07 |
| Clinical examination results on arrival, mean (SD) | |||||||
| Weight, kg | 10.2 (4.8) | 69 | 10.3 (4.6) | 42 | 9 (3.4) | 48 | .44 |
| Height, cm | 80.3 (18.7) | 69 | 79.7 (18) | 42 | 74.5 (14.9) | 48 | .35 |
| MUAC, cm | 13.5 (2) | 69 | 14.3 (2) | 40 | 14.0 (1.5) | 48 | .75 |
| Temperature, °C | 38.2 (1.2) | 69 | 38.4 (1.4) | 42 | 37.6 (1.1) | 48 | .041 |
| Respiratory rate, cycles per minute | 60.3 (14.7) | 69 | 53.1 (8.7) | 40 | 56.7 (9.4) | 48 | .02 |
| Nutritional status | |||||||
| WAZ > −1 SD, n (%) | 16 (24.6) | 69 | 17 (40.5) | 42 | 24 (50.0) | 48 | |
| WAZ −1 SD to −3 SDs (low to severe underweight), n (%) | 30 (52.2) | 69 | 18 (42.9) | 42 | 17 (35.4) | 48 | |
| WAZ < −3 SDs (severe underweight), n (%) | 12 (23.2) | 69 | 7 (16.7) | 42 | 7 (14.6) | 48 | |
| WAZ, mean (SD) | −2 (1.8) | 69 | −1.5 (1.7) | 42 | −1.5 (1.7) | 48 | .09 |
| Anemia status on admission | |||||||
| Hemoglobin, mean (SD), g/dL | 8.6 (2.2) | 69 | 7.4 (2.3) | 40 | 10.0 (2.1) | 48 | <.0001 |
| HCT (%), mean (SD) | 26.1 (6.2) | 69 | 22.1 (7) | 40 | 29.7 (5.9) | 48 | <.0001 |
| No anemia (HCT >33%), n (%) | 4 (6) | 69 | 2 (5.0) | 40 | 14 (29.2) | 48 | |
| Mild anemia (HCT 25–33%), n (%) | 31 (46.3) | 69 | 11 (27.5) | 40 | 25 (52.1) | 48 | |
| Moderate anemia (HCT 15–25%), n (%) | 32 (47.8) | 69 | 22 (55.0) | 40 | 8 (16.7) | 48 | |
| Severe anemia (HCT ≤15%), n (%) | 0 (0) | 69 | 5 (12.5) | 40 | 1 (2.1) | 48 | |
| Microbiology and other laboratory results on admission | |||||||
| HIV status positive, n (%) | 25 (36.2) | 69 | 3 (7.1) | 42 | 5 (10.4) | 48 | <.0001 |
| Viral coinfections, n (%) | 32 (46) | 69 | 22 (52) | 42 | - | - | .56 b |
| Positive blood culture, n (%) | 29 (42.0) | 69 | 0 (0) | 42 | 0 (0) | 48 | |
| WBC count, mean (SD), 10³/μL | 21.3 (13.1) | 69 | 14.2 (8.4) | 41 | 10.8 (2.7) | 48 | <.0001 |
| Neutrophil granulocytes, mean (SD), 10³/μL | 13.7 (9.8) | 57 | 5.2 (3) | 32 | 4.8 (2.4) | 45 | <.0001 |
| Plasmodium density, geometric mean (SD), parasites/μL | 0 (0) | 69 | 5.9 (5.8) | 42 | 0 (0) | 48 | |
| Malaria positive (microscopy positive) | 0 (0) | 69 | 42 (100) | 42 | 0 (0) | 48 | <.0001 |
| Chest X-ray results, n (%) | |||||||
| Normal | 9 (15.0) | 69 | 42 (100) | 42 | 27 (56.3) | 48 | <.0001 |
| Other infiltrate/abnormality | 7 (11.7) | 69 | 0 (0) | 42 | 21 (43.8) | 48 | |
| Primary endpoint pneumonia | 44 (73.3) | 69 | 0 (0) | 42 | 0 (0) | 48 | |
| Evolution during admission | |||||||
| Length of admission, median (IQR), d | 4.1 (2–6.1) | 6 | 3.4 (2–5) | 4 | 3.8 (2.2–4.8) | 4 | .29 |
| Case-fatality rate (in-hospital death), n (%) | 3 (4.4) | 6 | 0 (0) | 4 | 0 (0) | 4 | .15 |
The “bacteria” group includes blood or pleural fluid culture-positive samples, samples PCR positive for respiratory pathogens, and samples with positive leukocytosis and a dense radiographic consolidation (endpoint pneumonia) as independently assessed by 2 experts. Samples that were culture or PCR positive for contaminant bacteria were excluded. Abbreviations: HCT, hematocrit; HIV, human immunodeficiency virus; IQR, interquartile range; MUAC, middle upper arm circumference; PCR, polymerase chain reaction; SD, standard deviation; WAZ, weight-for-age z score, z-score cutoff point of < −2 SDs and < −3 SDs is classified as low weight for age and severe undernutrition, respectively; WBC, white blood cell.
a P values for continuous variables were estimated through analysis of variance (Kruskal-Wallis test). P values for categorical variables used chi-square test.
b P value of the categorical variable was estimated through Fisher’s exact test.
From the 195 patients, 210 peripheral blood samples (including 15 replicates, 4 of which were excluded from downstream analysis) were assayed for protein composition using the SOMAScan platform (see Figure 1). Sample characteristics and designations of single (167 samples) and mixed infections with controls (39 samples) can be found in Supplementary Tables 2 and 3, respectively.
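The sample counts quoted in the text and in the Figure 1 caption can be reconciled with a few lines of arithmetic. The sketch assumes that the 195 patients include the 13 healthy controls, which is what the group sizes imply (69 + 42 + 48 + 23 + 13 = 195); the variable names are illustrative.

```python
# Counts quoted in the text and in the Figure 1 caption.
# Assumption: the 195 patients include the 13 healthy controls.
patients = {"bacterial": 69, "malaria": 42, "viral": 48,
            "mixed": 23, "controls": 13}
replicates = 15
excluded_replicates = 4

assayed = sum(patients.values()) + replicates   # samples run on the assay
analyzed = assayed - excluded_replicates        # samples kept downstream
single_etiology = 120 + 47                      # training + validation sets
mixed_and_controls = 39

print(assayed, analyzed, single_etiology + mixed_and_controls)
```

Under that assumption, the 210 assayed samples reduce to 206 analyzed samples, which matches the 167 single etiology samples plus the 39 mixed-infection and control samples.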
Differential Markers
Using the SOMAScan data, 219 and 151 differentially expressed protein markers (FDR <0.01) were identified in the BvV comparison (Supplementary Table 4A, heatmap in Figure 2) and the BvVM comparison (Supplementary Table 4B, heatmap in Supplementary Figure 2B), respectively. The differential protein expression signatures determined by SOMAScan are shown in the heatmap in Figure 2A. This signal is manifest only after marker selection; unsupervised clustering in the space of the entire 1107 protein panel does not reveal a clear dominant structure related to infectious etiology (Supplementary Figure 2A). Box-and-whisker plots of the 100 top-ranked markers are depicted in Supplementary Figure 4.
Figure 2.
A, Hierarchically clustered heatmap of normalized SOMAScan expression values for 219 significant markers (FDR <0.01) from the SOMAScan BvV comparison in the space of all single etiology bacterial and viral samples in this study (see Supplementary Figure 10 for full resolution with details). Top track: viral (yellow) and bacterial (blue) etiology. B, Top 10 rank-ordered protein markers (highest to lowest, left to right) in our BvV and BvVM marker sets. Abbreviations: BvV, bacteria vs virus; BvVM, bacterial vs malaria or virus; CCL23, C-C motif chemokine 23; CSF3, granulocyte colony-stimulating factor; CX3CL1, fractalkine; ESD, S-formylglutathione hydrolase; FDR, false discovery rate; HP, haptoglobin; IL1RL1, interleukin-1 receptor-like 1; IL6, interleukin-6; ITIH4, inter-α-trypsin inhibitor heavy chain H4; KYNU, kynureninase; LCN2, neutrophil gelatinase-associated lipocalin; max, maximum; min, minimum; NTN4, netrin-4; PLA2G2A, phospholipase A2; RETN, resistin; S100A9, protein S100-A9; SERPINA1, α1-antitrypsin; SLPI, anti-leukoproteinase.
Performance of Predictive Diagnostic Models
Our chief aim was to develop a protein-based biomarker panel to distinguish bacterial from other etiologies of clinical pneumonia with accuracy that would support clinical decision making. The RF and EN models had generally similar performance, with RF models performing slightly better overall (see Supplementary Table 5C and 5D) and declining in performance more smoothly with fewer input markers. We therefore focused subsequent analyses on RF results.
In single etiology samples, the performance of the BvV model (evaluated on the held-aside validation samples) was excellent. Sensitivity and specificity for bacterial cases using all 219 markers were 90% and 100%, respectively, meeting the Foundation for Innovative New Diagnostics (FIND) proposed criteria for a diagnostic test of these characteristics [14]. Furthermore, sensitivity and specificity remained at 90% and 85% with only 5 markers, potentially simplifying the translation to a field-deployable diagnostic. Accuracy was 94% (95% confidence interval [CI]: .79–.99) and 88% (95% CI: .71–.96), with 219 and 5 markers, respectively (Table 2A and Supplementary Table 5A).
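The confidence intervals quoted for accuracy can be approximated with an exact binomial interval. The text does not state how the intervals were computed, so the Clopper-Pearson method below is an assumption, and the helper names are illustrative. The input of 30 correct calls out of 32 is the count implied by the reported accuracy of 94% on the 32-sample BvV validation set.

```python
from math import comb


def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


def solve_increasing(f, target, tol=1e-10):
    """Bisection on [0, 1] for f(p) = target, with f increasing in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a proportion k / n."""
    lower = 0.0 if k == 0 else solve_increasing(
        lambda p: binom_tail_ge(k, n, p), alpha / 2)
    # P(X <= k) = 1 - P(X >= k + 1), so solve on the complement.
    upper = 1.0 if k == n else solve_increasing(
        lambda p: binom_tail_ge(k + 1, n, p), 1 - alpha / 2)
    return lower, upper


# 30 of 32 validation samples classified correctly (assumed from 94%).
low, high = clopper_pearson(30, 32)
print(f"{30 / 32:.2f} ({low:.2f}, {high:.2f})")
```

Under this assumption the interval comes out at roughly .79 to .99, in line with the reported values; the width reflects the small validation set rather than any property of the model.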
Table 2.
Single Etiology and Mixed-Infection Validation Set Predictive Diagnostic Results
| (A) Single Etiology Samples | Bacteria vs Virus (n = 32) | Bacteria vs Virus or Malaria (n = 47) | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No. of Markers | All Significant Markers (219) | 5 Markers | All Significant Markers (151) | 5 Markers | ||||||||||||
| Confusion Matrix | Actual | Actual | Actual | Actual | ||||||||||||
| Bacteria | Virus | Bacteria | Virus | Bacteria | Virus | Bacteria | Virus | |||||||||
| Predicted | Bacteria | 17 | 0 | Predicted | Bacteria | 17 | 2 | Predicted | Bacteria | 13 | 0 | Predicted | Bacteria | 10 | 10 | |
| Virus | 2 | 13 | Virus | 2 | 11 | Virus or Malaria | 6 | 28 | Virus or Malaria | 9 | 18 | |||||
| Accuracy | 0.94 | 0.88 | 0.87 | 0.60 | ||||||||||||
| 95% CI | (.79, .99) | (.71, .96) | (.74, .95) | (.44, .74) | ||||||||||||
| Sensitivity | 0.90 | 0.90 | 0.68 | 0.53 | ||||||||||||
| Specificity | 1.00 | 0.85 | 1.00 | 0.64 | ||||||||||||
| No. of markers | 219 | 100 | 50 | 25 | 15 | 10 | 5 | 151 | 100 | 50 | 25 | 15 | 10 | 5 | ||
| Accuracy | 0.94 | 0.94 | 0.97 | 0.91 | 0.84 | 0.84 | 0.88 | 0.87 | 0.83 | 0.81 | 0.85 | 0.77 | 0.77 | 0.60 | ||
| 10% Surrogates | 0.94 | 0.91 | 0.94 | 0.88 | 0.88 | 0.88 | 0.91 | 0.83 | 0.83 | 0.81 | 0.83 | 0.77 | 0.74 | 0.62 | ||
| 20% Surrogates | 0.97 | 0.91 | 0.97 | 0.91 | 0.81 | 0.78 | 0.88 | 0.85 | 0.83 | 0.81 | 0.83 | 0.77 | 0.77 | 0.57 | ||
| (B) Mixed-infection samples | Bacteria vs Virus (n = 39) | Bacteria vs Virus or Malaria (n = 39) | ||||||||||||||
| No. of markers | All Significant Markers (219) | 5 Markers | All Significant Markers (151) | 5 Markers | ||||||||||||
| Confusion Matrix | Actual | Actual | Actual | Actual | ||||||||||||
| Bacteria | Virus | Bacteria | Virus | Bacteria | Virus | Bacteria | Virus | |||||||||
| Predicted | Bacteria | 19 | 3 | Predicted | Bacteria | 16 | 4 | Predicted | Bacteria | 13 | 4 | Predicted | Bacteria | 17 | 6 | |
| Virus | 1 | 16 | Virus | 4 | 15 | Virus or Malaria | 7 | 15 | Virus or Malaria | 3 | 13 | |||||
| Accuracy | 0.90 | 0.79 | 0.72 | 0.77 | ||||||||||||
| 95% CI | (.76, .97) | (.64, .91) | (.55, .85) | (.61, .89) | ||||||||||||
| Sensitivity | 0.95 | 0.80 | 0.65 | 0.85 | ||||||||||||
| Specificity | 0.84 | 0.79 | 0.79 | 0.68 | ||||||||||||
| No. of markers | 219 | 100 | 50 | 25 | 15 | 10 | 5 | 151 | 100 | 50 | 25 | 15 | 10 | 5 | ||
| Accuracy | 0.90 | 0.90 | 0.90 | 0.90 | 0.74 | 0.85 | 0.79 | 0.72 | 0.77 | 0.67 | 0.90 | 0.85 | 0.82 | 0.77 | ||
| Sensitivity | 0.95 | 0.90 | 0.95 | 0.90 | 0.85 | 0.85 | 0.80 | 0.65 | 0.80 | 0.80 | 0.90 | 0.95 | 0.95 | 0.85 | ||
| Specificity | 0.84 | 0.89 | 0.84 | 0.90 | 0.63 | 0.84 | 0.79 | 0.79 | 0.74 | 0.53 | 0.89 | 0.74 | 0.68 | 0.68 | ||
(A) Confusion matrices and performance statistics for models using all (219 and 151, respectively) markers and 5 markers, as well as accuracy for models using 5, 10, 15, 25, 50, 100, and all (219 or 151) markers with 0%, 10%, or 20% surrogates. The BvV validation set contains 19 bacteria and 13 virus samples, and the BvVM set contains 19 bacteria, 15 malaria, and 13 virus samples. (B) Confusion matrices and performance statistics for models using all (219 and 151, respectively) markers or only 5 markers. The mixed-infection test set contains 39 samples. All samples that contain the term “bacteria” are considered positive bacterial pneumonia cases. Abbreviations: BvV, bacteria vs virus; BvVM, bacterial vs malaria or virus; CI, confidence interval.
The BvVM RF model had an accuracy of 87% (95% CI: .74–.95), a specificity of 100%, and a sensitivity of 68%. When decreasing the panel size to only 5 markers, accuracy decreased to 60%, specificity to 64%, and sensitivity to 53% (see Table 2 and Supplementary Table 5B).
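The performance statistics follow directly from the confusion matrices in Table 2A, as the sketch below shows. The function name is illustrative. The true-negative count of 13 for the 219-marker BvV model is taken from the stated 13 virus validation samples with no false positives.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and accuracy, bacteria = positive class."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Counts read from Table 2A (rows = predicted, columns = actual).
# tn=13 for the first model is inferred from the 13 virus validation
# samples and 0 false positives.
models = {
    "BvV, 219 markers": diagnostic_metrics(tp=17, fp=0, fn=2, tn=13),
    "BvV, 5 markers": diagnostic_metrics(tp=17, fp=2, fn=2, tn=11),
    "BvVM, 151 markers": diagnostic_metrics(tp=13, fp=0, fn=6, tn=28),
    "BvVM, 5 markers": diagnostic_metrics(tp=10, fp=10, fn=9, tn=18),
}

for name, m in models.items():
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Note that the BvV sensitivity evaluates to 17/19 ≈ 0.895, which is reported as 90% in the text and table.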
On healthy controls and mixed-infection samples, the BvV RF model performed well, with 95% sensitivity, 84% specificity, and 90% accuracy (95% CI: .76–.97) (Table 2B). The model correctly predicted the majority of bacterial infections and bacterial coinfections, successfully distinguishing these from nonbacterial infections (malaria and/or virus). Supplementary Table 6 depicts BvV and BvVM RF model statistics on mixed-infection samples without controls. By comparison, the clinical ALMANACH models were uniformly inferior to our molecular predictors, with a particularly dramatic loss of specificity (see Supplementary Table 7 for details).
Genetic Algorithm-Derived and Surrogate Markers
Marker subsets (with “n” ranging from 5 to 100 markers) were selected using genetic algorithms. Since the results can be nondeterministic, the method was re-run multiple times. For the BvV models, across all runs of the genetic algorithm, interleukin 1 receptor type 1, high mobility group box 1, programmed cell death 1 ligand 2, roundabout guidance receptor 2, and pregnancy-associated plasma protein A were the 5 protein markers most often selected. For the BvVM models, the most-selected markers were lymphotoxin α2/β1 protein (LTA.LTB1), TPI1, α1-antitrypsin (SERPINA1), IGFBP2, and ROR1 (see Supplementary Data File 2 for complete marker lists).
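A genetic algorithm for fixed-size subset selection can be sketched as follows. The details of the algorithm used in this study are given only in the Supplementary Appendix, so this is a generic illustration: the function name, the operators (elitist selection, union-based crossover, single-swap mutation), the parameter values, and the toy fitness function are all assumptions. In practice the fitness would be something like cross-validated classifier accuracy on the training set.

```python
import random


def select_subset(n_markers, subset_size, fitness, generations=50,
                  population_size=30, mutation_rate=0.2, seed=None):
    """Search for a high-fitness subset of marker indices."""
    rng = random.Random(seed)
    all_idx = list(range(n_markers))
    population = [tuple(sorted(rng.sample(all_idx, subset_size)))
                  for _ in range(population_size)]
    for _ in range(generations):
        # Rank candidates; the top half survives unchanged (elitism).
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            # Crossover: draw the child from the union of both parents.
            pool = sorted(set(a) | set(b))
            child = set(rng.sample(pool, subset_size))
            # Mutation: swap one member for a marker outside the subset.
            if rng.random() < mutation_rate:
                child.remove(rng.choice(sorted(child)))
                outside = [i for i in all_idx if i not in child]
                child.add(rng.choice(outside))
            children.append(tuple(sorted(child)))
        population = parents + children
    return max(population, key=fitness)


# Toy fitness (hypothetical): sum of per-marker scores.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6]
best = select_subset(len(scores), 3,
                     lambda s: sum(scores[i] for i in s), seed=0)
print(best)
```

Fixing the seed makes a single run reproducible; leaving it unset gives run-to-run variation of the kind that motivates re-running the method and counting how often each marker is selected.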
We next assessed whether models were robust to replacement of individual markers by corresponding surrogates. This provides an index of model stability and has practical relevance when converting predictive models into diagnostics, which may require marker substitution for technical reasons. The RF (and EN) classifiers for both BvV and BvVM proved to be robust to the choice of specific markers: classifier accuracy did not significantly decline even when 20% of the markers were replaced with surrogates (Supplementary Figure 5).
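Surrogates were chosen by information correlation, a criterion based on mutual information (Figure 1). A minimal sketch of that idea follows. The histogram estimator, the number of bins, the mapping sqrt(1 − exp(−2I)) (Linfoot's informational coefficient of correlation), and all function names are assumptions, since the exact formulation is not given in the main text.

```python
from collections import Counter
from math import exp, log, sqrt


def _bin(values, bins):
    """Assign each value to one of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    return [min(int((v - lo) / (hi - lo) * bins), bins - 1) for v in values]


def mutual_information(x, y, bins=4):
    """Histogram estimate of mutual information (in nats)."""
    bx, by = _bin(x, bins), _bin(y, bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(
        (c / n) * log(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


def information_correlation(x, y, bins=4):
    """Map mutual information onto a 0-1 scale: sqrt(1 - exp(-2 I))."""
    mi = max(mutual_information(x, y, bins), 0.0)
    return sqrt(1 - exp(-2 * mi))


def best_surrogate(target, candidates, bins=4):
    """Name of the candidate marker most informative about `target`."""
    return max(candidates, key=lambda name: information_correlation(
        target, candidates[name], bins))


# Hypothetical abundance profiles across 8 samples (not study data).
target = [1, 2, 3, 4, 5, 6, 7, 8]
candidates = {"a": [2, 4, 6, 8, 10, 12, 14, 16],
              "b": [1, 8, 2, 7, 3, 6, 4, 5]}
print(best_surrogate(target, candidates))
```

Unlike a linear correlation coefficient, a mutual-information criterion can also pick up nonlinear dependence between a marker and its candidate surrogate.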
Biological Processes and Pathway Analysis
To gain insight into the biology underlying the markers of bacterial and viral infection, multiple databases were queried for functional and pathway annotations. Terms significantly enriched in the bacterial or viral pneumonia marker sets were automatically clustered into nonredundant groups (details in Methods). Marker support for terms is shown in Figure 3A and the top 20 clusters in Figure 3B. Individual terms (and therefore clusters) could be supported by both bacterial and viral markers. Most clusters had support from both etiologies, but a subset (blue or red circles in Figure 3) was strongly associated with a single etiology. Two GO clusters, “chemotaxis” and “regulation of neurogenesis,” were driven almost exclusively by viral markers, while “response to bacterium” (our top ranked GO term with 39 gene hits), “regulated exocytosis,” “antimicrobial humoral response,” “positive regulation of response to external stimulus,” and “signaling by interleukins” were driven almost exclusively by bacterial markers.
Figure 3.
Pathways and gene enrichment analysis with differential markers shared between this study, RNA-sequencing, and RBM multiplex assay studies with the same study population. A–C, Clustered terms enriched in our bacteria vs virus 2-class comparison. Each node represents 1 term describing a biological process or pathway. Edges connect similar terms (similarity score [κ] >0.3); the thickness of the edge represents the similarity score. Each term is represented by a circle node, where the size is proportional to the number of input markers. The underlying file can be found as an additional supplementary file (“Cytoscape BvV network”). A, Distribution of support for each node from bacterial (red) and viral (blue) markers (ie, each pie sector is proportional to the number of hits that originated from a particular marker list). B, Nodes colored by their membership in 1 of the top 20 clusters. Each cluster is named for the term (node) with the best P value. Inset table: neutrophil degranulation, considered as a sub-pathway of regulated exocytosis, was detected as the major biological GO pathway shared between the BvVM marker set of this study and RNA-sequencing data. Of the 18 bacterial markers overlapping between the studies, 10 markers are directly involved in neutrophil degranulation (see panel E for all 18 markers). C, Bacteria vs virus marker set with nodes colored by P value. The darker the color, the more statistically significant the node (see legend for P value ranges). D and E, RBM and SOMAScan protein aliases were converted into their gene names to compare markers between studies. D, Overlap of selected marker sets: SOMAScan (BvVM, n = 156), RNA-sequencing (BvVM, n = 431), and RBM immunoassay (BvV and BvM, n = 21). E, Two direct comparisons of marker sets derived through the same approach (BvV and BvVM); filled circles indicate a marker identified in the specified analysis. 
Markers that overlapped in the 2 direct comparisons are depicted by filled circles, but not between the 4 individual marker sets. The color indicates the direction of expression change. Red: upregulation in bacterial samples; dark blue: downregulation. Light gray: the marker was not detected or not included in at least 1 of the 2 marker sets. Abbreviations: ALPL, alkaline phosphatase; BvV, bacteria vs virus; BvVM, bacterial vs malaria or virus; CHIT1, chitinase 1; CKM, creatine kinase, M-type; CST7, cystatin F; GO, Gene Ontology; HGF, hepatocyte growth factor; HP, haptoglobin; IL18R1, interleukin 18 receptor 1; IL6, interleukin-6; IL18RAP, interleukin 18 receptor accessory protein; LCN2, neutrophil gelatinase-associated lipocalin; LTF, lactotransferrin; MMP8, matrix metalloproteinase-8; MPO, myeloperoxidase; OSM, oncostatin M; PGLYRP1, peptidoglycan recognition protein 1; PLXNC1, plexin C1; RETN, resistin; S100A9, protein S100-A9; SERPINA1, α1-antitrypsin; SLPI, anti-leukoproteinase.
Neutrophil-related biological processes emerged as a key biological theme associated with bacterial infection. In particular, the “regulated exocytosis” GO cluster (34 gene hits) represents mostly neutrophil- or leukocyte-related terms. Within the top 36 GO clusters (out of 1388 total clusters, ranked by P value), 6 highly significant clusters consisting of 14 to 26 gene hits each were identified as neutrophil processes (“migration,” “mediated-immunity,” “activation,” “degranulation,” “activation involved in immune response,” and “chemotaxis”). Notably, no other cell type or subpopulation besides neutrophils appeared within the first 243 rank-ordered GO clusters. The “neutrophil degranulation” cluster was particularly prominent in markers that were identified by both SOMAScan and RNA-sequencing; it contained 10 of the 24 markers that emerged from that cross-platform comparison (Figure 3B, 3D, and 3E). This was further demonstrated by 3-way enrichment heatmap comparisons (Supplementary Figure 9).
Comparisons Between Datasets and Technologies
To assess the consistency of the results, we compared gene marker sets from similar marker-focused studies of the same population using different technologies. First, the current BvVM marker set was compared with markers found in our previously published RNA-sequencing approach [9]. Of the 1107 proteins included in our SOMAScan assay, 78 were represented by genes from the set of 600 significant differentially expressed markers in the RNA-sequencing analysis (of ~12 000 expressed genes) (Supplementary Data File 1D). Twenty-five of these 78 genes (corresponding to 24 proteins) proved to be statistically significant markers in our comparison (Supplementary Figure 6).
In the RNA data, 18 of those 24 proteins were markers for bacterial infection and 6 were markers for malaria infection. A heatmap of these markers highlights the strong class distinctions (Supplementary Figure 7). Haptoglobin (HP) is markedly decreased and hemoglobin increased in malaria samples, but the majority of markers are elevated in bacterial samples (Figure 3E; see Supplementary Figures 6 and 8 for details on the malaria markers). When we used the SOMAScan data for these 24 markers to build RF and EN models, they performed similarly to the 25-protein marker models optimized by the genetic algorithm (Supplementary Table 5E and 5F), suggesting that those 24 markers would also be good candidate markers for a diagnostic assay.
We also compared the SOMAScan marker sets with findings from a previous protein-based immunoassay (the Rules-Based Medicine [RBM] multiplex immunoassay) [13]. Five markers were identified as differential markers for bacterial pneumonia in both datasets: CKM, HP, IL6, myeloperoxidase (MPO), and SERPINA1 (Figure 3, Supplementary Figure 6). Three markers (HP, MPO, and SERPINA1) were identified as significant markers in all 3 studies (SOMAScan, multiplex immunoassay, and RNA-sequencing) despite the very different methodologies employed (Venn diagram in Figure 3D). Two markers, VCAM1 and APCS, appear in both the SOMAScan and multiplex immunoassay data as likely markers for malaria infection [13].
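The cross-platform overlaps described above amount to set intersections. The sketch below shows that bookkeeping with partial marker lists restricted to the gene symbols named in this paragraph; the complete per-platform marker sets are in the supplementary files and are not reproduced here. The helper name `shared_markers` is hypothetical.

```python
def shared_markers(*marker_sets: set[str]) -> set[str]:
    """Return the markers present in every supplied marker set."""
    if not marker_sets:
        return set()
    return set.intersection(*marker_sets)


# Partial, illustrative lists: only markers named in the paragraph above.
somascan = {"CKM", "HP", "IL6", "MPO", "SERPINA1"}
immunoassay = {"CKM", "HP", "IL6", "MPO", "SERPINA1"}
rna_seq = {"HP", "MPO", "SERPINA1"}

both_protein_platforms = shared_markers(somascan, immunoassay)
all_three_platforms = shared_markers(somascan, immunoassay, rna_seq)
# Markers shared by the two protein assays but absent from the RNA list.
protein_only = both_protein_platforms - all_three_platforms

if __name__ == "__main__":
    print(sorted(all_three_platforms))
    print(sorted(protein_only))
```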
DISCUSSION
We present diagnostic models based on aptamer-derived blood protein signatures that accurately discriminate bacterial from viral etiologies of pediatric febrile respiratory illness with as few as 5 protein markers (94% accuracy, 90% sensitivity, 85% specificity), meeting or exceeding the FIND-sponsored expert consensus guidelines on diagnostics for bacterial pneumonia [14]. Accurate discrimination of bacterial infection from both viral and malaria etiologies was achieved with 25 markers.
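For readers less familiar with these performance measures, the following minimal sketch shows how sensitivity, specificity, and accuracy are derived from a confusion matrix in which bacterial pneumonia is the positive class. The class name, function names, and counts are hypothetical and are not taken from the study's test set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a binary classifier; 'positive' means bacterial."""
    true_pos: int   # bacterial cases predicted bacterial
    false_neg: int  # bacterial cases predicted nonbacterial
    true_neg: int   # nonbacterial cases predicted nonbacterial
    false_pos: int  # nonbacterial cases predicted bacterial


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of bacterial cases that the model detects."""
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of nonbacterial cases that the model correctly rules out."""
    return c.true_neg / (c.true_neg + c.false_pos)


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all cases that the model classifies correctly."""
    total = c.true_pos + c.false_neg + c.true_neg + c.false_pos
    return (c.true_pos + c.true_neg) / total


if __name__ == "__main__":
    # Hypothetical test set: 10 bacterial and 10 nonbacterial samples.
    counts = ConfusionCounts(true_pos=9, false_neg=1, true_neg=8, false_pos=2)
    print(f"sensitivity={sensitivity(counts):.2f}")
    print(f"specificity={specificity(counts):.2f}")
    print(f"accuracy={accuracy(counts):.2f}")
```

In an antibiotic triage setting, sensitivity governs how many bacterial cases would go untreated, whereas specificity governs how many nonbacterial cases would receive unnecessary antibiotics.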
Because the BvV model was highly predictive, we investigated the proteins to understand the processes that typify bacterial and viral infections. Gene enrichment and pathway analyses showed neutrophil-dominated processes in bacterial infections. The consistency of a neutrophilic host-response signature is highlighted by common signals (18 bacterial markers) across prior studies at both the RNA and protein level, despite model and platform differences (Figure 3E). Reinforcing this observation, a cross-platform 24-marker set, highly enriched for neutrophil-associated proteins (neutrophil degranulation being prominent), proved to be equally effective in differentiating bacterial from other etiologies (Supplementary Table 5F). Ten of the 18 bacterial markers were associated with bacterial airway inflammation, modifying, mitigating, or augmenting neutrophil immunological responses. For example, SERPINA1 and SLPI are both protease inhibitors regulating neutrophil elastase activity [19–21].
Diagnosing bacterial pneumonia in pediatric populations is a challenge in all countries, and better diagnostics are needed to improve antibiotic stewardship and mortality outcomes. We addressed the limitations of the current WHO clinical pneumonia definition by applying strict additional criteria, laboratory testing, and consensus review to produce the best possible set of pneumonia cases. Our objective was to develop protein-based predictors that could eventually be ported to a field-deployable device for discriminating bacterial from nonbacterial pneumonia. While larger validation studies are needed, this study provides strong evidence that a blood-based protein panel of limited size can achieve the sensitivity and specificity required to guide clinical decisions regarding antibiotic therapy. By identifying biologically plausible sets of markers, we have laid the groundwork for the development of a point-of-care test, particularly considering that some of these markers (eg, HP, SERPINA1, and MPO) are relatively simple to measure. We identified surrogate proteins that can be exchanged for markers in our models without loss of accuracy, allowing flexibility in developing a diagnostic test. Although optimized for single-etiology samples, our models performed well in mixed infections representing the natural complexity of febrile respiratory illness. Importantly, these markers seem to discriminate appropriately even in the context of a high underlying prevalence of malnutrition or HIV, such as that in Manhiça, southern Mozambique [22, 23]. This is a significant benchmark, as a predictor must be effective across the spectrum of real-life clinical scenarios. Finally, our study provided insights into the host-response biology of our discriminant marker proteins.
These observations may inform marker selection in future prospective studies. Together with our specific models and markers, they may facilitate the development of optimized markers for pneumonia diagnosis and the eventual transition to the point-of-care tests needed to change clinical practice, particularly in settings where case-fatality rates for common infections remain high and diagnostic tools are scarce.
Supplementary Data
Supplementary materials are available at Clinical Infectious Diseases online. Consisting of data provided by the authors to benefit the reader, the posted materials are not copyedited and are the sole responsibility of the authors, so questions or comments should be addressed to the corresponding author.
Notes
Acknowledgments. The authors gratefully acknowledge the crucial commitments of our colleagues Godfrey Allan Otieno, Jacob Silterra, Katherine Almendinger, Karsten Krug, Roger Wiegand, and Rushdy Ahmad to the overall success of this clinical study, especially in the early phases. The original SomaLogic data files along with matching de-identified clinical data are available through collaboration with Dr Quique Bassat. The code is available as a .zip file in the Supplementary Material and includes (1) code used for the study, (2) datasets and intermediate tables, and (3) results. A README.txt file is included.
Financial support. This work was supported by the Bill and Melinda Gates Foundation (grant number OPP50092). The Centro de Investigação em Saúde de Manhiça (CISM) is supported by the Government of Mozambique and the Spanish Agency for International Development (AECID). ISGlobal receives support from the Spanish Ministry of Science and Innovation through the “Centro de Excelencia Severo Ochoa 2019–2023” Program (CEX2018-000806-S), and support from the Generalitat de Catalunya through the Centres de Recerca de Catalunya Program.
Potential conflicts of interest. The authors: No reported conflicts of interest. All authors have submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest.
References
- 1. Bhattacharya S, Rosenberg AF, Peterson DR, et al. Transcriptomic biomarkers to discriminate bacterial from nonbacterial infection in adults hospitalized with respiratory illness. Sci Rep 2017; 7:6548.
- 2. Bhatt S, Weiss DJ, Cameron E, et al. The effect of malaria control on Plasmodium falciparum in Africa between 2000 and 2015. Nature 2015; 526:207–11.
- 3. Troeger C, Forouzanfar M, Rao PC, et al. Estimates of the global, regional, and national morbidity, mortality, and aetiologies of lower respiratory tract infections in 195 countries: a systematic analysis for the Global Burden of Disease Study 2015. Lancet Infect Dis 2017; 17:1133–61.
- 4. Deng P, Yu J, Zhou N, Hu M. Access to medicines for acute illness and antibiotic use in residents: a medicines household survey in Sichuan Province, western China. PLoS One 2018; 13:e0201349.
- 5. Holloway KA, Rosella L, Henry D. The impact of WHO essential medicines policies on inappropriate use of antibiotics. PLoS One 2016; 11:e0152020.
- 6. Carter IWJ, Schuller M, James GS, Sloots TP, Halliday CL. PCR for clinical microbiology: an Australian and International perspective. Dordrecht: Springer Science & Business Media, 2010.
- 7. Takem EN, Roca A, Cunnington A. The association between malaria and non-typhoid Salmonella bacteraemia in children in sub-Saharan Africa: a literature review. Malar J 2014; 13:400.
- 8. Church J, Maitland K. Invasive bacterial co-infection in African children with Plasmodium falciparum malaria: a systematic review. BMC Med 2014; 12:31.
- 9. Silterra J, Gillette MA, Lanaspa M, et al. Transcriptional categorization of the etiology of pneumonia syndrome in pediatric patients in malaria-endemic areas. J Infect Dis 2017; 215:312–20.
- 10. Pettigrew MM, Gent JF, Kong Y, et al. Association of sputum microbiota profiles with severity of community-acquired pneumonia in children. BMC Infect Dis 2016; 16:317.
- 11. Feng C, Huang H, Huang S, et al. Identification of potential key genes associated with severe pneumonia using mRNA-seq. Exp Ther Med 2018; 16:758–66.
- 12. Naess A, Nilssen SS, Mo R, Eide GE, Sjursen H. Role of neutrophil to lymphocyte and monocyte to lymphocyte ratios in the diagnosis of bacterial infection in patients with fever. Infection 2017; 45:299–307.
- 13. Valim C, Ahmad R, Lanaspa M, et al. Responses to bacteria, virus, and malaria distinguish the etiology of pediatric clinical pneumonia. Am J Respir Crit Care Med 2016; 193:448–59.
- 14. Dittrich S, Tadesse BT, Moussy F, et al. Target product profile for a diagnostic assay to differentiate between bacterial and non-bacterial infections and reduce antimicrobial overuse in resource-limited settings: an expert consensus. PLoS One 2016; 11:e0161721.
- 15. SomaLogic, Inc. SOMAscan® proteomic assay technical white paper [Internet]. SomaLogic, Inc, 2017. Available at: http://somalogic.com/wp-content/uploads/2017/06/SSM-002-Technical-White-Paper_010916_LSM1.pdf. Accessed 2 November 2018.
- 16. O’Grady KA, Torzillo PJ, Ruben AR, Taylor-Thomson D, Valery PC, Chang AB. Identification of radiological alveolar pneumonia in children with high rates of hospitalized respiratory infections: comparison of WHO-defined and pediatric pulmonologist diagnosis in the clinical context. Pediatr Pulmonol 2012; 47:386–92.
- 17. Cherian T, Mulholland EK, Carlin JB, et al. Standardized interpretation of paediatric chest radiographs for the diagnosis of pneumonia in epidemiological studies. Bull World Health Organ 2005; 83:353–9.
- 18. Alonso PL, Sacarlal J, Aponte JJ, et al. Efficacy of the RTS,S/AS02A vaccine against Plasmodium falciparum infection and disease in young African children: randomised controlled trial. Lancet 2004; 364:1411–20.
- 19. Thompson RC, Ohlsson K. Isolation, properties, and complete amino acid sequence of human secretory leukocyte protease inhibitor, a potent inhibitor of leukocyte elastase. Proc Natl Acad Sci USA 1986; 83:6692–6.
- 20. McCarthy C, Reeves EP, McElvaney NG. The role of neutrophils in alpha-1 antitrypsin deficiency. Ann Am Thorac Soc 2016; 13(Suppl 4):S297–304.
- 21. du Bois RM, Bernaudin JF, Paakko P, et al. Human neutrophils express the alpha 1-antitrypsin gene and produce alpha 1-antitrypsin. Blood 1991; 77:2724–30.
- 22. González R, Munguambe K, Aponte J, et al. High HIV prevalence in a southern semi-rural area of Mozambique: a community-based survey. HIV Med 2012; 13:581–8.
- 23. Nhampossa T, Sigaúque B, Machevo S, et al. Severe malnutrition among children under the age of 5 years admitted to a rural district hospital in southern Mozambique. Public Health Nutr 2013; 16:1565–74.