mBio. 2024 Aug 29;15(10):e03355-23. doi: 10.1128/mbio.03355-23

Prediction of post-PCV13 pneumococcal evolution using invasive disease data enhanced by inverse-invasiveness weighting

Xueting Qiu 1, Lesley McGee 2, Laura L Hammitt 3, Lindsay R Grant 3, Katherine L O’Brien 3, William P Hanage 1, Marc Lipsitch 1,4
Editor: Paul Keim 5
PMCID: PMC11481909  PMID: 39207103

ABSTRACT

After the introduction of pneumococcal conjugate vaccines (PCVs), serotype replacement occurred in Streptococcus pneumoniae. Predicting which pneumococcal strains will become common in carriage after vaccination can enhance vaccine design, public health interventions, and understanding of pneumococcal evolution. Invasive pneumococcal isolates were collected during 1998–2018 by the Active Bacterial Core surveillance (ABCs). Carriage data from Massachusetts (MA) and Southwest United States were used to calculate weights. Using pre-vaccine data, serotype-specific inverse-invasiveness weights were defined as the ratio of the proportion of the serotype in carriage to the proportion in invasive data. Genomic data were processed with bioinformatic pipelines to define genetically similar sequence clusters (i.e., strains) and accessory genes (COGs) present in 5–95% of isolates. Weights were applied to adjust observed strain proportions and COG frequencies. The negative frequency-dependent selection (NFDS) model predicted strain proportions by calculating the post-vaccine strain composition in the weighted invasive disease population that would best match pre-vaccine COG frequencies. Inverse-invasiveness weighting increased the correlation of COG frequencies between invasive and carriage data on the linear or logit scale for the pre-vaccine, post-PCV7, and post-PCV13 epochs, and between different epochs in the invasive data. Weighting the invasive data significantly improved the NFDS model’s accuracy in predicting strain proportions in the carriage population in the post-PCV13 epoch, with the adjusted R² increasing from 0.254 before weighting to 0.545 after weighting. The weighting system adjusted invasive disease data to better represent the pneumococcal carriage population, allowing the NFDS mechanism to predict strain proportions in carriage in the post-PCV13 epoch. Our methods enrich the value of genomic sequences from invasive disease surveillance.

IMPORTANCE

Streptococcus pneumoniae, a common colonizer in the human nasopharynx, can cause invasive diseases including pneumonia, bacteremia, and meningitis mostly in children under 5 years or older adults. The PCV7 was introduced in 2000 in the United States within the pediatric population to prevent disease and reduce deaths, followed by PCV13 in 2010, PCV15 in 2022, and PCV20 in 2023. After the removal of vaccine serotypes, the prevalence of carriage remained stable as the vacated pediatric ecological niche was filled with certain non-vaccine serotypes. Predicting which pneumococcal clones, and which serotypes, will be most successful in colonization after vaccination can enhance vaccine design and public health interventions, while also improving our understanding of pneumococcal evolution. While carriage data, which are collected from the pneumococcal population that is competing to colonize and transmit, are most directly relevant to evolutionary studies, invasive disease data are often more plentiful. Previously, evolutionary models based on negative frequency-dependent selection (NFDS) on the accessory genome were shown to predict which non-vaccine strains and serotypes were most successful in colonization following the introduction of PCV7. Here, we show that an inverse-invasiveness weighting system applied to invasive disease surveillance data allows the NFDS model to predict strain proportions in the projected carriage population in the post-PCV13/pre-PCV15 and pre-PCV20 epoch. The significance of our research lies in using a sample of invasive disease surveillance data to extend the use of NFDS as an evolutionary mechanism to predict post-PCV13 population dynamics. This has shown that we can correct for biased sampling that arises from differences in virulence and can enrich the value of genomic data from disease surveillance and advance our understanding of how NFDS impacts carriage population dynamics after both PCV7 and PCV13 vaccination.

KEYWORDS: Streptococcus pneumoniae, invasive disease surveillance data, inverse-invasiveness weighting, carriage population, negative frequency-dependent selection

INTRODUCTION

Understanding pathogen population response to vaccination is essential for forecasting disease burden and improving vaccine composition, especially for pathogens that evolve quickly and/or are characterized by cocirculating antigenically diverse strains. Streptococcus pneumoniae (the pneumococcus), a common nasopharyngeal colonizer in children, is a pathogen that has been categorized into more than 100 serotypes. Invasive pneumococcal disease (IPD), including pneumococcal pneumonia, bacteremia, and meningitis, has historically been a leading source of pediatric morbidity and mortality. Since pneumococcal conjugate vaccine (PCV) was introduced in children in 2000, PCVs have prevented many cases of IPD and non-invasive disease in children and, indirectly, in adults, and have produced significant public health benefits globally (1, 2). Because PCVs target only a small number of the more than 100 pneumococcal serotypes, the non-vaccine serotypes (NVTs) have increased in prevalence and partially eroded the benefits of vaccination. The prevalence of NVT nasopharyngeal carriage increased substantially among children after the introduction of 7-valent PCV (PCV7), with little or no net change observed in the overall carriage prevalence of the bacteria (3). The increase in NVTs’ absolute abundance alongside the decline in vaccine-targeted serotypes is referred to as serotype replacement (3, 4). A key question is how much each of the existing NVTs may be expected to expand in carriage prevalence as their competitors are removed by vaccination, and more generally, how the abundance of specific genetic lineages will change following vaccination.

Negative frequency-dependent selection (NFDS) on the accessory genome of S. pneumoniae is one evolutionary mechanism with some predictive power for post-vaccination strain dynamics in this species (5, 6). This mechanism posits that above a certain equilibrium frequency, the presence of a gene in a strain is deleterious, but below that frequency it is advantageous; therefore, the frequency of this gene tends to return to its equilibrium after vaccination perturbs the population. A model based on this concept allowed the dynamics of individual strains in the carriage population post-PCV7 to be modeled as a selective process in which NFDS changes strain frequencies to produce gene frequencies similar to those observed at the pre-vaccine equilibrium (6). The result was an ability to predict which strains would succeed, and which would not, following vaccination.

Pneumococcal serotypes have variable ability to cause IPD, which means that strains collected from population-based IPD surveillance do not reflect how commonly different serotypes are carried. The pneumococci carried in the human nasopharynx are the populations that are transmitted from person to person and are therefore the bacterial populations upon which selection acts (7). Predictive or descriptive accounts of pneumococcal evolution therefore should be based on the population of colonizing pneumococci. However, in some circumstances, much larger and more representative data sets are only available for pneumococci causing invasive disease; this is particularly true for the period post-2010 when the 13-valent pneumococcal conjugate vaccine (PCV13) was introduced in the United States.

The purpose of this study is therefore twofold. We first describe a simple weighting system that adjusts a sample of invasive disease isolates to better represent the carriage population, showing that this weighting system improves the fit of previously studied NFDS models. Then we use weighted invasive disease data to test whether such models can effectively predict the proportional distribution of pneumococcal colonizing clones in the post-PCV13 epoch.

RESULTS

Weight values

We created a weight, an inverse measure of invasiveness, to convert the abundance of isolates carrying each serotype in invasive disease to an estimate of the frequency of such isolates in carriage. For the pre-vaccine epoch E1, there were 6,976 specimens in the IPD data from 1998 to 2000 and 384 sequenced carriage isolates from 1998 to 2001 in a combined data set from Massachusetts and the Southwest United States. Using these data from the pre-vaccine epoch E1 in the United States, the weight for each serotype is the ratio of the proportion of the serotype in the carriage data to the proportion in the IPD data. The weight value for each serotype is displayed in Fig. 1 (tabulated in Supplemental Data 3). The vaccine serotypes (VTs) in PCV7 and PCV13 typically had weight values lower than or close to 1, indicating that VTs were overrepresented in invasive disease compared to carriage. Many non-vaccine serotypes (NVTs) had weights above 1, indicating serotypes that were relatively more common in carriage than in invasive disease. In the pre-vaccine epoch, 72% of the IPD population [31 serotypes including 8 VTs (4, 18C, 14, 6B, 9V, 7F, 1, 3) of which the last 3 were only in PCV13] had a weight <1, and 28% of the population [34 serotypes including 5 VTs (19F, 23F, 19A, 6A, 5) of which the last 3 were only in PCV13] had a weight >1.

Fig 1.

Two bar graphs labeled (A and B) compare weight values across serotypes for different vaccine types: PCV7, PCV13, non-vaccine type (NVT), and NA. Graph (A) presents weight values on a linear scale, while Graph (B) presents weight values on a log scale.

Weight value for each serotype. The weight was developed from the serotype distribution in the pre-vaccine epoch in the United States (CDC ABCs IPD data from 1998 to 2000 and carriage data collected in Massachusetts and Southwest United States from 1998 to 2001). There are 65 different serotypes, including the not-assigned (NA) group. (A) Linear scale. The orange bars represent PCV7-covered serotypes, and the blue bars represent the additional six serotypes in PCV13. The red dotted line indicates the weight value is equal to 1, and the blue dotted line indicates the weight value is equal to 15. (B) Log scale. These vaccine-covered serotypes generally had weight values less than or close to 1. Many non-vaccine types (NVTs) had weight values greater than 1. PCV7, 7-valent pneumococcal conjugate vaccine; PCV13, 13-valent pneumococcal conjugate vaccine; NVT, non-vaccine types; NA, serotype not assigned.

Some serotypes had extreme weight values (≥15), but these made up a very small proportion of the IPD data (approximately 0.01%). There were four very small weight values (Fig. 1B, weight <0.1 on the log scale), including 12F, 16, 7F, and 9A, among which 7F is one of the PCV13 vaccine serotypes. These serotypes were only present in the IPD data set; to get a hypothetical weight, we added a very small count for these serotypes in carriage data assuming that they were present at a negligible or non-detectable level. These four serotypes accounted for 6.4% of IPD isolates.
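The weight definition above can be illustrated with a minimal Python sketch. The function name, the toy serotype lists, and the pseudo-count value of 0.5 are assumptions made for illustration only; the text states that a very small count was added for serotypes absent from carriage but does not give its value.

```python
from collections import Counter


def inverse_invasiveness_weights(carriage_serotypes, ipd_serotypes, pseudo_count=0.5):
    """Return {serotype: weight}, where weight = carriage proportion / IPD proportion.

    Serotypes observed in IPD but absent from carriage receive `pseudo_count`
    carriage observations (an assumed value) so that their weight is small
    but nonzero.
    """
    carriage = Counter(carriage_serotypes)
    ipd = Counter(ipd_serotypes)

    # Carriage counts for every serotype seen in IPD, with the pseudo-count
    # substituted where the serotype was never observed in carriage.
    carriage_adj = {s: (carriage[s] if carriage[s] > 0 else pseudo_count) for s in ipd}

    # Denominators: all carriage isolates plus any pseudo-counts, and all IPD isolates.
    n_missing = sum(1 for s in ipd if carriage[s] == 0)
    n_carriage = sum(carriage.values()) + pseudo_count * n_missing
    n_ipd = sum(ipd.values())

    return {s: (carriage_adj[s] / n_carriage) / (ipd[s] / n_ipd) for s in ipd}


# Toy data (not the study data): 19F is common in carriage, 14 is common in IPD.
toy_weights = inverse_invasiveness_weights(
    carriage_serotypes=["19F"] * 6 + ["14"] * 4,
    ipd_serotypes=["19F"] * 2 + ["14"] * 8,
)
```

In this toy example, the serotype that is more common in carriage than in IPD receives a weight above 1 and the more invasive serotype receives a weight below 1, which is the pattern described for NVTs and VTs, respectively.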

Accessory gene frequency correlations before and after the application of weights

We assessed the effects of the weights applied to the IPD data by comparing the correlations in accessory gene frequency (i) between the carriage population and the invasive population and (ii) between the different time periods in invasive data, with and without weights. If weighting improved the representativeness of the IPD data for the carriage population, we would expect that these correlations would improve with the application of the weights. Indeed, the correlations of accessory gene frequencies between the carriage data and the IPD data in the linear scale were improved or in no case reduced for pre-vaccine (before weights: 0.92 vs after weights: 0.94), post-PCV7 (0.91 vs 0.93), and post-PCV13 (0.95 vs 0.95) (Fig. 2). The corresponding residual sum of squares was also reduced after applying the weights. Similar findings were obtained using logit-transformed frequencies where the weights improved or in no case reduced the correlation between the carriage data and the IPD data for pre-vaccine (before weight: 0.90 vs after weight: 0.91), post-PCV7 (0.88 vs 0.88), and post-PCV13 (0.89 vs 0.91) (Fig. S1). The improved correlations suggest that the weights adjusted the invasive population to better reflect the carriage population.
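As a minimal sketch of this comparison, the Python code below computes weighted accessory gene frequencies and a Pearson correlation on the linear or logit scale. The data layout (one serotype and one set of COGs per isolate), the function names, and the toy values are assumptions for illustration; they are not the study pipeline.

```python
import math


def weighted_cog_frequencies(isolates, weights):
    """isolates: list of (serotype, set of COG names); weights: {serotype: weight}.

    Each isolate contributes its serotype weight instead of a count of 1.
    """
    total = sum(weights[serotype] for serotype, _ in isolates)
    freq = {}
    for serotype, cogs in isolates:
        for cog in cogs:
            freq[cog] = freq.get(cog, 0.0) + weights[serotype] / total
    return freq


def logit(p):
    """Logit transform; defined for 0 < p < 1 (e.g., COGs at 5-95% frequency)."""
    return math.log(p / (1.0 - p))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy IPD sample (hypothetical COG names and weights).
toy_isolates = [("14", {"cogA"}), ("14", {"cogA", "cogB"}), ("19F", {"cogB"})]
toy_weights = {"14": 0.5, "19F": 3.0}
ipd_weighted = weighted_cog_frequencies(toy_isolates, toy_weights)
ipd_unweighted = weighted_cog_frequencies(toy_isolates, {"14": 1.0, "19F": 1.0})
```

Given carriage frequencies for the same COGs, `pearson` can be applied to the unweighted and weighted IPD frequencies in turn, either directly or after `logit`, to compare the correlation before and after weighting.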

Fig 2.

Six scatter plots compare accessory gene frequency in carriage vs IPD before and after weight adjustments in the pre-vaccine, post-PCV7, and post-PCV13 epochs. The top row displays data before adjustments, while the bottom row presents data after adjustments.

Correlations of accessory gene frequencies between the carriage data and the IPD data before and after weights. The upper panels (A–C) represent the correlations between the carriage data and the IPD data before the application of weights for pre-vaccine (A), post-PCV7 (B), and post-PCV13 (C). The lower panels (D–F) show the correlations for each epoch after the application of weights. We observed both visually tighter scatterplots and improved correlations after weights. COG, accessory clusters of orthologous genes; PCV7, 7-valent pneumococcal conjugate vaccine; PCV13, 13-valent pneumococcal conjugate vaccine; R, correlation coefficient.

Given the previous observations (8) in carriage data that accessory gene frequencies tended to return to the pre-vaccine equilibrium frequencies, we compared the correlations between populations before and after vaccination—pre-vaccine versus post-PCV7, pre-vaccine versus post-PCV13, and post-PCV7 versus post-PCV13—in the IPD before (Fig. 3A through C) and after applying the weights (Fig. 3D through F). We observed that applying the weights to the IPD data improved the correlation of accessory gene frequencies between different epochs in the invasive data; similar information was revealed in the logit-transformed frequencies (Fig. S2).

Fig 3.

Six scatter plots compare accessory gene frequency in IPD before and after weight adjustments between each pair of vaccine epochs. The top row displays data before adjustments, while the bottom row presents data after adjustments.

Correlations of accessory gene frequencies between different epochs in the IPD data before and after application of weights. (A–C) show the correlations before weights in the IPD data for pre-vaccine versus post-PCV7 (A), pre-vaccine versus post-PCV13 (B), and post-PCV7 versus post-PCV13 (C). (D–F) show the correlations after weights in the IPD data for different epochs. We observed that the weights improved the correlations in the IPD data with both visually tighter scatterplots and improved correlation coefficients. COG, accessory clusters of orthologous genes; PCV7, 7-valent pneumococcal conjugate vaccine; PCV13, 13-valent pneumococcal conjugate vaccine; R, correlation coefficient.

Post-PCV13 strain proportion prediction

Note that the term “strain” refers to a genetic cluster of closely related sequences identified by population partitioning using nucleotide k-mers (PopPUNK) (9) based on the whole-genome assembly. Given the encouraging results obtained with improved accessory gene frequency correlations after the application of weights, the NFDS model (6) was tested for predicting the post-PCV13 carriage population using IPD data (isolates n = 11,294 from 2009 to 2013 collected in ABCs; details of epoch definitions in Supplemental Data 1, Table d1). Initially, 159 Global Pneumococcal Sequence Clusters (GPSCs) (10) were classified by PopPUNK, among which 96 GPSCs (n = 158 isolates) contained ≤5 sequences (Fig. S3). To reduce the impact of singular or very small GPSCs on the prediction, we removed sequence clusters (SCs) with ≤5 isolates (i.e., 96 SCs with 158 sequences), retaining 11,136 isolates and 63 SCs. Genetic diversity analysis of these 63 SCs showed that isolates within each SC clustered closely on the core genome maximum-likelihood phylogeny (Supplemental Data 1, Fig. d4 and Table d3). The majority of SCs contained mixed vaccine and non-vaccine serotypes (specific serotypes within each SC in Supplemental Data 1, Table d4).

We further compared the observed strain proportions in the IPD data at the post-PCV13 E1 and E3 epochs before (Fig. 4A) and after (Fig. 4B) the weights were applied, after removing SCs with ≤5 isolates. The weights generally reduced the proportions of strains composed of VTs and increased those of strains composed of NVTs. The proportion of a mixed strain could increase or decrease depending on its ratio of vaccine to non-vaccine serotypes.
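The behavior of mixed strains can be illustrated with a small Python sketch in which each isolate contributes its serotype weight to its strain. The strain labels, serotype assignments, and weight values below are hypothetical and chosen only to show the direction of the adjustment.

```python
def weighted_strain_proportions(isolates, weights):
    """isolates: list of (strain, serotype); weights: {serotype: weight}.

    Returns {strain: proportion} after each isolate is counted with its
    serotype-specific weight.
    """
    totals = {}
    for strain, serotype in isolates:
        totals[strain] = totals.get(strain, 0.0) + weights[serotype]
    grand_total = sum(totals.values())
    return {strain: t / grand_total for strain, t in totals.items()}


# Hypothetical sample: SC-A carries only a VT, SC-B only an NVT, SC-C is mixed.
toy_isolates = (
    [("SC-A", "14")] * 4
    + [("SC-B", "15A")] * 2
    + [("SC-C", "14"), ("SC-C", "15A")]
)
hypothetical_weights = {"14": 0.5, "15A": 2.0}

before = weighted_strain_proportions(toy_isolates, {"14": 1.0, "15A": 1.0})
after = weighted_strain_proportions(toy_isolates, hypothetical_weights)
```

Here the VT-only strain shrinks, the NVT-only strain grows, and the change in the mixed strain depends on how its isolates are split between the two serotypes.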

Fig 4.

Two bar graphs display strain proportions for pre-vaccine and post-vaccine period with different color shades. The bars are color-coded by strain composition of non-vaccine, vaccine, or mixed serotypes. The top is before weights and the bottom is after.

Strain proportion before and after application of weights. (A) is the strain proportion before the application of weights, and (B) is after. The name of the strain indicated the original GPSC called by PopPUNK. Sixty-three strains with >5 isolates were shown. The weights generally reduced the vaccine types in the strain and increased non-vaccine types. The mixed strains could be increased or decreased depending on the proportion of vaccine/non-vaccine serotypes in a mixed strain. The pre-vaccine and post-vaccine indicated pre-PCV13 and post-PCV13, respectively.

The NFDS model predicted the strain proportions in the projected carriage population better after weighting than before (Fig. 5). The regression between the observed and predicted strain proportions improved, with the adjusted R² increasing from 0.254 before weights to 0.545 after weights. This indicates that the bias introduced by the IPD sample limited the power of the NFDS model and that inverse-invasiveness weighting substantially improved its predictive power. Bootstrap tests showed that the improvement in the strain proportion prediction after applying weights was statistically significant (P value = 0.03 based on the sum of squared prediction errors [SSE]; P value = 0 based on adjusted R²). The regression coefficient between predicted and observed strain proportions was 0.77 [95% confidence interval (CI): 0.43–1.11] after the application of weights. In addition, the SSE and the root mean square error (RMSE) were significantly reduced after weighting. In the prediction model, outliers were defined as points for which the difference between the predicted and observed proportions was >1.5 times the interquartile range of the distribution of predicted and observed proportion differences. We found some outliers with a higher observed proportion than the predicted value, including SC-12 and SC-7. SC-12 contained 93.9% serotype 15A isolates, and SC-7 contained 38.8% of 23A and 60.1% of 23B. The outliers with a higher predicted proportion than the observed included SC-4 (60.2% of 15BC and 37.7% of 19A), SC-15 (84.1% of 6C), and SC-78 (94.8% of 6C). These outliers may be related to the potential cross-reactivity of their serotypes with vaccine types in PCVs or an expanded ecological niche after the vaccine types were removed.
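The outlier rule can be sketched in Python as follows. This sketch assumes a literal reading of the definition (the absolute difference between predicted and observed proportions exceeds 1.5 times the interquartile range of the differences) and assumes the "inclusive" quartile method; a conventional Tukey-fence reading or a different quartile convention would be equally compatible with the text. The strain labels and values are hypothetical.

```python
import statistics


def flag_outliers(observed, predicted, k=1.5):
    """observed, predicted: {strain: proportion}. Returns {strain: direction}.

    A strain is flagged when |predicted - observed| > k * IQR of the
    differences (a literal reading of the rule; an assumption).
    """
    diffs = {sc: predicted[sc] - observed[sc] for sc in observed}
    # Quartiles of the differences; the quartile method is an assumed choice.
    q1, _median, q3 = statistics.quantiles(list(diffs.values()), n=4, method="inclusive")
    threshold = k * (q3 - q1)

    flagged = {}
    for sc, d in diffs.items():
        if abs(d) > threshold:
            flagged[sc] = "over-predicted" if d > 0 else "under-predicted"
    return flagged


# Hypothetical proportions for six strains; only SC-f deviates strongly.
toy_observed = {"SC-a": 0.10, "SC-b": 0.10, "SC-c": 0.10,
                "SC-d": 0.10, "SC-e": 0.10, "SC-f": 0.10}
toy_predicted = {"SC-a": 0.10, "SC-b": 0.11, "SC-c": 0.09,
                 "SC-d": 0.105, "SC-e": 0.095, "SC-f": 0.30}
toy_flags = flag_outliers(toy_observed, toy_predicted)
```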

Fig 5.

Two scatter plots labeled (A and B) compare observed versus predicted strain proportions, with and without weighting, highlighting deviations from perfect predictions. Strain types are indicated, with key metrics provided.

The NFDS model predicted strain proportions before and after applying weights. Strains (SCs) were GPSC clusters with >5 isolates. There were 63 SCs with >5 isolates, among which 53 strains contained at least one NVT isolate pre-vaccine or were imputed with one isolate pre-vaccine. The scatterplot of observed versus predicted proportions of 53 strains at post-PCV13 equilibrium is based on quadratic programming. Perfect predictions fall on the dotted line of equality (1:1 line). The shaded gray region shows the confidence interval from the linear regression model used to test for deviation of the observed versus predicted values compared with the 1:1 line. (A) is the post-PCV13 E3 strain proportion prediction based on the IPD data with 2009 as E1 and 2017–2018 as E3 before applying weights, and (B) is after applying weights. Outliers are defined as points for which the difference between the predicted and observed proportions is >1.5 times the interquartile range of the distribution of predicted and observed proportion differences. There were no outliers in the pre-weighting prediction. The SC labels in (A) show where the outliers from the post-weighting prediction were located in the pre-weighting prediction plot. In the prediction after applying weights, the outliers with a higher observed proportion than the predicted are SC-7 and SC-12. SC-7 contained 38.8% of serotype 23A and 60.1% of 23B, and SC-12 contained 93.9% of serotype 15A. The outliers with a higher predicted proportion than the observed included SC-4 (60.2% of 15BC and 37.7% of 19A), SC-15 (84.1% of 6C), and SC-78 (94.8% of 6C). Adj. R², adjusted R-squared; Mixed, strain containing both vaccine type and non-vaccine type; NFDS, negative frequency-dependent selection; NVT, non-vaccine type; PCV13, 13-valent pneumococcal conjugate vaccine; RMSE, root mean square error; SC, sequence cluster or strain; SSE, sum of squares due to error.
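The quadratic-programming step can be sketched as a constrained least-squares problem: find non-negative strain proportions summing to 1 whose implied COG frequencies are closest to the pre-vaccine COG frequencies. The Python sketch below makes that assumption explicit and substitutes projected gradient descent for a dedicated quadratic-programming solver; the function names and the toy matrix are hypothetical, and the sketch is not the implementation used in the study.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        candidate = (cumulative - 1.0) / j
        if uj - candidate > 0:
            theta = candidate  # keep the shift from the largest valid j
    return [max(x - theta, 0.0) for x in v]


def predict_strain_proportions(cog_by_strain, target_cog_freq, n_iter=5000):
    """Minimize sum_g (sum_s A[g][s] * x[s] - e[g])**2 over the simplex.

    cog_by_strain[g][s]: frequency of COG g within strain s (A).
    target_cog_freq[g]: pre-vaccine frequency of COG g (e).
    """
    n_cogs, n_strains = len(cog_by_strain), len(cog_by_strain[0])
    # Step size 1/L, using the bound L <= 2 * (squared Frobenius norm of A).
    step = 1.0 / (2.0 * sum(a * a for row in cog_by_strain for a in row))
    x = [1.0 / n_strains] * n_strains  # start from equal proportions
    for _ in range(n_iter):
        residual = [sum(row[s] * x[s] for s in range(n_strains)) - target
                    for row, target in zip(cog_by_strain, target_cog_freq)]
        grad = [2.0 * sum(cog_by_strain[g][s] * residual[g] for g in range(n_cogs))
                for s in range(n_strains)]
        x = project_to_simplex([xi - step * gi for xi, gi in zip(x, grad)])
    return x


# Toy problem: 4 COGs, 3 strains; the target is generated from known proportions.
toy_A = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
toy_target = [0.8, 0.5, 0.5, 0.2]  # equals toy_A applied to (0.5, 0.3, 0.2)
toy_prediction = predict_strain_proportions(toy_A, toy_target)
```

Because the toy target is generated from known proportions and the toy matrix has full column rank, the minimizer is unique and the sketch recovers those proportions. With real data the fit is not exact, and the predicted proportions are then compared with the observed ones.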

Sensitivity analysis

Due to the presence of many singular or small sequence clusters in the GPSC assignment, we tested different cutoff values to exclude these small SCs. In the main prediction analysis, we excluded SCs with ≤5 isolates. In the sensitivity analysis, we tested the impact of excluding SCs with ≤3 isolates and SCs with ≤10 isolates. Out of the 11,294 sequences and 159 GPSCs assigned by PopPUNK, we excluded 89 SCs (n = 128 sequences) using the ≤3 isolates criterion, resulting in 11,166 isolates and 70 SCs included in the model. The NFDS model predicted strain proportions more accurately after weights, with the adjusted R² increasing from 0.284 before weights to 0.554 after weights (Fig. S4). Using the ≤10 isolates criterion, we excluded 104 SCs (n = 228 isolates), resulting in 11,066 isolates and 55 SCs included in the model. The NFDS prediction model’s adjusted R² increased from 0.267 before weights to 0.535 after weights (Fig. S5). The sensitivity analyses demonstrated that different cutoffs for removing small GPSCs had a negligible effect on the NFDS prediction. The conclusion that applying weights significantly improved the accuracy of strain proportion prediction for post-PCV13 held true across all three GPSC cutoff thresholds tested.
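The cutoff step itself is simple to express. The Python sketch below assumes one cluster label per isolate; the function name and toy labels are illustrative only.

```python
from collections import Counter


def drop_small_clusters(cluster_labels, max_small_size=5):
    """Remove isolates whose cluster has <= max_small_size members.

    cluster_labels: one cluster label per isolate.
    Returns (labels of retained isolates, number of retained clusters).
    """
    sizes = Counter(cluster_labels)
    kept = [c for c in cluster_labels if sizes[c] > max_small_size]
    return kept, len(set(kept))


# Toy labels: cluster sizes 12, 5, 4, and 2.
toy_labels = ["SC-1"] * 12 + ["SC-2"] * 5 + ["SC-3"] * 4 + ["SC-4"] * 2

# Count retained isolates and clusters at each cutoff used in the text.
toy_results = {}
for cutoff in (3, 5, 10):
    kept, n_clusters = drop_small_clusters(toy_labels, cutoff)
    toy_results[cutoff] = (len(kept), n_clusters)
```

Running the same filter at each cutoff before fitting the model reproduces the structure of the sensitivity analysis: only the set of retained clusters changes between runs.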

Epidemiological surveillance showed that PCV13 did not decrease the carriage prevalence of serotype 3 after vaccine introduction even though serotype 3 was a component of the vaccine; therefore, in a separate sensitivity analysis, we repeated the NFDS model prediction but treated serotype 3 as an NVT. Serotype 3 had 1,442 sequences (12.9% before weights) in the IPD data with GPSCs containing >5 sequences. Serotype 3 isolates were mainly in SC-14 (n = 1,403), SC-89 (n = 12), SC-59 (n = 10), and SC-279 (n = 10). Applying weights significantly improved the prediction accuracy of the NFDS model: the adjusted R² increased from 0.378 before the application of weights to 0.553 afterward (Fig. S6). The adjusted R² was generally improved when compared with treating serotype 3 as a VT (Fig. 5). The outliers after weighting when treating serotype 3 as an NVT were the same as when treating serotype 3 as a VT. However, two outliers were reported before weighting when treating serotype 3 as an NVT: SC-14 (99.9% of serotype 3) and SC-4 (60.2% of serotype 15BC and 37.3% of 19A).

DISCUSSION

The application of our weighting system to the NFDS model significantly enhanced its predictive power for strain proportions in the post-PCV13 epoch. This improvement highlights the potential of using genomic data from invasive disease surveillance to infer pneumococcal population dynamics in the carriage population. By addressing the bias in invasive disease samples through weighting, our study bridges a critical gap, enabling more accurate predictions even when direct carriage data is limited. This methodological advancement underscores the value of invasive disease data not only for understanding disease prevalence but also for gaining insights into evolutionary mechanisms and population dynamics. Our findings suggest that such weighted models could be instrumental in future vaccine impact assessments and epidemiological studies, offering a robust framework for predicting pathogen behavior in response to vaccination programs.

The weight value estimates the inverse of the invasiveness of each serotype based on the S. pneumoniae population in the United States. Previous studies have quantified the tendency of different serotypes to cause invasive disease by the case-to-carrier ratio or invasiveness index (11, 12). The major determinant of invasiveness is the capsular serotype, and most serotypes have known invasiveness estimates from epidemiological or experimental data (11, 13–15). Therefore, invasive disease data can reflect the carriage population, but with the more invasive serotypes overrepresented. We used a composite data set of a carriage study primarily in children <5 and one in children <7 years old, and an all-ages invasive disease data set to arrive at our weights. This represents a working hypothesis that carriage in young children is the primary population for pneumococcal transmission and evolution, consistent with evidence that young children are key drivers of pneumococcal population dynamics. Several lines of evidence suggest this role, most notably the substantial drop in IPD in all ages following infant and toddler vaccination with PCV7 (16, 17). The use of an all-ages invasive disease data set, ABCs, to develop the weights was appropriate because it was this data set that was used for NFDS model predictions, and the purpose of the weights is to transform a given invasive disease data set to frequencies more representative of carriage in the population that is undergoing evolutionary pressure. In addition, the weight was calculated based on the serotype distribution during the pre-vaccine epoch (1998–2001). This is a deliberate methodological choice because the pre-vaccine serotype-specific weight represents the population when it was at equilibrium and not perturbed by the PCVs, which is what the NFDS prediction model requires. However, this choice may be one reason for reduced model fit, because in several instances the clonal composition of serotypes changed between pre-PCV7 and subsequent epochs. This is a limitation of serotype-specific weighting. A potentially more sophisticated approach is to compute a strain-specific weight that accounts for changes in serotype composition before and after vaccination.

PCVs have been effective at reducing pneumococcal disease worldwide. However, replacement with non-PCV serotypes remains a concern, such as the increase of serotype 19A after PCV7 introduction in the United States in 2000 (18, 19). After PCV13 introduction, replacement was also observed in the United Kingdom (with the important proviso that serotype 3 remained a consistent cause of IPD) (20–22). In our sensitivity analysis, treating serotype 3 as an NVT generated a more accurate prediction of the post-PCV13 strain proportions both before and after weighting. We also observed some outliers, SCs which made up a proportion of the equilibrium population that was either substantially more or less than predicted. When examining the serotype composition of these strains, we found that the strains overestimated by the NFDS model (SC-4, SC-15, and SC-78) were associated with serotypes 15BC and 6C; in the case of 6C this could represent partial cross protection against 6C from the 6A and 6B components of PCV13 (23). We also found that serotypes 23A, 23B, and 15A were common among the strains the model underestimated (SC-7 and SC-12), suggesting that these serotypes had additional advantages compared with other NVTs. One explanation may be that the proportion of penicillin-nonsusceptible clones in 15A, 23A, and 23B has increased dramatically post-PCV7, and this has continued into the post-PCV13 epoch (24). Although not observed in the United States, an increase in serotype 24F caused a sizeable increase in pneumococcal meningitis cases in many countries after 5 years of the PCV13 rollout (25). The advantage of serotype 24F, belonging to GPSC10, was potentially driven by multidrug resistance and lack of immunity in the host population (25). Hence, to fully explain the response of the pneumococcal population after the introduction of successive PCVs in the United States, other mechanisms should be investigated with continued pneumococcal invasive disease or carriage population surveillance.

There are some limitations in this study. First, sampling biases in host age and/or geographic regions existed in the carriage data and the invasive disease data, resulting in inaccuracy in the weight system. Although the CDC ABCs invasive disease data collected from 10 states/regions across the United States were reasonably representative of the population diversity, the carriage data were collected from Massachusetts and the Southwest United States and may not have fully represented the serotype and strain diversity across the entire United States. Additional noise may have arisen from errors in serotype identification. Although the whole-genome approach to identifying the sample serotypes was highly consistent with laboratory identification, some errors or ambiguous assignments occurred (26, 27). Second, weights developed from the US data may not work for S. pneumoniae populations in other locations because of spatial heterogeneity in the bacterial population. For example, in the carriage population from Maela, Thailand during 2001–2007 in the absence of vaccination, the serotype distribution and strain diversity were very different from other locations in the pre-vaccine population (5, 28). However, our weight values were highly correlated with the weights developed using epidemiological data from different global regions, suggesting these values could be used for other locations, but interpretation should take the heterogeneity into account. Third, in the NFDS model, we assumed that after 7 years of the PCV13 rollout, the pneumococcal carriage population had reached post-PCV13 equilibrium. However, previous evidence has shown that the population took approximately 10 years after PCV7 to reach equilibrium (8). As the CDC ABCs are ongoing, more data will become available for further testing the model.

In conclusion, the weight system effectively adjusted invasive disease surveillance data to better represent the genomic components of the carriage population of S. pneumoniae. Weighting allowed us to predict the carriage strain proportion in the population in the post-PCV13 epoch, although other mechanisms should be further investigated to explain certain emerging strains after the rollout of PCVs. Our methods have enriched the value of genomic sequences from invasive disease surveillance that is readily available, easy to collect, and of direct interest to public health. Furthermore, the use of NFDS models has broader implications beyond S. pneumoniae. NFDS has been observed to impact population dynamics in other bacteria such as Escherichia coli, Pseudomonas aeruginosa, and Enterobacter cloacae (29, 30). These applications highlight NFDS as a powerful tool for understanding the evolutionary mechanisms shaping microbial diversity, predicting bacterial population dynamics, and informing public health strategies.

MATERIALS AND METHODS

Study population

We used three sources of data: (i) carriage data from children aged <7 years in Massachusetts, United States (hereafter referred to as MA data) (31, 32), (ii) carriage data from Native American communities in the Southwest United States (hereafter referred to as Southwest US data) (33–35), and (iii) invasive pneumococcal disease data from CDC ABCs (hereafter referred to as IPD data) (36–39).

In the MA data, pneumococcal isolates were collected from nasopharyngeal swabs of children under the age of 7 years at a participating primary care provider in communities throughout Massachusetts (31, 32). Samples were collected between October and April of 2000–2001, 2003–2004, 2006–2007, 2008–2009, 2010–2011, and 2013–2014. This final data set contained 1,314 genomic sequences and associated metadata.

The Southwest US data included pneumococci isolated from a subset of participants in three prospective, observational cohort studies of pneumococcal carriage among Southwest Native American individuals from 1998 through 2012, described elsewhere (6, 33–35). Briefly, participants living on Tribal lands in the Southwest United States were enrolled during three periods: 1998–2001, 2006–2008, and 2010–2012. Nasopharyngeal swab specimens were obtained during visits to healthcare facilities or the participants' homes to determine pneumococcal carriage status. Except for a subset of isolates collected from 2006 to 2008, all isolates were obtained from children <5 years. The Southwest US data included 937 genomic sequences and associated metadata that were previously published using a random selection of isolates from each time period, with an oversampling of isolates post-PCV7.

All the IPD data from 1998 through 2018 were identified through CDC ABCs, an active population- and laboratory-based system that covers a population of approximately 34.5 million individuals in 10 geographic locations in the United States, including California, Colorado, Connecticut, Georgia, Maryland, Minnesota, New Mexico, New York State, Oregon, and Tennessee. ABCs areas, methods, and key surveillance data through 2018 are described elsewhere (37–39). Briefly, ABCs personnel routinely contacted all microbiology laboratories serving acute care hospitals in their area to identify cases. Standardized case report forms that included information on demographic characteristics, clinical syndrome, and outcome of illness were completed for each identified case. Strain characterization was conducted on all pneumococcal isolates, which included identifying capsular serotype and determining antimicrobial resistance. A case of IPD was defined as S. pneumoniae detection from a normally sterile site in a surveillance area resident; common IPD syndromes included bacteremia with pneumonia, bacteremia without focus, and meningitis. Isolates were assigned serotypes at CDC’s Streptococcus Laboratory or the Minnesota Department of Health using Quellung (1998–2015) and whole-genome sequencing (WGS; 2015–2018) (40). The final IPD data set had 11,795 sequences and associated metadata.

Epoch definition

Epochs were defined to determine which time periods were used to generate weights and which were used to model different periods of vaccination. To generate weights, the MA data from 2000–2001 were defined as the pre-vaccine epoch (E1) because the PCV7 rollout for children started in 2000, and we considered that this vaccine had not yet perturbed the bacterial population; 2003–2009 was defined as the post-PCV7 epoch (E2); and 2010–2014 as the post-PCV13 epoch (E3). With the Southwest US data, we defined 1998–2001, 2006–2008, and 2010–2012 as E1, E2, and E3, respectively. With the IPD data, for the weight development, we defined 1998–2000 as E1, 2009 as E2, and 2013–2018 as E3. The pre-vaccine (E1) data from the three data sets were used to create weights. Then, E1, E2, and E3 were used to evaluate the impact of the weights by comparing accessory gene frequencies before and after weighting.

To use IPD data to predict post-PCV13 population dynamics, we defined 2009 as the pre-vaccine period for PCV13 (Pre-PCV13 E1), assuming that the population had reached a new equilibrium post-PCV7; 2013–2016 as Post-PCV13 E2, the removal stage of vaccine types in the population; and 2017–2018 as Post-PCV13 E3, assuming that the new equilibrium post-PCV13 had been reached. In this study, the NFDS model attempted to predict the strain proportions in 2017–2018, as this was the longest time period after the introduction of PCV13 for which data were available. Specific data distribution by year, geographic location, and population, as well as detailed data sets used to create weights and conduct the NFDS prediction model, can be found in Supplemental Data 1.
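The two IPD epoch schemes described above can be written down as a small lookup. This is only an illustrative sketch: the year ranges are taken from the text, while the names `IPD_EPOCHS`, `ipd_epoch`, and the `purpose` keys are hypothetical and do not come from the authors' R code.

```python
# Year ranges (inclusive) for the IPD data, as stated in the text.
# Dictionary and function names are hypothetical, for illustration only.
IPD_EPOCHS = {
    "weights": {
        "E1": (1998, 2000),
        "E2": (2009, 2009),
        "E3": (2013, 2018),
    },
    "prediction": {
        "Pre-PCV13 E1": (2009, 2009),
        "Post-PCV13 E2": (2013, 2016),
        "Post-PCV13 E3": (2017, 2018),
    },
}


def ipd_epoch(year, purpose):
    """Return the epoch label for an isolation year, or None if the year
    falls outside every epoch defined for that purpose."""
    for label, (start, end) in IPD_EPOCHS[purpose].items():
        if start <= year <= end:
            return label
    return None
```

Years that fall between epochs (for example, 2001–2008 in the weight-development scheme) return `None`, reflecting that those isolates are not assigned to any epoch in that scheme.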

Serotype-specific weights

Invasiveness measures for pneumococcal serotypes have been defined in various ways, each monotonically associated with the ratio of invasive disease incidence to carriage prevalence for that serotype in a particular population (11, 14). While other determinants may play a role in invasiveness, serotype accounts for much of the variation in this property (11–13, 15). We therefore chose to use an inverse measure of invasiveness as a weight to convert invasive disease abundance of isolates carrying each serotype to an estimate of the frequency of such isolates in carriage. Specifically, using data from the pre-vaccine epoch of 1998–2001 in the United States, the weight for each serotype is the ratio of the proportion of the serotype in the carriage data (i.e., the combined MA and Southwest US data sets from 1998 to 2001) to the proportion in the IPD data from 1998 to 2000. We used all cases in the ABCs data from the pre-vaccine period, whether the isolate from the case was sequenced or not. For serotypes that were only present in the IPD data, we added a small count (n = 0.05) to each of these serotypes in the carriage data to generate an approximate weight value. For those serotypes that were only present in the carriage data, we did not calculate a weight. For the serotypes that were neither in the IPD pre-vaccine epoch nor in the carriage data but emerged after the vaccine was introduced, we applied a very small count (0.05 for each of the serotypes in the carriage data and proportionally 0.85 for the IPD data) to generate a hypothetical weight that was close to 1. Some ambiguous serotypes, for which inference from whole-genome sequence was not possible, were assigned to a most likely serotype as described in Supplemental Methods 1.
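As a minimal sketch of the weight definition above, the ratio of proportions can be computed from two lists of pre-vaccine serotype labels. The function name and inputs are hypothetical, and it is an assumption of this sketch (not stated in the text) that the 0.05 pseudo-count is also included in the carriage total. The third case in the text (serotypes that emerged only after vaccination) is not handled here.

```python
from collections import Counter

PSEUDO_COUNT = 0.05  # small carriage count for IPD-only serotypes (value from the text)


def inverse_invasiveness_weights(carriage_serotypes, ipd_serotypes,
                                 pseudo=PSEUDO_COUNT):
    """Weight = (proportion in carriage) / (proportion in IPD).

    Inputs are lists of serotype labels, one per pre-vaccine isolate or case.
    Serotypes seen only in carriage receive no weight, as in the text.
    """
    carriage = Counter(carriage_serotypes)
    ipd = Counter(ipd_serotypes)

    # Serotypes present only in IPD get a small pseudo-count in carriage.
    for serotype in ipd:
        if serotype not in carriage:
            carriage[serotype] = pseudo

    # Assumption: the pseudo-counts are part of the carriage denominator.
    n_carriage = sum(carriage.values())
    n_ipd = sum(ipd.values())

    return {
        serotype: (carriage[serotype] / n_carriage) / (count / n_ipd)
        for serotype, count in ipd.items()
    }
```

Under this definition, a serotype that is common in carriage but rare in invasive disease receives a weight above 1, and a highly invasive serotype that is rarely carried receives a weight close to 0.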

Bioinformatics pipeline

All genomic and associated epidemiological data are publicly available (8, 31, 32, 38, 39) (National Center for Biotechnology Information NCBI accession numbers provided in Supplemental Data 2). Records in NCBI showed that the MA and Southwest US carriage isolates were sequenced with Illumina HiSeq 2000 and the IPD isolates were sequenced with Illumina MiSeq, indicating the three datasets had comparable sequencing platforms. Raw sequence reads were downloaded from the Sequence Read Archive (SRA) in the NCBI database. Genomic assembly from raw reads was constructed using SPAdes v3.10 (41) and annotated using Prokka v1.11 (42) embedded in Unicycler (43). Quast v4.4 (44, 45) was used to assess the quality of each genomic assembly. A sequence assembly was excluded when it had (i) an N50 less than 15 kb; or (ii) ≥500 contigs, indicating the genome was too segmented; or (iii) a genome length <1.9 Mb or >2.4 Mb; or (iv) a GPSC (10) not assigned due to “genomes with outlying lengths detected.” With the reference genome ATCC 700669 (5), Roary v3.10.0 (46) was then used to identify core and accessory genes. The core genes were defined as being present in >99% of the isolates to generate a core gene alignment, whereas accessory genes were present in 5–95% of the isolates and formed into a matrix of absence (value of 0) and presence (value of 1) for each sample. There were 3,166 accessory genes identified in the IPD population.
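The assembly exclusion criteria and the 5–95% accessory gene definition above can be expressed as two small filters. This is a sketch under stated assumptions: the function names and the dictionary-based presence/absence layout are hypothetical, the thresholds are those given in the text, and treating the 5% and 95% bounds as inclusive is an assumption.

```python
def passes_qc(n50_bp, n_contigs, genome_length_bp, gpsc_assigned=True):
    """Return True if an assembly survives the four exclusion criteria:
    N50 < 15 kb, >= 500 contigs, length outside 1.9-2.4 Mb, or no GPSC."""
    return (
        n50_bp >= 15_000
        and n_contigs < 500
        and 1_900_000 <= genome_length_bp <= 2_400_000
        and gpsc_assigned
    )


def accessory_genes(presence, low=0.05, high=0.95):
    """Select genes present in 5-95% of isolates.

    `presence` maps gene name -> list of 0/1 values, one per isolate
    (a hypothetical layout for the Roary presence/absence matrix).
    """
    selected = []
    for gene, column in presence.items():
        frequency = sum(column) / len(column)
        if low <= frequency <= high:  # inclusive bounds are an assumption
            selected.append(gene)
    return selected
```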

Strain definition

The genetically similar sequence clusters (SCs), referred to here as strains, were defined by population partitioning using nucleotide k-mers (PopPUNK) (9) based on the whole-genome assembly. The PopPUNK approach followed instructions from the Global Pneumococcal Sequencing project (https://www.pneumogen.net/gps/assigningGPSCs.html). The advantages of SCs defined by PopPUNK are that they have a global definition and can be compared across different populations. However, the major challenge posed by the approach is that many SCs contain only one or a few sequences and therefore carry too little information for prediction; these SCs were removed before running the NFDS prediction model for the post-PCV13 epoch.

Weights

To adjust the accessory gene component of the invasive disease population to represent carriage, we applied the weights to the accessory gene matrix of the CDC ABCs invasive isolates that were collected and sequenced from 1998 to 2018. This was done by directly multiplying the weights by the accessory gene matrix of absence (value of 0) and presence (value of 1). The calculation of accessory gene frequency before and after weighting is demonstrated in Supplemental Methods 2. We assessed the effect of the weights cross-sectionally, using the correlation of accessory gene frequencies between the carriage data and the invasive data before and after weighting, and longitudinally, using the correlation between the pre- and post-vaccine epochs in the invasive data before and after weighting. The correlation coefficient and the residual sum of squares (RSS) were used to measure whether weighting improved the correlations of accessory gene frequencies. To reduce heteroskedasticity, we repeated these analyses with logit-transformed frequencies. The analysis was conducted in RStudio v1.0.143 with R v4.1.2.
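The following sketch illustrates one plausible reading of the weighting step: each isolate's 0/1 row is multiplied by the weight of its serotype, and the weighted column sums are divided by the total weight. The exact calculation is given in Supplemental Methods 2, so the normalisation used here is an assumption, as are all function names and the clipping constant `eps` in the logit transform.

```python
import math


def cog_frequencies(presence_rows, isolate_weights=None):
    """Frequency of each accessory gene across isolates.

    presence_rows: list of 0/1 lists (one row per isolate, one column per gene).
    isolate_weights: one weight per isolate (its serotype weight); None means
    unweighted, i.e. every isolate counts equally.
    """
    if isolate_weights is None:
        isolate_weights = [1.0] * len(presence_rows)
    total = sum(isolate_weights)
    n_genes = len(presence_rows[0])
    return [
        sum(w * row[g] for row, w in zip(presence_rows, isolate_weights)) / total
        for g in range(n_genes)
    ]


def logit(p, eps=1e-6):
    """Logit transform, clipping p away from 0 and 1 (eps is an assumption)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

A cross-sectional comparison would then correlate `cog_frequencies(ipd_rows, ipd_weights)` with the unweighted carriage frequencies, on the linear scale or after applying `logit` to both vectors.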

To adjust the strain distribution in the invasive disease data to represent carriage, we calculated strain proportions based on serotype-specific weights; that is, the weights were applied to adjust the serotype components in each strain. Calculations of strain proportions before and after weighting are demonstrated in Supplemental Methods 2.
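A minimal sketch of the strain-level adjustment, assuming that each isolate contributes its serotype weight to its strain and that the weighted totals are then normalised to sum to 1. The authoritative calculation is in Supplemental Methods 2; the function name and the (strain, serotype) input layout are hypothetical.

```python
from collections import defaultdict


def weighted_strain_proportions(isolates, serotype_weights):
    """Strain proportions after inverse-invasiveness weighting.

    isolates: list of (strain, serotype) pairs, one per invasive isolate.
    serotype_weights: dict serotype -> weight. Every serotype in `isolates`
    is assumed to have a weight (a KeyError is raised otherwise).
    """
    totals = defaultdict(float)
    for strain, serotype in isolates:
        totals[strain] += serotype_weights[serotype]
    grand_total = sum(totals.values())
    return {strain: value / grand_total for strain, value in totals.items()}
```

For example, a strain made up of two isolates of a highly invasive serotype (weight 0.5 each) would be down-weighted relative to a strain with two isolates of rarely invasive serotypes (weights 2.0 and 1.0), even though both strains have the same raw count in the invasive data.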

NFDS prediction model

Using the 3,166 accessory genes present in 5–95% of isolates in the IPD data, pre-vaccine accessory gene frequencies were calculated for each strain, focusing on NVT isolates only. We considered strains that (i) had NVT isolates present pre-vaccine and (ii) were not too small when assigned by PopPUNK from the whole-genome assembly. The latter was achieved by keeping only SCs with >5 isolates. We applied the weights to the accessory gene matrix of the CDC ABCs invasive isolates that were collected and sequenced from 1998 to 2018 by directly multiplying the weights by the accessory gene matrix of absence (value of 0) and presence (value of 1) to adjust the COG frequency. In addition, to adjust the observed strain proportion, the weights were applied to adjust the serotype components in each strain (Supplemental Methods 2). We then computed the predicted proportion of each strain such that post-vaccine accessory gene frequencies approached pre-vaccine frequencies as closely as possible by using a quadratic programming approach constructed using the package quadprog v1.5-5 implemented in RStudio v1.0.143 with R v4.1.2 (6). Details of this implementation can be found in the R code provided (https://github.com/c2-d2/Predicting_Pneumo_postPCV13). Some SCs were absent in the pre-vaccine (Pre-PCV13 E1) epoch but emerged in the post-vaccine (Post-PCV13 E2 or E3) epochs: if an SC emerged in Post-PCV13 E2 and E3, then COG frequencies from Post-PCV13 E2 were used as the pre-vaccine equilibrium frequency, assuming one isolate in the pre-vaccine epoch to impute its strain frequency; and if an SC emerged only in Post-PCV13 E3, then COG frequencies from Post-PCV13 E3 were used as the pre-vaccine equilibrium frequency, again assuming one isolate in the pre-vaccine epoch to impute its strain frequency. In our data set, the majority of imputed SCs used COG frequencies from Post-PCV13 E2 as the proxy.
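The core optimisation can be illustrated as a constrained least-squares problem: find non-negative strain proportions that sum to 1 and bring the population-level COG frequencies as close as possible to the pre-vaccine target. The sketch below is not the authors' implementation, which uses quadprog in R; it substitutes a simple projected gradient descent onto the probability simplex so that it runs with the Python standard library only. All names, the iteration count, and the step-size rule are assumptions.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def predict_strain_proportions(strain_cog_freq, target_cog_freq, n_iter=5000):
    """Minimise sum_g (sum_s F[s][g] * p[s] - e[g])^2 over the simplex.

    strain_cog_freq: F, one row per strain, one column per COG
        (within-strain COG frequencies).
    target_cog_freq: e, the pre-vaccine equilibrium COG frequencies.
    """
    n_strains = len(strain_cog_freq)
    n_genes = len(target_cog_freq)
    p = [1.0 / n_strains] * n_strains

    # 2 * ||F||_F^2 bounds the gradient's Lipschitz constant, so 1/L is a safe step.
    lipschitz = 2.0 * sum(f * f for row in strain_cog_freq for f in row)
    step = 1.0 / lipschitz if lipschitz > 0 else 1.0

    for _ in range(n_iter):
        residual = [
            sum(strain_cog_freq[s][g] * p[s] for s in range(n_strains))
            - target_cog_freq[g]
            for g in range(n_genes)
        ]
        gradient = [
            2.0 * sum(strain_cog_freq[s][g] * residual[g] for g in range(n_genes))
            for s in range(n_strains)
        ]
        p = project_to_simplex(
            [p[s] - step * gradient[s] for s in range(n_strains)]
        )
    return p
```

A dedicated quadratic programming solver, as used in the paper, reaches the same kind of optimum more efficiently; the projected gradient version is chosen here only because it makes the two constraints (non-negativity and proportions summing to 1) explicit.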

We then ran the NFDS prediction models before and after weighting. For each model, accuracy would be reflected by a slope close to 1 and an intercept close to 0 in the regression between the predicted and observed strain proportions. Goodness-of-fit statistics, including the sum of squares due to error (SSE), root mean squared error (RMSE), and degrees-of-freedom-adjusted R2 (Adj. R2), were used to evaluate each model. To test the statistical significance of the difference between the NFDS prediction models before and after applying weights, we performed 1,000 bootstrap resamplings of the sequenced isolates in the Post-PCV13 E3 epoch. We then calculated the R2 and SSE of the regression between the predicted and observed strain proportions for the models with and without weights. The P value was reported as two times the proportion of bootstrap replicates in which the prediction without weights was better than the prediction with weights.
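The evaluation step can be sketched as an ordinary least-squares fit plus a bootstrap comparison. Several details below are assumptions rather than statements from the text: observed proportions are regressed on predicted proportions, RMSE is computed with n − 2 degrees of freedom, and the P value is capped at 1. Function names are hypothetical.

```python
import math


def fit_statistics(predicted, observed):
    """Simple linear regression of observed on predicted strain proportions."""
    n = len(predicted)
    if n < 3:
        raise ValueError("need at least three strains")
    mean_x = sum(predicted) / n
    mean_y = sum(observed) / n
    sxx = sum((x - mean_x) ** 2 for x in predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predicted, observed))
    sst = sum((y - mean_y) ** 2 for y in observed)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(predicted, observed)
    )
    r2 = 1.0 - sse / sst
    return {
        "slope": slope,
        "intercept": intercept,
        "sse": sse,
        "rmse": math.sqrt(sse / (n - 2)),  # assumption: n - 2 degrees of freedom
        "adj_r2": 1.0 - (1.0 - r2) * (n - 1) / (n - 2),
    }


def bootstrap_p_value(r2_weighted, r2_unweighted):
    """Two times the share of bootstrap replicates in which the unweighted
    model has the higher R2 (capping at 1 is an assumption of this sketch)."""
    worse = sum(1 for w, u in zip(r2_weighted, r2_unweighted) if u > w)
    return min(1.0, 2.0 * worse / len(r2_weighted))
```

In use, each bootstrap replicate would resample the Post-PCV13 E3 isolates, recompute the observed strain proportions, and call `fit_statistics` once for the weighted and once for the unweighted model before the paired R2 values are passed to `bootstrap_p_value`.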

Sensitivity analysis

Two sets of sensitivity analyses were conducted. Because PopPUNK-assigned GPSCs contained many small clusters with only one or a few isolates, only GPSCs with >5 isolates were included in the main analysis of the NFDS prediction model; in the sensitivity analysis, GPSCs with >3 isolates or with >10 isolates were included in two separate analyses.
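The cluster-size thresholds can be illustrated with a small helper that keeps only strains with more than a given number of isolates. The function name and the input layout (one strain label per isolate) are hypothetical.

```python
from collections import Counter


def strains_above_size(strain_labels, min_exclusive):
    """Return the set of strains with strictly more than `min_exclusive`
    isolates. `strain_labels` holds one GPSC/SC label per isolate."""
    counts = Counter(strain_labels)
    return {strain for strain, n in counts.items() if n > min_exclusive}
```

Calling it with thresholds of 5, 3, and 10 reproduces the main analysis and the two sensitivity analyses, respectively.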

Epidemiological surveillance showed that the inclusion of serotype 3 in PCV13 did not decrease the prevalence of this serotype in the years after the vaccine was introduced (20–22). Thus, in a separate sensitivity analysis, serotype 3 was treated as an NVT to recalculate which strains had NVT isolates present pre-vaccine, and the prediction model was rerun.

ACKNOWLEDGMENTS

X.Q. and M.L. thank the funding support from the Waking Up, Lantern Ventures, the Morris-Singer Fund, and Award Number U54GM088558 from the National Institute of General Medical Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health.

Contributor Information

Xueting Qiu, Email: xuetingqiu@hsph.harvard.edu.

Marc Lipsitch, Email: mlipsitc@hsph.harvard.edu.

Paul Keim, Northern Arizona University, Flagstaff, Arizona, USA.

DATA AVAILABILITY

Whole-genome sequencing data were publicly available before the initiation of the study in NCBI under BioProject numbers PRJEB2632, PRJEB8327, and PRJNA284954. Accession numbers and accompanying metadata have previously been published. The list of NCBI accession numbers for the sequencing data used and all R scripts to perform weighting and prediction can be found in the GitHub repository (https://github.com/c2-d2/Predicting_Pneumo_postPCV13).

SUPPLEMENTAL MATERIAL

The following material is available online at https://doi.org/10.1128/mbio.03355-23.

Supplemental material. mbio.03355-23-s0001.docx.

Supplemental text, tables, and figures.

mbio.03355-23-s0001.docx (3.6MB, docx)
DOI: 10.1128/mbio.03355-23.SuF1

ASM does not own the copyrights to Supplemental Material that may be linked to, or accessed through, an article. The authors have granted ASM a non-exclusive, world-wide license to publish the Supplemental Material files. Please contact the corresponding author directly for reuse.

REFERENCES

  • 1. Fitzwater SP, Chandran A, Santosham M, Johnson HL. 2012. The worldwide impact of the seven-valent pneumococcal conjugate vaccine. Pediatr Infect Dis J 31:501–508. doi: 10.1097/INF.0b013e31824de9f6
  • 2. Chen C, Cervero Liceras F, Flasche S, Sidharta S, Yoong J, Sundaram N, Jit M. 2019. Effect and cost-effectiveness of pneumococcal conjugate vaccination: a global modelling analysis. Lancet Glob Health 7:e58–e67. doi: 10.1016/S2214-109X(18)30422-4
  • 3. Weinberger DM, Malley R, Lipsitch M. 2011. Serotype replacement in disease following pneumococcal vaccination: a discussion of the evidence. Lancet 378:1962. doi: 10.1016/S0140-6736(10)62225-8
  • 4. Feikin DR, Kagucia EW, Loo JD, Link-Gelles R, Puhan MA, Cherian T, Levine OS, Whitney CG, O’Brien KL, Moore MR, Serotype Replacement Study Group. 2013. Serotype-specific changes in invasive pneumococcal disease after pneumococcal conjugate vaccine introduction: a pooled analysis of multiple surveillance sites. PLoS Med 10:e1001517. doi: 10.1371/journal.pmed.1001517
  • 5. Corander J, Fraser C, Gutmann MU, Arnold B, Hanage WP, Bentley SD, Lipsitch M, Croucher NJ. 2017. Frequency-dependent selection in vaccine-associated pneumococcal population dynamics. Nat Ecol Evol 1:1950–1960. doi: 10.1038/s41559-017-0337-x
  • 6. Azarian T, Martinez PP, Arnold BJ, Qiu X, Grant LR, Corander J, Fraser C, Croucher NJ, Hammitt LL, Reid R, Santosham M, Weatherholtz RC, Bentley SD, O’Brien KL, Lipsitch M, Hanage WP. 2020. Frequency-dependent selection can forecast evolution in Streptococcus pneumoniae. PLoS Biol 18:e3000878. doi: 10.1371/journal.pbio.3000878
  • 7. Bogaert D, De Groot R, Hermans PWM. 2004. Streptococcus pneumoniae colonisation: the key to pneumococcal disease. Lancet Infect Dis 4:144–154. doi: 10.1016/S1473-3099(04)00938-7
  • 8. Azarian T, Grant LR, Arnold BJ, Hammitt LL, Reid R, Santosham M, Weatherholtz R, Goklish N, Thompson CM, Bentley SD, O’Brien KL, Hanage WP, Lipsitch M. 2018. The impact of serotype-specific vaccination on phylodynamic parameters of Streptococcus pneumoniae and the pneumococcal pan-genome. PLoS Pathog 14:e1006966. doi: 10.1371/journal.ppat.1006966
  • 9. Lees JA, Harris SR, Tonkin-Hill G, Gladstone RA, Lo SW, Weiser JN, Corander J, Bentley SD, Croucher NJ. 2019. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Res 29:304–316. doi: 10.1101/gr.241455.118
  • 10. Gladstone RA, Lo SW, Lees JA, Croucher NJ, van Tonder AJ, Corander J, Page AJ, Marttinen P, Bentley LJ, Ochoa TJ, et al. 2019. International genomic definition of pneumococcal lineages, to contextualise disease, antibiotic resistance and vaccine impact. EBioMedicine 43:338–346. doi: 10.1016/j.ebiom.2019.04.021
  • 11. Yildirim I, Hanage WP, Lipsitch M, Shea KM, Stevenson A, Finkelstein J, Huang SS, Lee GM, Kleinman K, Pelton SI. 2010. Serotype specific invasive capacity and persistent reduction in invasive pneumococcal disease. Vaccine (Auckl) 29:283–288. doi: 10.1016/j.vaccine.2010.10.032
  • 12. Yildirim I, Shea KM, Little BA, Silverio AL, Pelton SI, Members of the Massachusetts Department of Public Health. 2015. Vaccination, underlying comorbidities, and risk of invasive pneumococcal disease. Pediatrics 135:495–503. doi: 10.1542/peds.2014-2426
  • 13. Hyams C, Trzcinski K, Camberlein E, Weinberger DM, Chimalapati S, Noursadeghi M, Lipsitch M, Brown JS. 2013. Streptococcus pneumoniae capsular serotype invasiveness correlates with the degree of factor H binding and opsonization with C3b/iC3b. Infect Immun 81:354–363. doi: 10.1128/IAI.00862-12
  • 14. Brueggemann AB, Peto TEA, Crook DW, Butler JC, Kristinsson KG, Spratt BG. 2004. Temporal and geographic stability of the serogroup-specific invasive disease potential of Streptococcus pneumoniae in children. J Infect Dis 190:1203–1211. doi: 10.1086/423820
  • 15. Brueggemann AB, Griffiths DT, Meats E, Peto T, Crook DW, Spratt BG. 2003. Clonal relationships between invasive and carriage Streptococcus pneumoniae and serotype- and clone-specific differences in invasive disease potential. J Infect Dis 187:1424–1432. doi: 10.1086/374624
  • 16. Pilishvili T, Lexau C, Farley MM, Hadler J, Harrison LH, Bennett NM, Reingold A, Thomas A, Schaffner W, Craig AS, Smith PJ, Beall BW, Whitney CG, Moore MR, Active Bacterial Core Surveillance/Emerging Infections Program Network. 2010. Sustained reductions in invasive pneumococcal disease in the era of conjugate vaccine. J Infect Dis 201:32–41. doi: 10.1086/648593
  • 17. Tan TQ. 2012. Pediatric invasive pneumococcal disease in the United States in the era of pneumococcal conjugate vaccines. Clin Microbiol Rev 25:409–419. doi: 10.1128/CMR.00018-12
  • 18. Hanage WP, Bishop CJ, Lee GM, Lipsitch M, Stevenson A, Rifas-Shiman SL, Pelton SI, Huang SS, Finkelstein JA. 2011. Clonal replacement among 19A Streptococcus pneumoniae in Massachusetts, prior to 13 valent conjugate vaccination. Vaccine (Auckl) 29:8877–8881. doi: 10.1016/j.vaccine.2011.09.075
  • 19. Pai R, Moore MR, Pilishvili T, Gertz RE, Whitney CG, Beall B, Active Bacterial Core Surveillance Team. 2005. Postvaccine genetic structure of Streptococcus pneumoniae serotype 19A from children in the United States. J Infect Dis 192:1988–1995. doi: 10.1086/498043
  • 20. Andrews N, Kent A, Amin-Chowdhury Z, Sheppard C, Fry N, Ramsay M, Ladhani SN. 2019. Effectiveness of the seven-valent and thirteen-valent pneumococcal conjugate vaccines in England: the indirect cohort design, 2006-2018. Vaccine (Auckl) 37:4491–4498. doi: 10.1016/j.vaccine.2019.06.071
  • 21. Groves N, Sheppard CL, Litt D, Rose S, Silva A, Njoku N, Rodrigues S, Amin-Chowdhury Z, Andrews N, Ladhani S, Fry NK. 2019. Evolution of Streptococcus pneumoniae serotype 3 in England and Wales: a major vaccine evader. Genes (Basel) 10:845. doi: 10.3390/genes10110845
  • 22. Lapidot R, Shea KM, Yildirim I, Cabral HJ, Pelton SI, the Massachusetts Department of Public Health. 2020. Characteristics of serotype 3 invasive pneumococcal disease before and after universal childhood immunization with PCV13 in Massachusetts. Pathogens 9:396. doi: 10.3390/pathogens9050396
  • 23. Dagan R, Patterson S, Juergens C, Greenberg D, Givon-Lavi N, Porat N, Gurtman A, Gruber WC, Scott DA. 2013. Comparative immunogenicity and efficacy of 13-valent and 7-valent pneumococcal conjugate vaccines in reducing nasopharyngeal colonization: a randomized double-blind trial. Clin Infect Dis 57:952–962. doi: 10.1093/cid/cit428
  • 24. Gertz RE, Li Z, Pimenta FC, Jackson D, Juni BA, Lynfield R, Jorgensen JH, Carvalho M da G, Beall BW, Active Bacterial Core Surveillance Team. 2010. Increased penicillin nonsusceptibility of nonvaccine-serotype invasive pneumococci other than serotypes 19A and 6A in post-7-valent conjugate vaccine era. J Infect Dis 201:770–775. doi: 10.1086/650496
  • 25. Lo SW, Mellor K, Cohen R, Alonso AR, Belman S, Kumar N, Hawkins PA, Gladstone RA, von Gottberg A, Veeraraghavan B, et al. 2022. Emergence of a multidrug-resistant and virulent Streptococcus pneumoniae lineage mediates serotype replacement after PCV13: an international whole-genome sequencing study. Lancet Microbe 3:e735–e743. doi: 10.1016/S2666-5247(22)00158-6
  • 26. Knight JR, Dunne EM, Mulholland EK, Saha S, Satzke C, Tothpal A, Weinberger DM. 2021. Determining the serotype composition of mixed samples of pneumococcus using whole-genome sequencing. Microb Genom 7:1–10. doi: 10.1099/mgen.0.000494
  • 27. Kapatai G, Sheppard CL, Al-Shahib A, Litt DJ, Underwood AP, Harrison TG, Fry NK. 2016. Whole genome sequencing of Streptococcus pneumoniae: development, evaluation and verification of targets for serogroup and serotype prediction using an automated pipeline. PeerJ 4:e2477. doi: 10.7717/peerj.2477
  • 28. van Tonder AJ, Bray JE, Jolley KA, Jansen van Rensburg M, Quirk SJ, Haraldsson G, Maiden MCJ, Bentley SD, Haraldsson Á, Erlendsdóttir H, Kristinsson KG, Brueggemann AB. 2019. Genomic analyses of >3,100 nasopharyngeal pneumococci revealed significant differences between pneumococci recovered in four different geographical regions. Front Microbiol 10:317. doi: 10.3389/fmicb.2019.00317
  • 29. Penkova E, Raymond B. 2024. When does antimicrobial resistance increase bacterial fitness? Effects of dosing, social interactions, and frequency dependence on the benefits of AmpC β-lactamases in broth, biofilms, and a gut infection model. Evol Lett 8:587–599. doi: 10.1093/evlett/qrae015
  • 30. Slater FR, Bailey MJ, Tett AJ, Turner SL. 2008. Progress towards understanding the fate of plasmids in bacterial communities. FEMS Microbiol Ecol 66:3–13. doi: 10.1111/j.1574-6941.2008.00505.x
  • 31. Croucher NJ, Finkelstein JA, Pelton SI, Mitchell PK, Lee GM, Parkhill J, Bentley SD, Hanage WP, Lipsitch M. 2013. Population genomics of post-vaccine changes in pneumococcal epidemiology. Nat Genet 45:656–663. doi: 10.1038/ng.2625
  • 32. Mitchell PK, Azarian T, Croucher NJ, Callendrello A, Thompson CM, Pelton SI, Lipsitch M, Hanage WP. 2019. Population genomics of pneumococcal carriage in Massachusetts children following introduction of PCV-13. Microb Genom 5:e000252. doi: 10.1099/mgen.0.000252
  • 33. Millar EV, O’Brien KL, Zell ER, Bronsdon MA, Reid R, Santosham M. 2009. Nasopharyngeal carriage of Streptococcus pneumoniae in Navajo and White Mountain Apache children before the introduction of pneumococcal conjugate vaccine. Pediatr Infect Dis J 28:711–716. doi: 10.1097/INF.0b013e3181a06303
  • 34. Grant LR, Hammitt LL, O’Brien SE, Jacobs MR, Donaldson C, Weatherholtz RC, Reid R, Santosham M, O’Brien KL. 2016. Impact of the 13-valent pneumococcal conjugate vaccine on pneumococcal carriage among American Indians. Pediatr Infect Dis J 35:907–914. doi: 10.1097/INF.0000000000001207
  • 35. O’Brien KL, Moulton LH, Reid R, Weatherholtz R, Oski J, Brown L, Kumar G, Parkinson A, Hu D, Hackell J, Chang I, Kohberger R, Siber G, Santosham M. 2003. Efficacy and safety of seven-valent conjugate pneumococcal vaccine in American Indian children: group randomised trial. Lancet 362:355–361. doi: 10.1016/S0140-6736(03)14022-6
  • 36. CDC. ABCs bact facts interactive data dashboard. Available from: https://www.cdc.gov/abcs/bact-facts-interactive-dashboard.html. Retrieved 19 Feb 2023.
  • 37. Bajema KL, Gierke R, Farley MM, Schaffner W, Thomas A, Reingold AL, Harrison LH, Lynfield R, Burzlaff KE, Petit S, Barnes M, Torres S, Vagnone PMS, Beall B, Pilishvili T. 2022. Impact of pneumococcal conjugate vaccines on antibiotic-nonsusceptible invasive pneumococcal disease in the United States. J Infect Dis 226:342–351. doi: 10.1093/infdis/jiac154
  • 38. Metcalf BJ, Gertz RE Jr, Gladstone RA, Walker H, Sherwood LK, Jackson D, Li Z, Law C, Hawkins PA, Chochua S, Sheth M, Rayamajhi N, Bentley SD, Kim L, Whitney CG, McGee L, Beall B, Active Bacterial Core surveillance team. 2016. Strain features and distributions in pneumococci from children with invasive disease before and after 13-valent conjugate vaccine implementation in the USA. Clin Microbiol Infect 22:60. doi: 10.1016/j.cmi.2015.08.027
  • 39. Li Y, Metcalf BJ, Chochua S, Li Z, Walker H, Tran T, Hawkins PA, Gierke R, Pilishvili T, McGee L, Beall BW. 2019. Genome-wide association analyses of invasive pneumococcal isolates identify a missense bacterial mutation associated with meningitis. Nat Commun 10:1–11. doi: 10.1038/s41467-018-07997-y
  • 40. Beall B, Walker H, Tran T, Li Z, Varghese J, McGee L, Li Y, Metcalf BJ, Gierke R, Mosites E, Chochua S, Pilishvili T. 2021. Upsurge of conjugate vaccine serotype 4 invasive pneumococcal disease clusters among adults experiencing homelessness in California, Colorado, and New Mexico. J Infect Dis 223:1241–1249. doi: 10.1093/infdis/jiaa501
  • 41. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19:455–477. doi: 10.1089/cmb.2012.0021
  • 42. Seemann T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068–2069. doi: 10.1093/bioinformatics/btu153
  • 43. Wick RR, Judd LM, Gorrie CL, Holt KE. 2017. Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol 13:e1005595. doi: 10.1371/journal.pcbi.1005595
  • 44. Mikheenko A, Valin G, Prjibelski A, Saveliev V, Gurevich A. 2016. Icarus: visualizer for de novo assembly evaluation. Bioinformatics 32:3321–3323. doi: 10.1093/bioinformatics/btw379
  • 45. Gurevich A, Saveliev V, Vyahhi N, Tesler G. 2013. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29:1072–1075. doi: 10.1093/bioinformatics/btt086
  • 46. Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MTG, Fookes M, Falush D, Keane JA, Parkhill J. 2015. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31:3691–3693. doi: 10.1093/bioinformatics/btv421

Articles from mBio are provided here courtesy of American Society for Microbiology (ASM)
