ABSTRACT
Legionella pneumophila serogroup 1 sequence types (ST) 213 and 222, a single-locus variant of ST213, were first detected in the early 1990s in the Midwest United States (U.S.) and the late 1990s in the Northeast U.S. and Canada. Since 1992, these STs have increasingly been implicated in community-acquired sporadic and outbreak-associated Legionnaires’ disease (LD) cases. We were interested in understanding the change in LD frequency due to these STs and identifying genetic features that differentiate these STs from one another. For the geographic area examined here (Mountain West to Northeast) and over the study period (1992–2020), ST213/222-associated LD cases identified by the Centers for Disease Control and Prevention increased by 0.15 cases per year, with ST213/222-associated LD cases concentrated in four states: Michigan (26%), New York (18%), Minnesota (16%), and Ohio (10%). Additionally, between 2002 and 2021, ST222 caused at least five LD outbreaks in the U.S.; no known outbreaks due to ST213 occurred in the U.S. during this time. We compared the genomes of 230 ST213/222 isolates and found that the mean of the average nucleotide identity (ANI) within each ST was high (99.92% for ST222 and 99.92% for ST213), with a minimum between ST ANI of 99.50% and a maximum of 99.87%, indicating low genetic diversity within and between these STs. While genomic features were identified (e.g., plasmids and CRISPR-Cas systems), no association explained the increasing geographic distribution and prevalence of ST213 and ST222. Yet, we provide evidence of the expanded geographical distribution of ST213 and ST222 in the U.S.
IMPORTANCE
Since the 1990s, cases of Legionnaires’ disease (LD) attributed to a pair of closely related Legionella pneumophila variants, ST213 and ST222, have increased in the U.S. Furthermore, between 2002 and 2021, ST222 caused at least five outbreaks of LD in the U.S., while ST213 has not been linked to any U.S. outbreak. We wanted to understand how the rate of LD cases attributed to these variants has changed over time and compare the genetic features of the two variants. Between 1992 and 2020, we determined an increase of 0.15 LD cases ascribed to ST213/222 per year in the geographic region studied. Our research shows that these STs are spreading within the U.S., yet most of the cases occurred in four states: Michigan, New York, Minnesota, and Ohio. Additionally, we found little genetic diversity within and between these STs, and no specific genetic features explained their geographic spread.
KEYWORDS: genomics, sequence types, Legionnaires' disease
INTRODUCTION
Bacteria in the genus Legionella can cause severe pneumonia, called Legionnaires’ disease (LD). Legionella multiplies in manufactured water systems when favorable conditions exist, such as low disinfectant levels or tepid water temperatures. Approximately one-third of Legionella species are associated with clinical disease, and ~90% of reported LD cases are due to a single species: Legionella pneumophila (1, 2). Within L. pneumophila, several thousand genetic sequence types (ST), defined by combining allele numbers for seven specific loci into an allelic profile, are considered a proxy for genetic lineage (3–6). STs are assigned based on the European Society of Clinical Microbiology and Infectious Diseases Study Group for Legionella Infections (ESGLI) sequence-based typing (SBT) database. The U.S. Centers for Disease Control and Prevention (CDC) and the European Centre for Disease Prevention and Control (ECDC) reported an increasing incidence of LD over the past two decades (7, 8), with the causative STs remaining consistent over time.
LD cases can occur as part of an outbreak, related to other cases in space and time, or as sporadic events with no known link to other cases; the U.S. CDC defines a Legionella outbreak as two or more cases with exposure to the same source within 12 months (9, 10). Worldwide, five L. pneumophila STs (ST1, ST23, ST37, ST47, and ST62) cause almost half of sporadic LD cases (11), and these are also frequently identified in outbreaks (12–20). In a retrospective review, Kozak-Muiznieks et al. (17) described the five most common STs in the U.S. associated with both outbreaks and sporadic cases between 1982 and 2012 (ST1, ST35, ST36, ST37, and ST222). Thus, the most prevalent disease-causing STs in the U.S. differ from those most often associated with the disease globally. Kozak et al. (21) first noted the prevalence of ST213 and ST222 among U.S. clinical isolates and the restriction of their geographic distribution to the Northeast U.S. in 2009. In 2010, Tijet et al. (22) described the genetic clade of ST213 and ST222 as a recently emerged sequence type of interest for southern Ontario, Canada. ST213 and ST222 are assumed to be descended from the same founding genotype (21) as single-locus ST variants. Among the seven SBT loci, these two STs differ by only four non-synonymous nucleotide changes in neuA, resulting in two amino acid changes with a predicted neutral effect (Table S1). Both STs are clinically significant and have been the source of sporadic disease, while to date, only ST222 has been associated with LD outbreaks in the U.S. and Canada (17, 23–25). Notably, the retrospective analysis in Kozak-Muiznieks et al. (17) identified ST222 as the causative ST for a U.S.-related outbreak in 2002 in Vermont, 3 years before the 2005 Canadian outbreak in Scarborough, Ontario that was most likely the first ST222 LD outbreak identified in Canada (24).
The current study summarizes ST213 and ST222 case trends in the U.S. between 1992 and 2020 based on isolates and sequences submitted to the CDC, including a timeline for their emergence and spread. We also analyzed their genomes to classify these STs phylogenetically. We examined genomic variation, including plasmids, CRISPR-Cas elements, and structural differences, as a possible basis for the increased detection of these STs. Despite identifying no obvious genetic factor associated with the observed increased prevalence, we demonstrate the increased identification of these STs within the U.S. among LD cases and their expanded geographical distribution.
RESULTS
Increased frequency of ST213/ST222 clinical isolates in the U.S. submitted to the CDC during the past 28 years
The first documented occurrence of LD in the U.S. due to ST213 occurred in 1992 in Ohio, while Pennsylvania had the first U.S. ST222-associated case in 1998 (17; Fig. 1A; Table S2). This timeline is consistent with the emergence of these STs in Ontario, Canada, in the 1990s (22). The frequency of LD due to ST213/222 remained relatively flat in the U.S. until a slight increase from baseline occurred in 2005 and 2006, followed by a spike in 2013 (n = 35 cases) and subsequent steady growth from 2014 through 2019 with 37 cases in 2019 (Fig. 1A). Overall, cases of LD were low in 2020 and 2021, likely due to factors related to the COVID-19 pandemic. Using linear regression to investigate the relationship between ST213/222 case count and year, we observed a significant relationship between these two variables (P = 4.2e−10), with an average increase of 0.15 cases per year between 1992 and 2020 (Fig. 1A); we also provide the projected case rate through 2030. We ran the same linear regression (case count against year) for the historically most prevalent ST worldwide, ST1 (15), using ST1 data from the CDC Legionella collection (Table S3). While there is a significant relationship between the variables (P = 2e−04), the regression coefficient indicates that ST1 cases identified in the U.S. increased by only 0.07 cases per year on average over the same period (Fig. 1B). Using an ANOVA, we compared the slopes of the frequency trends for ST213/222 and ST1 and found that they were significantly different (P = 0.00121).
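The slope estimate described above can be illustrated with an ordinary least-squares fit of yearly case counts against year. The following is only a minimal sketch: the function name `ols_slope` and the yearly counts are hypothetical placeholders (the study counts are in Table S2), and it does not attempt to reproduce the reported P values or the ANOVA comparison of slopes.

```python
def ols_slope(years, counts):
    """Return (slope, intercept) of the least-squares line counts ~ years."""
    n = len(years)
    if n != len(counts) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(years) / n
    mean_y = sum(counts) / n
    # Sum of squares of x and cross-products of x and y about their means
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical yearly counts for illustration only, NOT the study data
years = [2015, 2016, 2017, 2018, 2019]
counts = [10, 12, 15, 15, 18]

slope, intercept = ols_slope(years, counts)
# Extrapolate the fitted line to a future year, as done for the 2030 projection
projected_2030 = intercept + slope * 2030
```

The slope is the average change in cases per year, and the projection simply extends the fitted line beyond the observed range, so it carries the usual caveats of linear extrapolation.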
Fig 1.
The number of LD cases over time. (A) Rate of LD attributed to ST213/222 from 1992 through 2020 with rate prediction to 2030. For context, we labeled the year that the first ST213 and ST222 clinical isolates were identified in the U.S. and Europe (ST222 only). The inset graph provides a zoomed in look at the rate of cases through 2020. (B) Rate of LD cases attributed to ST1 from 1992 through 2020 with rate prediction to 2030. Note that the first ST1 clinical isolates in the U.S. were collected in 1982 from California and Indiana, before the specified time frame (1992–2020). The inset provides a zoomed in look at the rate of cases through 2020. (C) The total number of cases for ST213/222 from 1992 to 2020 parsed by state. Michigan, the state with the darkest color, indicates the largest number of LD cases attributed to ST213/222 clinical isolates compared with the other states. Only a single case due to ST213/ST222 was identified for three states (Delaware, Maine, and Missouri).
ST213 and ST222 cases spread throughout the Northeastern and Midwestern U.S. from their initial occurrence in Ohio and Pennsylvania, respectively. By 2021, they were identified as far south as Missouri and west as Oregon (Fig. 1C). The overall greatest concentration of cases attributed to ST213/222 occurred in four states: Michigan (26%), New York (18%), Minnesota (16%), and Ohio (10%). The total number of ST213/222 isolates from sporadic cases submitted to CDC (n = 188) for the entire timeline examined here (1992–2020) resulted in an essentially equivalent percent associated with either ST: ST213 (51%, n = 95) and ST222 (49%, n = 93). However, in 2019, a higher percentage of ST213 was documented as sporadic cases (68%, n = 25) than ST222 (32%, n = 12). Over time, ST222 has been increasingly implicated as the causative ST in LD outbreaks in the U.S. (17), and in total, ST222 was responsible for five documented outbreaks from 1992 to 2021. We include ST222 isolates from four outbreaks in this study, excluding the 2021 ST222 outbreak in Oregon because whole-genome sequence data are unavailable for that outbreak. Interestingly, only two of the four states with the highest occurrence of ST213/222 had a documented outbreak due to ST222 (MI in 2010 and OH in 2013). No known LD outbreaks were associated with ST213 during the same period (1992–2021; Table S2).
LD cases associated with ST213 and ST222 have occurred within the U.S. for approximately 30 years. However, we found no apparent association with geography (state or region) or time for the spread of these STs (Fig. 1A). Nor do geography and time explain the prevalence of ST222 in clinical outbreak cases of LD or the increased rate of ST213/ST222. In other words, there is no pattern of increased cases identified in a single state, by region, or over specific years, just a general increase over time. The outbreaks also showed no temporal pattern, as the first outbreak was in 2002, followed by an outbreak in 2005, then 2013, 2019, and 2021 (Table S2). The ST222 outbreak cases occurred in five states as far east as Maine and as far west as Oregon. A Fisher’s exact test between the two STs parsed on all unique outbreak and sporadic isolates through 2021 indicates a trend (P = 0.059) for ST222. Thus, ST222 was more likely associated with outbreaks than ST213, as ST213 was not associated with an outbreak during this same time frame.
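For readers unfamiliar with the test mentioned above, a two-sided Fisher's exact test on a 2×2 table can be sketched with the standard library alone. The table layout here is an assumption, not the authors' documented input: rows are ST213 and ST222, columns are outbreaks and sporadic isolates, with counts taken from the text (zero ST213 and five ST222 outbreaks; 95 ST213 and 93 ST222 sporadic isolates). The function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum all tables no more probable than the observed one;
    # the small relative tolerance guards against floating-point ties
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_obs * (1 + 1e-9)
    )


# Assumed layout: rows = ST213, ST222; columns = outbreaks, sporadic isolates
p_value = fisher_exact_two_sided(0, 95, 5, 93)
```

Under this assumed table, the result should come out near the reported P = 0.059; a different choice of counts (for example, counting outbreak isolates rather than outbreak events) would give a different value.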
High average nucleotide similarity between ST213 and ST222
While it is unclear which ST is the true founder of the ST213/222 clonal complex, ST213 serves as a hub to which at least six other single-locus variants (SLV) connect, including ST222 (Fig. 2A). Two complex members (ST1742 and ST2497) are double-locus variants of ST213 but SLVs of ST222 (Fig. 2A). At the same time, ST276 and ST2519 are double-locus variants of ST213 with a more distant relation with ST222 (Fig. 2A). There is no documentation of any STs in this clonal complex in the U.S. before 1992. To confirm that ST213 and ST222, while distinct STs, are more closely related to each other than other STs, we built a single nucleotide polymorphism (SNP) phylogeny using the core alignment generated by SNIPPY from the top 10 STs associated with clinical cases from the CDC Legionella collection. After removing recombination, the phylogeny demonstrates that ST213 and ST222 form a monophyletic clade (Fig. 2B).
Fig 2.
Clonal complex and phylogenetic analysis. (A) All STs belonging to the ST213/222 clonal complex as determined by the ESGLI SBT database. STs in white with dashed lines indicate those not represented in the CDC Legionella collection and, thus, not available for inclusion in the other analyses presented here. (B) Maximum likelihood phylogeny built using RAxML after running SNIPPY to generate a core SNP alignment and Gubbins to remove recombination. The Toronto-20052 genome (NZ_CP012019, ST222) was used as the reference to infer polymorphic sites. The included STs are the 10 most prevalent clinical STs, with sequences available in the CDC Legionella collection. Isolate selection for each ST was random (Table S9). Additionally, we included single-locus variants of ST213 (ST289, n = 2) and ST222 (ST1742, n = 1) as they are part of the ST213/222 clonal complex but were not identified as being among the top 10 clinical STs. Each ST clade is color coded, and identifying ST labels matches the coloring of the isolates in the tree. The scale bar represents the number of nucleotide substitutions per site. Black circles on each node represent a bootstrap value of 90 or greater.
We see a distinct split between ST213 and ST222 when we generate de novo assembled genomes and use the subsequent average nucleotide identity (ANI)-derived distance values to create a phylogeny that includes all isolates examined here (n = 230, Fig. 3). Additionally, the relationships among the single- or double-locus variants in the average nucleotide identity phylogeny are consistent with those displayed in the clonal complex analysis and the SNP phylogeny (Fig. 2). For example, C166-O is a single-locus variant of ST222 identified as ST1742 and groups accordingly in the clonal complex analysis (Fig. 2A) and in the ANI phylogeny (Fig. 3). At the same time, C214-S and C184-S are single-locus variants of ST213 identified as ST289, and these isolates group together in the clonal complex analysis and the ANI phylogeny (Fig. 2A and 3). Despite a relatively high degree of similarity, the variable branch lengths between the STs indicate that a range of genetic variation exists both within and between the STs (Fig. 3). One potential clade of interest consists of ST289 (Fig. 3), which contains four isolates and appears to be on a distinct evolutionary trajectory from ST213. All cases attributed to ST289 are sporadic, and the first reported clinical case occurred in 2006. At the same time, ST227, with two isolates, is a single-locus variant of ST213; yet, these two isolates are not monophyletic. Furthermore, ST276 appears on a long branch (Fig. 3), suggesting undersampling of this ST’s genetic diversity (26, 27).
Fig 3.
Phylogenetic tree generated from ANI derived distance values, using the neighbor-joining algorithm. iTOL was used to visualize the phylogenetic tree with rings around the phylogeny indicative of specific metadata. Color-coded metadata follow this order starting with the innermost circle: sequence type, year, region, source type, CRISPR-Cas type, and plasmid type. If a plasmid or a CRISPR type was not identified in an isolate, then that region is blank. The tree scale represents ANI distance calculated as 100 – ANI%.
Despite the genetic variation described above, one of the more striking features of the ST213/ST222 clade is the genetic similarity within and between these STs. Pairwise ANI comparisons revealed that between ST213 and ST222 isolates, the lowest ANI value was 99.5%, and the highest ANI value was 99.87%. In contrast, there was an average ANI of 99.92% for both ST213 and ST222 when isolates of the same ST were compared to one another (Table S4), while the within average ANI for ST1 isolates (Table S3) was 99.82% (Table S5).
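The within-ST and between-ST comparison described above amounts to grouping pairwise ANI values by whether the two isolates share an ST. The sketch below illustrates that grouping, along with the 100 – ANI% distance used for the tree in Fig. 3. The isolate names and ANI values are hypothetical placeholders, and the helper names `summarize_ani` and `ani_distance` are illustrative rather than part of any published pipeline.

```python
from statistics import mean


def ani_distance(ani_percent):
    """Convert an ANI percentage to a distance (100 - ANI%)."""
    return 100.0 - ani_percent


def summarize_ani(pairwise_ani, st_of):
    """Return (mean within-ST ANI per ST, (min, max) between-ST ANI)."""
    within, between = {}, []
    for (iso_a, iso_b), ani in pairwise_ani.items():
        st_a, st_b = st_of[iso_a], st_of[iso_b]
        if st_a == st_b:
            within.setdefault(st_a, []).append(ani)
        else:
            between.append(ani)
    within_mean = {st: mean(values) for st, values in within.items()}
    between_range = (min(between), max(between)) if between else None
    return within_mean, between_range


# Hypothetical isolates and ANI values for illustration only
st_of = {"iso1": "ST213", "iso2": "ST213", "iso3": "ST222", "iso4": "ST222"}
pairwise_ani = {
    ("iso1", "iso2"): 99.92,
    ("iso3", "iso4"): 99.90,
    ("iso1", "iso3"): 99.50,
    ("iso1", "iso4"): 99.87,
    ("iso2", "iso3"): 99.60,
    ("iso2", "iso4"): 99.70,
}

within_mean, between_range = summarize_ani(pairwise_ani, st_of)
```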
Additional genomic elements do not differentiate between ST222 and ST213
Plasmids were identified in nearly half of ST213 isolates (47/102), while only 10 out of 120 ST222 isolates contained any plasmid. Five different plasmids were identified in the ST213/ST222 clonal complex isolates (Fig. 3; Lens plasmid pLPL, C9_S plasmid unamed2, FFI102, pLELO, and Legionella sainthelensi LA01-117 plasmid pLA01-117_150k). The Lens plasmid (pLPL), the most prevalent plasmid, occurred 50 times independent of ST. We identified the Lens plasmid (pLPL) in ST213 (n = 38/102), ST222 (n = 10/120), and ST289 (n = 2/4) isolates. Interestingly, none of the outbreak-associated isolates examined here (n = 22 isolates from 4 outbreaks) contain any of the plasmids described above; instead, isolates with a plasmid were either clinical sporadic isolates or of unknown source type (Fig. 3). Furthermore, we did not identify plasmids in isolates recovered from environmental samples (n = 19). However, our sample of environmental isolates was smaller than that of clinical isolates (ST213: n = 1/102; ST222: n = 18/120; Fig. 3). Finally, we identified a Legionella sainthelensi plasmid (pLA01-117_150k) in eight ST213 isolates (Fig. 3) and one ST289 isolate.
Previous studies have identified CRISPR-Cas systems in the ST213/222 lineage (28, 29), and the three most prevalent are type I-C, I-F, or II-B. Cas2, a component of the type II-B system, is possibly required for L. pneumophila infection of its amoebal host, though only demonstrated in one strain of L. pneumophila (strain 130b [30, 31]). Thus, it is hypothesized that the CRISPR-Cas system could help protect L. pneumophila from lytic viruses (e.g., gokushoviruses) or defend against a known genetic element (LME-1) that alters a pathogen’s fitness (32). Our analysis confirmed CRISPR-Cas among ST213 and ST222 isolates (ST213: n = 102/102; ST222: n = 117/120; Fig. 3). A few clinical ST222 isolates lacked identifiable CRISPR-Cas systems (n = 3/120; Fig. 3); although we detected a CRISPR array for these isolates, each lacked an identifiable cas gene. The most prevalent CRISPR-Cas system identified was the type I-C system in 102 out of 102 ST213 isolates and 117 out of 120 ST222 isolates. Some isolates had more than one CRISPR-Cas type (ST213: n = 15/102; ST222: n = 15/120; Table S2), with the most common co-occurrence between CAS-type I-C and CAS-type II-B (ST213: n = 6/102, ST222: n = 7/120, ST289: n = 3/4). Deecker and Ensminger (33) showed that repopulation of depleted CRISPR arrays via an alternate CRISPR-Cas system could happen in organisms with co-occurring CRISPR-Cas systems. In the five ST222 isolates examined by Rao et al. (29), they observed a core set of 43 spacers along with variation in the number of spacers depending on the isolate, which is in line with the isolates examined in this study with a mode of 43 spacers (min = 6 and max = 89). Interestingly, our data suggest that both vertical and horizontal transmissions have likely affected the presence of the type I-F CRISPR-Cas system. For example, we identified the presence of CRISPR-Cas type I-F in two (of the six) ST213 isolates collected from Minnesota for three consecutive years. These isolates do not have an ANI value reflective of clonality (ANI ≥99.99%). Thus, this shared CRISPR-Cas system identified in non-clonal isolates collected over time suggests vertical transmission and maintenance of this system in these isolates. Additionally, an ST222 isolate with the same CRISPR-Cas system (type I-F), collected from Minnesota in 2018, suggests a possible horizontal transfer between ST213 and ST222. However, we cannot rule out the presence of type I-F CRISPR-Cas before the divergence of ST213 and ST222 since this CRISPR-Cas type was identified in ST222 isolates collected as early as 2013 in Michigan (Fig. 3). Additionally, we identified the type I-A system in two isolates, one each of ST213 and ST222.
Conserved synteny and few loci distinguish the sequence types ST213 and ST222
Exploration of large-scale genetic similarities between ST213 and ST222 based on four representative completely closed genomes confirms the high degree of genomic content conservation and synteny (Fig. 4A), as demonstrated in previous work (34). ST213 and ST222 isolates have three local collinear blocks (LCB) or regions of continuous shared sequence (Fig. 4A). ST213 isolates have two additional LCBs, which are indels between the two ST222 sequences that are less than 50 kb in total out of 3.5 Mb (Fig. 4A) starting at 0 Mb (C177, ST213) and 3.5 Mb (C217, ST213). In contrast to other work (35, 36), genome rearrangements appear less prevalent for the four isolates analyzed in this study (Fig. 4A). For example, only a tiny region at position ~200 kb in C217 and at ~2,700 kb in the other genomes is affected by rearrangement (Fig. 4A). This rearrangement generates the split between the two prominent local collinear blocks. Finally, one approximately 10 kb region located at ~200 kb in NZ_CP012019.1 (ST222) has no homologs in the three other genomes (Fig. 4A).
Fig 4.
Whole-genome alignment for three Pac-Bio genomes. (A) Mauve-generated figure of the three Pac-Bio genomes (ST222, n = 1 and ST213, n = 2) along with the Toronto-20052 genome (NZ_CP012019.1, ST222) as the reference. The order of the alignment is as follows: C177 (ST213), C217 (ST213), C175 (ST222), and NZ_CP012019.1 (ST222). (B) The CGView plot was generated for the three Pac-Bio genomes, using the NZ_CP012019.1 genome (ST222) as the reference. The order of the plot is as follows: NZ_CP012019.1 (ST222, inner black ring), C177 (ST213, dark red), C217 (ST213, lighter red), and the outermost circle is C175 (ST222, dark blue). Percent identity is displayed for the three isolates, with bins of 100%, 90%, or 70% matching sequence identity, and was calculated based on the NZ_CP012019.1 genome (ST222).
Further investigation of genomic differences between ST213 and ST222 found variation in the presence/absence of some genes when comparing the three individual completely closed genomes to all genes from the reference genome (NZ_CP012019, ST222). The C217 (ST213) isolate contains additional genetic material at ~188 kb, absent from the other genomes (C177, ST213; C175, ST222; NZ_CP012019, ST222; Fig. 4B). The content of those regions was inferred via Prokka (37) and based on the concordant proteins from the reference genome (NZ_CP012019, ST222), which was annotated using the NCBI Prokaryotic Genome Annotation Pipeline (38). This region includes CsrA (WP_061784083.1, Table S6), VirB4 (WP_061784085.1, Table S6), and seven other type IV secretion system proteins, along with two hypothetical proteins (Table S6). virB4 is considered the most evolutionarily conserved component of the type IV secretion system (39–41), with a known function in pilus biogenesis, substrate transfer, and virulence (42, 43). Some regions are missing in the ST213 isolates compared to the ST222 isolates (i.e., at ~90 kb, ~3 kb in size; at ~950 kb, ~10 kb in size; Fig. 4B). These regions contain eight genes (Table S7) that were identified via Prokka (37) and, based on the concordant proteins from the reference genome (NZ_CP012019, ST222), include five hypothetical proteins, a glycosyltransferase family 4 protein, a right-handed parallel beta-helix repeat-containing protein, and a methyltransferase domain-containing protein. Of note, WP_080453446.1 is identifiable in the ST213 isolates (Table S7); however, due to sequence divergence, it only aligns partly. Further examination of the hypothetical proteins using the CD-search interface available at the NCBI conserved domain database (44) revealed that one hypothetical protein has a predicted tetratricopeptide repeat (TRP; see Discussion for additional details).
Pan-genome analysis highlights genes for regional source attribution
For all pan-genome analyses, “core genes” are found in 100% of the isolates or subgroups, while “accessory genes” are not (Table S8). When parsing isolates by region, the Northeast region (NE: n = 61) had 2,796 core genes with 1,828 accessory genes. The Central region (C: n = 36) had 2,736 core genes with 1,286 accessory genes. The North Central region (NC: n = 90) had 2,669 core genes with 2,315 accessory genes (Fig. S3). The percent of core genes ranges from 54% to 68% (NE: Core = 60%, C: Core = 68%, and NC: Core = 54%), while the percent of singleton genes ranges from 10% to 18% (NE: Singleton = 15%, C: Singleton = 10%, and NC: Singleton = 18%). This genetic variation, either conserved or variable in the regional bins of interest, could serve as possible molecular targets for source attribution (Fig. S3). To illustrate the potential discovery of genes as the number of genomes assessed increases, we generated a rarefaction curve for the regional bins. Analysis of Northeast and Central regions indicates missing genes, as both curves (core genes and total genes or the pan-genome) do not reach a plateau. This lack of plateau means adding more isolates would increase accessory and possibly alter core genes identified. In contrast, the North Central region appears to have adequate sampling for characterizing genes for this region’s core, accessory, and ultimately the pan-genome (Fig. S3). Temporal comparisons of 5-year bins are detailed in the supplemental methods.
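The core-gene percentages quoted above are consistent with dividing the core gene count by the sum of core and accessory genes for each region. The following sketch shows that arithmetic using the counts from the text; treating the regional pan-genome as core plus accessory genes is an assumption inferred from the reported percentages, and the function name `percent_core` is illustrative.

```python
def percent_core(core_genes, accessory_genes):
    """Percent of the pan-genome (core + accessory) made up of core genes."""
    pan_genome = core_genes + accessory_genes
    return 100.0 * core_genes / pan_genome


# (core genes, accessory genes) per region, as reported in the text
regions = {
    "NE": (2796, 1828),
    "C": (2736, 1286),
    "NC": (2669, 2315),
}

# Round to whole percentages for comparison with the reported values
core_pct = {
    name: round(percent_core(core, accessory))
    for name, (core, accessory) in regions.items()
}
```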
DISCUSSION
Here, we observed that the distribution of the ST213 and ST222 isolates in the U.S. has expanded from the states of Pennsylvania and Ohio, where they were initially detected in the 1990s, to Maine in the east, Missouri in the south, and Oregon in the west as of 2021. The North American Great Lakes region is centered around several sizeable interconnected freshwater lakes in the mid-east of North America. It includes eight U.S. states and the Canadian province of Ontario. These STs were first identified in Pennsylvania and Ohio, states in the Great Lakes region. Despite their geographic spread throughout the U.S., ST213/ST222-associated LD cases remain concentrated in Michigan, New York, Minnesota, and Ohio, all states of the Great Lakes region. Because input data are limited to isolates in the CDC Legionella collection, which are submitted to the CDC by public health partners or collected by the CDC, there may be a bias in the geographic locations investigated, as only states that submit isolates or samples are included in these analyses. Nevertheless, the CDC Legionella collection contains data from 48 U.S. states.
The mechanism of the ST222 spread over 4,000 km (from Pennsylvania to Oregon) in 28 years is unknown since Legionella is a waterborne bacterium that is not transmitted from person to person (see exception: reference 45). The humid continental climate of the Great Lakes region may be favorable for these STs of L. pneumophila. However, ST222 isolates have also been detected in the much drier climate of Colorado since 2011 (Table S2). Interestingly, ST222 has also expanded its geographic distribution in Canada, moving from Ontario to the cooler, drier province of Alberta, where it caused an LD outbreak in Calgary in the late fall of 2012 (25). The relative humidity in Calgary was above average during the fall of 2012, so the expansion of the geographic distribution of this ST could be associated with the changes in humidity. Thus, future work should evaluate the role of temperature and humidity in ST213/222 distribution. Notably, the ESGLI database lists ST213 isolates identified in the Netherlands and the Czech Republic and ST222 isolates identified in Austria, the Czech Republic, Germany, and the United Kingdom. In addition, LD associated with ST213 has been reported in Australia (46). There is, however, uncertainty about the origin of ST213/ST222-associated infections detected outside of North America since patient history and travel information are relatively scarce. If these STs have indeed spread to the freshwater systems of Europe and Australia, the mechanism of their possible intercontinental dispersion should be investigated along with genetic relatedness for isolates found internationally.
One method for quantifying the level of relatedness between isolates is by estimating ANI. Traditionally, ANI values greater than 95% between two isolates indicate a single species (47); notably, ANI values for some L. pneumophila subspecies are below this threshold with a 90.5%–96.4% range (48). However, the cutoff for classification methods below species level (e.g., sequence types) depends on the organism of interest and the classification method. Here, we see high but distinct ANI values between the two STs analyzed, with the minimum ANI value of 99.5% and the highest ANI value of 99.87% between ST213 and ST222. In contrast, the average ANI values within the STs were high: 99.92% for ST213 and ST222, suggesting that ANI can differentiate between the two STs. When examining the ANI values for isolates collected during an outbreak (e.g., Ohio 2013 C165 and C219), the ANI value between those isolates was ≥99.99% (Table S4), indicating clonality between these isolates. While ANI measures nucleotide identity in all orthologous genes shared between two genomes, it does not strictly represent core genome evolutionary relatedness because orthologous genes can vary between comparisons (49) nor does it begin to address variation in accessory genes; thus, some non-conserved genetic regions may have been excluded from ANI estimates between ST213 and ST222.
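The clonality criterion mentioned above (ANI ≥99.99%) can be expressed as a simple threshold filter over pairwise ANI values. This is a minimal sketch under stated assumptions: the isolate labels and ANI values below are hypothetical placeholders rather than values from Table S4, and `clonal_pairs` is an illustrative helper name.

```python
CLONAL_ANI_THRESHOLD = 99.99  # percent; the clonality cutoff quoted in the text


def clonal_pairs(pairwise_ani, threshold=CLONAL_ANI_THRESHOLD):
    """Return the sorted isolate pairs whose ANI meets the clonality cutoff."""
    return sorted(
        pair for pair, ani in pairwise_ani.items() if ani >= threshold
    )


# Hypothetical isolate pairs and ANI values for illustration only
example_ani = {
    ("iso_a", "iso_b"): 99.995,  # e.g., two isolates from the same outbreak
    ("iso_a", "iso_c"): 99.92,   # same ST, not clonal
    ("iso_b", "iso_d"): 99.60,   # different STs
}

flagged = clonal_pairs(example_ani)
```

A fixed cutoff like this is only a screening step; as noted in the text, ANI is computed over shared orthologous regions and does not capture accessory-gene differences.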
As plasmids play a role in genomic diversification, examining their number and identity may explain why ST222 has thus far been more often associated with outbreaks than ST213 or why there have been increased clinical cases due to both STs. For ST213, 47 out of 102 (46%) isolates harbored any plasmid, and for ST222, 10 out of 120 (8%) isolates harbored any plasmid. The most prevalent plasmid in ST213/222 isolates is the Lens plasmid, pLPL, which encodes a small RNA plasmid factor (RocRp) associated with silencing natural transformation (50). An L. pneumophila strain (Lens) harboring this plasmid was the source of a large outbreak in northern France (51). For the STs examined here, 38 ST213 isolates out of 102 had the pLPL plasmid compared to 10 out of 120 ST222 isolates. Thus, genetic exchange via horizontal gene transfer (HGT) may be limited for these STs, though further molecular characterization is required. The identification of plasmids in multiple Legionella species has provided evidence supporting the possibility of inter-species gene flow (52–54). For example, the plasmid pLA01-117_150k was first reported in L. sainthelensi, and we identified it here in L. pneumophila isolates (ST213, n = 8 and ST289, n = 1). Those isolates with pLA01-117_150k do not form a monophyletic clade (Fig. 3), suggesting also possible intra-specific gene flow. Given the low frequency of plasmids in ST222 and the higher prevalence in ST213, one could speculate that these plasmids may not confer any outbreak-associated advantage.
Another mechanism that may limit HGT is the protection CRISPR-Cas provides to restrict DNA uptake from phage or mobile genetic elements. We identified 227 isolates with a CRISPR-Cas system (ST213: n = 102, ST222: n = 117, ST227: n = 2, ST276: n = 1, ST289: n = 4, and ST1742: n = 1). One set of isolates (n = 13) harbored the type I-F CRISPR-Cas system and the Lens plasmid pLPL. This combination suggests this CRISPR system is likely located on the endogenous pLPL plasmid, generally considered a non-integrative plasmid (50, 55). Additionally, we identified one isolate with the type I-F CRISPR-Cas system without the Lens plasmid pLPL (Fig. 3), which suggests this CRISPR-Cas system is on the chromosome rather than located extrachromosomally (33). Rao et al. (29) showed that type I-C cas genes, the most common CRISPR system in the isolates examined here (ST213: 102/102 and ST222: n = 117/120), are upregulated during post-exponential growth and are involved with targeted acquisition of spacers generating protection against foreign elements. Identification of this CRISPR system in most of our isolates suggests a potential barrier to the uptake and spread of mobile genetic elements. Thus, the type I-C system may limit genetic variation via HGT.
Among the three PacBio genomes analyzed here (ST222: n = 1 and ST213: n = 2), we identified WP_061784204.1 (~955 kb) as containing a hypothetical tetratricopeptide repeat (TPR) domain in the ST222 isolates (Table S7). TPR domains are present in numerous bacterial proteins with functions that facilitate interpreting the surrounding environment (56), including proteins utilized for gene regulation, flagellar motor function, chaperone activity, and virulence (57). L. pneumophila genomes encode multiple proteins with predicted TPR motifs. Newton et al. (58) demonstrated that one such protein enhanced the uptake of L. pneumophila into human host cells (macrophage-like cell line, THP-1, and the type II alveolar epithelial cell line, A549), suggesting that these TPR domain-containing proteins may act as virulence factors. Bandyopadhyay et al. (59) further supported this hypothesis by demonstrating that proteins with the TPR domain may function as virulence factors during the aquatic and environmental L. pneumophila life cycle. Yet, given the unknown function of WP_061784204.1, further characterization is needed to determine if it provides an advantage during infection.
We have described an increased identification and expanded geographic range of ST213 and ST222 with accompanying genetic analysis. Our data suggest that both STs will likely continue to cause sporadic LD cases and possibly outbreaks. Factors precipitating the increased identification of these STs remain unknown, with possibilities including an overall increase in LD, an increase in LD diagnosis due to more widely available and improved diagnostics and increased awareness and testing for LD, or proliferation of L. pneumophila related to environmental selection or competition. Furthermore, the basis for the more frequent ST222 association with outbreaks remains elusive, given that the two STs have very high ANI values, substantial genomic synteny, and many shared CRISPR-Cas systems. Future studies could evaluate the role of environmental or other genetic factors in the increase and persistence of these STs in the natural environment.
MATERIALS AND METHODS
The CDC Legionella isolate collection consists of isolates recovered from environmental samples or clinical specimens by the CDC Legionella Laboratory and clinical or environmental isolates directly submitted by public health laboratories. Specimens and isolates are voluntarily submitted and may represent sporadic or outbreak-associated cases. The methods below are abbreviated; see the supplemental methods in the supplemental material for additional information.
ST213/222 isolate selection
The data set comprised a total of 230 isolates, which included isolates sequenced by state partners and submitted to the CDC for analysis (n = 124, Table S2) and isolates that were part of the CDC Legionella collection and sequenced by the CDC Legionella Laboratory (n = 106, Table S2). The ST213/222 clonal complex contains all STs sharing six out of seven loci with at least one other complex member and descended from that sequence type found in the ESGLI SBT database. This study includes any isolate for which the ST has been identified and belongs to one of the 11 STs comprising the ST213/222 clonal complex. Six of these STs are represented in our data set: ST213 (n = 102), ST222 (n = 120), ST227 (n = 2), ST276 (n = 1), ST289 (n = 4), and ST1742 (n = 1). Isolates belonging to the remaining five STs of the clonal complex (ST1601, ST2497, ST2517, ST2519, and ST2728) have not been identified by CDC or state partners; thus, we do not have genomic data for these STs. Sequences were derived from isolates collected from 1992 to early 2020, spanning the states from Colorado to Maine (Mountain West to Northeast; Table S2). We excluded isolates from rate, regional, and temporal analyses if they had no corresponding collection year (n = 8). We also excluded isolates from the rate analysis if they had no corresponding source type (n = 11). Note that this collection is not considered routine surveillance, nor does it comprise an equal set of clinical and environmental isolates: it has more clinical (n = 200) than environmental isolates (n = 19), along with some isolates of unknown source type (n = 11).
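The six-of-seven-loci rule can be illustrated with a short Python sketch that links any two STs differing at exactly one of seven loci and reports the resulting groups. The ST labels and allele numbers below are hypothetical placeholders, not ESGLI allelic profiles, and the sketch does not model the descent criterion mentioned above.

```python
from itertools import combinations


def loci_differences(profile_a, profile_b):
    """Count the loci at which two seven-locus allelic profiles differ."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def clonal_complexes(profiles):
    """Group STs so that each member is a single-locus variant of another member."""
    parent = {st: st for st in profiles}

    def find(st):
        # Follow parent links to the group representative (with path halving).
        while parent[st] != st:
            parent[st] = parent[parent[st]]
            st = parent[st]
        return st

    for st_a, st_b in combinations(profiles, 2):
        if loci_differences(profiles[st_a], profiles[st_b]) == 1:
            parent[find(st_a)] = find(st_b)  # merge the two groups

    groups = {}
    for st in profiles:
        groups.setdefault(find(st), set()).add(st)
    # Largest group first, ties broken alphabetically.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


# Hypothetical allelic profiles (seven loci each), for illustration only.
profiles = {
    "ST_A": (1, 4, 3, 1, 1, 1, 1),
    "ST_B": (1, 4, 3, 1, 1, 1, 9),   # single-locus variant of ST_A
    "ST_C": (1, 4, 3, 1, 1, 6, 9),   # single-locus variant of ST_B only
    "ST_D": (2, 10, 5, 7, 8, 3, 4),  # unrelated profile
}
complexes = clonal_complexes(profiles)
```

Note that ST_C joins the complex through ST_B even though it differs from ST_A at two loci, which mirrors the "at least one other complex member" wording.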
ST1 isolate selection
The data set comprised a total of 214 isolates, which included isolates sequenced by state partners submitted to the CDC for analysis (n = 90, Table S3) and isolates that were part of the CDC Legionella collection sequenced by the CDC Legionella Laboratory (n = 124). We include ST1 counts of clinically associated cases for comparison but do not include any genomic data related to ST1. In selecting ST1 isolates, we required each to have a collection year between 1992 and 2020, a clinical source type, and a documented collection within a U.S. state.
Estimating the rate of LD per year with forecasting
We used a linear model to evaluate the annual rate of LD cases over time and predict the yearly future rate for ST213/222 and ST1; case counts exclude isolates from environmental sources. For years with zero ST213/222 cases (e.g., 1993–1995), we estimated the annual frequency by extending the range in each direction until a year with cases is included on each side (1992–1996) and then dividing the number of cases in that range by the number of years in the range (i.e., number of cases: n = 2; the number of years: n = 5). The resulting annual frequency of cases was 0.4 per year, and we set this as the frequency for the entire range between 1992 and 1996. We repeated the same steps for 1998–2000, where there were two cases over 3 years, resulting in a rate of 0.67 ST213/222 cases for all 3 years in the range. We employed a linear model (y = mx + b) using the log value of the rates as the predicted, dependent variable (y), with time in years as the predictor, independent variable (x). Next, we exponentiated the prediction to quantify the expected rate for the future (2021–2030) and estimated the uncertainty using the prediction intervals. We repeated these steps for ST1 (n = 214, Table S3). Before fitting our regression model to time series data for ST213/222 and ST1, we evaluated the residuals to confirm that our assumptions for the model had been met. From the residual diagnostics, we demonstrated an approximate center of zero for the histogram of residuals and that there were no lags outside of the 95% threshold for the autocorrelation function using the checkresiduals() function from the forecast package v8.17.0 (60) (Fig. S1 and S2).
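The smoothing of zero-case years and the log-linear fit can be sketched in Python as follows. This is a minimal illustration rather than the analysis code used for the study (which was run in R): the function names are hypothetical, the yearly counts are invented so that the 1992–1996 window reproduces the 2 cases over 5 years described above, and prediction intervals are not computed.

```python
import math


def smooth_zero_years(counts):
    """Average each run of zero-case years with the flanking years that have cases.

    counts maps consecutive years to case counts. Every year in the widened
    window receives the same rate (cases in window / years in window).
    """
    years = sorted(counts)
    rates = {y: float(counts[y]) for y in years}
    i = 0
    while i < len(years):
        if counts[years[i]] != 0:
            i += 1
            continue
        j = i
        while j < len(years) and counts[years[j]] == 0:
            j += 1  # j stops at the first year after the zero run
        lo = max(i - 1, 0)            # flanking year before the run, if any
        hi = min(j, len(years) - 1)   # flanking year after the run, if any
        window = years[lo:hi + 1]
        mean_rate = sum(counts[y] for y in window) / len(window)
        for y in window:
            rates[y] = mean_rate
        i = j
    return rates


def fit_log_linear(rates):
    """Ordinary least squares for log(rate) = m * year + b; returns (m, b)."""
    xs = list(rates)
    ys = [math.log(rates[x]) for x in xs]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_bar - m * x_bar


def predict_rate(m, b, year):
    """Exponentiate the fitted line to return a rate on the original scale."""
    return math.exp(m * year + b)


# Invented counts: two cases across 1992-1996, so each of those years becomes 0.4.
example_counts = {1992: 1, 1993: 0, 1994: 0, 1995: 0, 1996: 1, 1997: 2}
example_rates = smooth_zero_years(example_counts)
slope, intercept = fit_log_linear(example_rates)
forecast_2021 = predict_rate(slope, intercept, 2021)
```

Fitting on the log scale keeps predicted rates positive after exponentiation, which is why the zero-case years must be smoothed before the logarithm is taken.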
Sequencing and genome assembly
DNA from isolates sequenced at CDC (n = 106) was extracted using the Epicentre Masterpure DNA and RNA Purification Kit (cat. no. MCD85201) per the manufacturer’s instructions and sequenced on the Illumina MiSeq. State public health partners used their routine DNA extraction and sequencing protocols and then shared sequences with the CDC (n = 124). Standard bioinformatic analyses for all isolates included read trimming via Trimmomatic v0.39 (61) and de novo genome assembly via SPAdes v3.13.0 (62).
Sequence types, clonal complex analysis, and phylogenetics
We determined ST bioinformatically for all isolates using the whole-genome sequencing data and the el_gato program (https://github.com/appliedbinf/el_gato). PHYLOViZ v2.0 (63) created a minimum spanning tree of the 11 STs to visualize the ST213/222 clonal complex. To understand the relationship between ST213/222 and other STs, we produced a core single nucleotide polymorphism alignment via SNIPPY (64) using the NZ_CP012019.1 genome (ST222) as the reference. Next, we removed regions of recombination using Gubbins v3.0.0 (65). We inferred a maximum likelihood phylogenetic tree with a general time reversible evolutionary model, a discrete gamma-distribution to estimate heterogeneity across sites (GTRGAMMA), and 100 bootstrap replicates using RAxML v8.2.12 (66). The phylogeny was visualized and annotated in R v4.0.5 (67) using the ape package v5.6.2 (68) and ggtree v3.6.2 (69). In addition, we generated a distance measure for all ST213/222 isolates from all-against-all pairwise ANI values and constructed a neighbor-joining phylogenetic tree (70).
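A pairwise ANI table can be turned into a symmetric distance matrix before tree building, as in the Python sketch below. The conversion used here (distance = 1 − ANI/100, averaging the two directions when both are reported) is an assumption for illustration; the text does not state the exact distance measure. Isolate names and ANI values are hypothetical.

```python
def ani_to_distance(pairs):
    """Build a symmetric distance matrix from pairwise ANI percentages.

    pairs maps (query, reference) to ANI in percent. When both directions are
    present they are averaged; the distance is 1 - ANI / 100 (an assumption).
    """
    names = sorted({name for pair in pairs for name in pair})
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0.0] * len(names) for _ in names]
    for a in names:
        for b in names:
            if a == b:
                continue  # the diagonal stays at zero
            values = [pairs[key] for key in ((a, b), (b, a)) if key in pairs]
            if not values:
                raise ValueError(f"no ANI value for {a} vs {b}")
            matrix[index[a]][index[b]] = 1.0 - (sum(values) / len(values)) / 100.0
    return names, matrix


# Hypothetical ANI values (percent) for three isolates.
ani_pairs = {
    ("iso1", "iso2"): 99.90,
    ("iso2", "iso1"): 99.94,
    ("iso1", "iso3"): 99.50,
    ("iso2", "iso3"): 99.60,
}
names, distances = ani_to_distance(ani_pairs)
```

The resulting matrix could then be passed to any neighbor-joining implementation.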
Plasmid and CRISPR-Cas system identification
We identified plasmids in our data using the program PIMA (https://github.com/appliedbinf/pima), which uses the RefSeq database (71) sequences between 1,000 and 500,000 bp with “plasmid” in the description. CRISPR-Cas Finder v4.2.20 (72) was used to identify CRISPR-Cas systems.
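The reference-selection criteria described above (length between 1,000 and 500,000 bp and "plasmid" in the description) can be expressed as a simple filter over FASTA records. The Python sketch below is not PIMA's implementation; the function names, the inclusive length bounds, and the case-insensitive match are assumptions, and the records are synthetic.

```python
def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_plasmid_references(records, min_len=1_000, max_len=500_000):
    """Keep records whose description mentions 'plasmid' and whose length is in range."""
    return [
        (desc, seq)
        for desc, seq in records
        if "plasmid" in desc.lower() and min_len <= len(seq) <= max_len
    ]


# Synthetic records: only the first satisfies both criteria.
fasta_text = "\n".join([
    ">rec1 hypothetical organism plasmid pX", "A" * 1200,
    ">rec2 hypothetical organism chromosome", "C" * 2000,
    ">rec3 hypothetical organism plasmid pY", "G" * 500,
])
selected = select_plasmid_references(parse_fasta(fasta_text))
```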
Gene annotation and pan-genome analysis
We annotated the genes using Prokka v1.14.5 (37), specifying a Legionella-specific BLAST database. We performed gene clustering and pan-genome analysis using Roary v3.11.2 (73), setting the -cd parameter to 100, which requires genes to be present in 100% of the isolates to be considered core, and we provide the presence/absence of those genes for all isolates in Table S8. Before pan-genome comparisons, we filtered to remove clonal outbreak isolates based on an ANI score of ≥99.99%, keeping only one isolate per outbreak (Table S2). We compared isolates recovered from 2001 through 2020 within geographic and temporal subgroups. Additional information regarding subgroups and pan-genome analyses is provided in the supplemental methods, along with figures that summarize this information (Fig. S3 and S4).
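The two filtering ideas in this paragraph, keeping one representative per group of near-identical isolates and calling a gene core only when every isolate carries it, can be sketched in Python. The greedy selection shown here is an assumption made for illustration and may differ from the procedure in the supplemental methods; isolate names, gene names, and ANI values are hypothetical.

```python
def dereplicate(isolates, ani, threshold=99.99):
    """Greedily keep an isolate only if its ANI to every kept isolate is below threshold.

    ani maps frozenset({isolate_a, isolate_b}) to ANI in percent.
    """
    kept = []
    for isolate in isolates:
        if all(ani[frozenset((isolate, other))] < threshold for other in kept):
            kept.append(isolate)
    return kept


def core_genes(presence, isolates):
    """Return genes carried by 100% of the given isolates.

    presence maps each gene to the set of isolates in which it was found.
    """
    wanted = set(isolates)
    return sorted(gene for gene, carriers in presence.items() if wanted <= carriers)


# Hypothetical inputs: two isolates from one outbreak and one sporadic isolate.
isolates = ["outbreak1_a", "outbreak1_b", "sporadic_1"]
ani = {
    frozenset(("outbreak1_a", "outbreak1_b")): 99.995,
    frozenset(("outbreak1_a", "sporadic_1")): 99.90,
    frozenset(("outbreak1_b", "sporadic_1")): 99.91,
}
presence = {
    "geneX": {"outbreak1_a", "outbreak1_b", "sporadic_1"},
    "geneY": {"outbreak1_a", "outbreak1_b"},
    "geneZ": {"outbreak1_a", "sporadic_1"},
}
representatives = dereplicate(isolates, ani)
core = core_genes(presence, representatives)
```

Dereplicating first matters because the core set is evaluated over the retained isolates only; here geneZ counts as core once the second outbreak isolate is dropped.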
Obtaining completely closed L. pneumophila genomes using Pacific Biosciences technology
We sequenced three isolates via Pacific Biosciences (Menlo Park, CA, USA): C175 (ST222), C177 (ST213), and C217 (ST213). We performed sequencing analysis using the hierarchical genome assembly process (HGAP) through SMRT Analysis version 2.3.0. To construct a completely closed L. pneumophila genome, we used the Pacific Biosciences HGAP3 assembler (74). Parameters for the HGAP3 assembler included an expected genome size of 3.4 Mb and 15× target genome coverage. CGView was used to compare the PacBio completed whole-genome assemblies (75), and percent identity was based on the Toronto-20052 genome assembly (NZ_CP012019, ST222) to generate a comparative genomic map. To evaluate genomic synteny, we ran progressiveMauve (76).
ACKNOWLEDGMENTS
We thank public health professionals of Colorado, Indiana, Massachusetts, Michigan, Minnesota, New York, and Ohio for collecting and processing clinical and environmental samples and generating and sharing whole-genome sequences. We thank T. Johnson for generating sequences of the complete ST213 and ST222 genomes using Pacific Biosciences technology. We also thank S. S. Morrison and J. A. Caravas for their previous work on L. pneumophila genomes. We are grateful to the curators of the ESGLI database, especially B. Afshar, for providing data on the distribution of ST213 and ST222 isolates worldwide and assisting with allele and sequence type assignment for standard SBT runs. We thank I. Nemenman and J. L. Hite for thoughtful discussions regarding regression analyses and E. F. C. Scopel for R plotting finesse. Finally, we thank J. C. Smith, W. C. Edens, and A. L. Cohen for thoughtful discussions on the data presented here.
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
This study is subject to several limitations. Caution is warranted in interpreting these analyses due to the inherent bias in the specimens, isolates, and sequences available for analysis. Note that the clinical cases in the input data are limited to isolates available in the CDC Legionella collection for which the ST has been determined; the data therefore likely underestimate the prevalence of ST1 and ST213/222 and may in part reflect the recent overall increase in LD.
The use of trade names is for identification only. It does not constitute endorsement by the U.S. Department of Health and Human Services, the U.S. Public Health Service, or the Centers for Disease Control and Prevention. The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
Contributor Information
Jennafer A. P. Hamlin, Email: jhamlin@cdc.gov.
Xiyang Dong, Third Institute of Oceanography Ministry of Natural Resources, Xiamen, China.
DATA AVAILABILITY
Illumina data sequenced in-house or from the state partners were submitted to the NCBI Sequence Read Archive (SRA) under BioProject PRJNA861313. NCBI BioProject PRJNA931318 contains the complete, PacBio-generated error-corrected genomes sequenced in this study. Isolates used in this analysis are available upon request. Code for this paper is available at https://github.com/jennahamlin/codeForManuscript-ST213_222.
SUPPLEMENTAL MATERIAL
The following material is available online at https://doi.org/10.1128/msphere.00756-23.
Predictive effect of mutations in neuA gene between ST213 and ST222.
Metadata for ST213/222 and ST1.
Average nucleotide identity for all ST213/ST222 isolates and ST1 isolates.
Genes only found in C217 (ST213) compared to the two PacBio genomes (ST213 and ST222) and genes only found in C175 (ST222) compared to the two ST213 PacBio genomes.
Presence/absence matrix of genes generated by Roary, with genome annotations from Prokka for all genomes.
Metadata for isolates used to generate the ST phylogeny (Fig. 2a).
Presence/absence matrix of Toronto 20052 (NZ_CP012019.1) locus tags identified in all genomes analyzed.
Additional information regarding methods and results; legends for supplemental figures and tables.
Figures S1-S4.
ASM does not own the copyrights to Supplemental Material that may be linked to, or accessed through, an article. The authors have granted ASM a non-exclusive, world-wide license to publish the Supplemental Material files. Please contact the corresponding author directly for reuse.
REFERENCES
- 1. Fang GD, Yu VL, Vickers RM. 1989. Disease due to the Legionellaceae (other than Legionella pneumophila). Historical, microbiological, clinical, and epidemiological review. Medicine 68:116–132. doi: 10.1097/00005792-198903000-00005
- 2. Yu VL, Plouffe JF, Pastoris MC, Stout JE, Schousboe M, Widmer A, Summersgill J, File T, Heath CM, Paterson DL, Chereshsky A. 2002. Distribution of Legionella species and serogroups isolated by culture in patients with sporadic community‐acquired legionellosis: an international collaborative survey. J Infect Dis 186:127–128. doi: 10.1086/341087
- 3. Gaia V, Fry NK, Harrison TG, Peduzzi R. 2003. Sequence-based typing of Legionella pneumophila serogroup 1 offers the potential for true portability in legionellosis outbreak investigation. J Clin Microbiol 41:2932–2939. doi: 10.1128/JCM.41.7.2932-2939.2003
- 4. Gaia V, Fry NK, Afshar B, Lück PC, Meugnier H, Etienne J, Peduzzi R, Harrison TG. 2005. Consensus sequence-based scheme for epidemiological typing of clinical and environmental isolates of Legionella pneumophila. J Clin Microbiol 43:2047–2052. doi: 10.1128/JCM.43.5.2047-2052.2005
- 5. Ratzow S, Gaia V, Helbig JH, Fry NK, Lück PC. 2007. Addition of neuA, the gene encoding N-acylneuraminate cytidylyl transferase, increases the discriminatory ability of the consensus sequence-based scheme for typing Legionella pneumophila serogroup 1 strains. J Clin Microbiol 45:1965–1968. doi: 10.1128/JCM.00261-07
- 6. Lück C, Fry NK, Helbig JH, Jarraud S, Harrison TG. 2013. Typing methods for Legionella. Methods Mol Biol 954:119–148. doi: 10.1007/978-1-62703-161-5_6
- 7. Barskey A, Lee S, Hannapel E, Smith J, Edens C. 2022. Legionnaires’ Disease Surveillance Summary Report, United States 2018-2019. Available from: https://www.cdc.gov/legionella/php/surveillance/?CDC_AAref_Val=https://www.cdc.gov/legionella/health-depts/surv-reporting/surveillance-reports.html
- 8. European Centre for Disease Prevention and Control. 2022. Legionnaires’ Disease - Annual epidemiological report for 2020. Available from: https://www.ecdc.europa.eu/en/publications-data/legionnaires-disease-annual-epidemiological-report-2020
- 9. Smith P, Moore M, Alexander N, Hicks L, O’Loughlin R. 2007. Surveillance for travel-associated Legionnaires disease - United States, 2005-2006. MMWR Morb Mortal Wkly Rep 56:1261–1263.
- 10. Centers for Disease Control and Prevention. 2021. Legionnaires disease outbreak considerations. Available from: https://www.cdc.gov/investigate-legionella/php/public-health-strategy/index.html. Retrieved 4 Aug 2024.
- 11. David S, Rusniok C, Mentasti M, Gomez-Valero L, Harris SR, Lechat P, Lees J, Ginevra C, Glaser P, Ma L, Bouchier C, Underwood A, Jarraud S, Harrison TG, Parkhill J, Buchrieser C. 2016. Multiple major disease-associated clones of Legionella pneumophila have emerged recently and independently. Genome Res 26:1555–1564. doi: 10.1101/gr.209536.116
- 12. Rota MC, Pontrelli G, Scaturro M, Bella A, Bellomo AR, Trinito MO, Salmaso S, Ricci ML. 2005. Legionnaires’ disease outbreak in Rome, Italy. Epidemiol Infect 133:853–859. doi: 10.1017/S0950268805004115
- 13. Ginevra C, Forey F, Campèse C, Reyrolle M, Che D, Etienne J, Jarraud S. 2008. Lorraine strain of Legionella pneumophila serogroup 1, France. Emerg Infect Dis 14:673–675. doi: 10.3201/eid1404.070961
- 14. Cazalet C, Jarraud S, Ghavi-Helm Y, Kunst F, Glaser P, Etienne J, Buchrieser C. 2008. Multigenome analysis identifies a worldwide distributed epidemic Legionella pneumophila clone that emerged within a highly diverse species. Genome Res 18:431–441. doi: 10.1101/gr.7229808
- 15. Harrison TG, Afshar B, Doshi N, Fry NK, Lee JV. 2009. Distribution of Legionella pneumophila serogroups, monoclonal antibody subgroups and DNA sequence types in recent clinical and environmental isolates from England and Wales (2000-2008). Eur J Clin Microbiol Infect Dis 28:781–791. doi: 10.1007/s10096-009-0705-9
- 16. Vanaclocha H, Guiral S, Morera V, Calatayud MA, Castellanos M, Moya V, Jerez G, Gonzalez F. 2012. Preliminary report: outbreak of Legionnaires disease in a hotel in Calp, Spain, update on 22 February 2012. Euro Surveill 17:20093. doi: 10.2807/ese.17.08.20093-en
- 17. Kozak-Muiznieks NA, Lucas CE, Brown E, Pondo T, Taylor TH, Frace M, Miskowski D, Winchell JM. 2014. Prevalence of sequence types among clinical and environmental isolates of Legionella pneumophila serogroup 1 in the United States from 1982 to 2012. J Clin Microbiol 52:201–211. doi: 10.1128/JCM.01973-13
- 18. Sánchez-Busó L, Coscollà M, Palero F, Camaró ML, Gimeno A, Moreno P, Escribano I, López Perezagua MM, Colomina J, Vanaclocha H, González-Candelas F. 2015. Geographical and temporal structures of Legionella pneumophila sequence types in Comunitat Valenciana (Spain), 1998 to 2013. Appl Environ Microbiol 81:7106–7113. doi: 10.1128/AEM.02196-15
- 19. Lévesque S, Lalancette C, Bernard K, Pacheco AL, Dion R, Longtin J, Tremblay C. 2016. Molecular typing of Legionella pneumophila isolates in the province of Quebec from 2005 to 2015. PLoS ONE 11:e0163818. doi: 10.1371/journal.pone.0163818
- 20. Ricci ML, Fillo S, Ciammaruconi A, Lista F, Ginevra C, Jarraud S, Girolamo A, Barbanti F, Rota MC, Lindsay D, Gorzynski J, Uldum SA, Baig S, Foti M, Petralito G, Torri S, Faccini M, Bonini M, Gentili G, Senatore S, Lamberti A, Carrico JA, Scaturro M. 2022. Genome analysis of Legionella pneumophila ST23 from various countries reveals highly similar strains. Life Sci Alliance 5:e202101117. doi: 10.26508/lsa.202101117
- 21. Kozak NA, Benson RF, Brown E, Alexander NT, Taylor TH, Shelton BG, Fields BS. 2009. Distribution of lag-1 alleles and sequence-based types among Legionella pneumophila serogroup 1 clinical and environmental isolates in the United States. J Clin Microbiol 47:2525–2535. doi: 10.1128/JCM.02410-08
- 22. Tijet N, Tang P, Romilowych M, Duncan C, Ng V, Fisman DN, Jamieson F, Low DE, Guyard C. 2010. New endemic Legionella pneumophila serogroup I clones, Ontario, Canada. Emerg Infect Dis 16:447–454. doi: 10.3201/eid1603.081689
- 23. Gilmour MW, Bernard K, Tracz DM, Olson AB, Corbett CR, Burdz T, Ng B, Wiebe D, Broukhanski G, Boleszczuk P, Tang P, Jamieson F, Van Domselaar G, Plummer FA, Berry JD. 2007. Molecular typing of a Legionella pneumophila outbreak in Ontario, Canada. J Med Microbiol 56:336–341. doi: 10.1099/jmm.0.46738-0
- 24. Quinn C, Demirjian A, Watkins LF, Tomczyk S, Lucas C, Brown E, Kozak-Muiznieks N, Benitez A, Garrison LE, Kunz J, Brewer S, Eitniear S, DiOrio M. 2015. Legionnaires’ disease outbreak at a long-term care facility caused by a cooling tower using an automated disinfection system—Ohio, 2013. J Environ Health 78:8–13.
- 25. Knox NC, Weedmark KA, Conly J, Ensminger AW, Hosein FS, Drews SJ, Legionella Outbreak Investigative Team. 2017. Unusual Legionnaires’ outbreak in cool, dry western Canada: an investigation using genomic epidemiology. Epidemiol Infect 145:254–265. doi: 10.1017/S0950268816001965
- 26. Wiens JJ. 2005. Can incomplete taxa rescue phylogenetic analyses from long-branch attraction? Syst Biol 54:731–742. doi: 10.1080/10635150500234583
- 27. Wiens JJ, Tiu J. 2012. Highly incomplete taxa can rescue phylogenetic analyses from the negative impacts of limited taxon sampling. PLoS ONE 7:e42925. doi: 10.1371/journal.pone.0042925
- 28. D’Auria G, Jiménez-Hernández N, Peris-Bondia F, Moya A, Latorre A. 2010. Legionella pneumophila pangenome reveals strain-specific virulence factors. BMC Genomics 11:181. doi: 10.1186/1471-2164-11-181
- 29. Rao C, Guyard C, Pelaz C, Wasserscheid J, Bondy-Denomy J, Dewar K, Ensminger AW. 2016. Active and adaptive Legionella CRISPR-Cas reveals a recurrent challenge to the pathogen. Cell Microbiol 18:1319–1338. doi: 10.1111/cmi.12586
- 30. Gunderson FF, Cianciotto NP. 2013. The CRISPR-associated gene cas2 of Legionella pneumophila is required for intracellular infection of amoebae. MBio 4:e00074-13. doi: 10.1128/mBio.00074-13
- 31. Gunderson FF, Mallama CA, Fairbairn SG, Cianciotto NP. 2015. Nuclease activity of Legionella pneumophila Cas2 promotes intracellular infection of amoebal host cells. Infect Immun 83:1008–1018. doi: 10.1128/IAI.03102-14
- 32. Deecker SR, Urbanus ML, Nicholson B, Ensminger AW. 2021. Legionella pneumophila CRISPR-Cas suggests recurrent encounters with one or more phages in the family Microviridae. Appl Environ Microbiol 87:e0046721. doi: 10.1128/AEM.00467-21
- 33. Deecker SR, Ensminger AW. 2020. Type I-F CRISPR-Cas distribution and array dynamics in Legionella pneumophila. G3 10:1039–1050. doi: 10.1534/g3.119.400813
- 34. Gomez-Valero L, Rusniok C, Jarraud S, Vacherie B, Rouy Z, Barbe V, Medigue C, Etienne J, Buchrieser C. 2011. Extensive recombination events and horizontal gene transfer shaped the Legionella pneumophila genomes. BMC Genomics 12:536. doi: 10.1186/1471-2164-12-536
- 35. Sánchez-Busó L, Comas I, Jorques G, González-Candelas F. 2014. Recombination drives genome evolution in outbreak-related Legionella pneumophila isolates. Nat Genet 46:1205–1211. doi: 10.1038/ng.3114
- 36. David S, Sánchez-Busó L, Harris SR, Marttinen P, Rusniok C, Buchrieser C, Harrison TG, Parkhill J. 2017. Dynamics and impact of homologous recombination on the evolution of Legionella pneumophila. PLoS Genet 13:e1006855. doi: 10.1371/journal.pgen.1006855
- 37. Seemann T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068–2069. doi: 10.1093/bioinformatics/btu153
- 38. Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky L, Lomsadze A, Pruitt KD, Borodovsky M, Ostell J. 2016. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res 44:6614–6624. doi: 10.1093/nar/gkw569
- 39. Fullner KJ, Lara JC, Nester EW. 1996. Pilus assembly by Agrobacterium T-DNA transfer genes. Science 273:1107–1109. doi: 10.1126/science.273.5278.1107
- 40. Fernández-López R, Garcillán-Barcia MP, Revilla C, Lázaro M, Vielva L, de la Cruz F. 2006. Dynamics of the IncW genetic backbone imply general trends in conjugative plasmid evolution. FEMS Microbiol Rev 30:942–966. doi: 10.1111/j.1574-6976.2006.00042.x
- 41. Wallden K, Rivera-Calzada A, Waksman G. 2010. Type IV secretion systems: versatility and diversity in function. Cell Microbiol 12:1203–1212. doi: 10.1111/j.1462-5822.2010.01499.x
- 42. Jones AL, Shirasu K, Kado CI. 1994. The product of the virB4 gene of Agrobacterium tumefaciens promotes accumulation of VirB3 protein. J Bacteriol 176:5255–5261. doi: 10.1128/jb.176.17.5255-5261.1994
- 43. Walldén K, Williams R, Yan J, Lian PW, Wang L, Thalassinos K, Orlova EV, Waksman G. 2012. Structure of the VirB4 ATPase, alone and bound to the core complex of a type IV secretion system. Proc Natl Acad Sci U S A 109:11348–11353. doi: 10.1073/pnas.1201428109
- 44. Marchler-Bauer A, Bryant SH. 2004. CD-Search: protein domain annotations on the fly. Nucleic Acids Res 32:W327–31. doi: 10.1093/nar/gkh454
- 45. Correia AM, Ferreira JS, Borges V, Nunes A, Gomes B, Capucho R, Gonçalves J, Antunes DM, Almeida S, Mendes A, Guerreiro M, Sampaio DA, Vieira L, Machado J, Simões MJ, Gonçalves P, Gomes JP. 2016. Probable person-to-person transmission of Legionnaires’ disease. N Engl J Med 374:497–498. doi: 10.1056/NEJMc1505356
- 46. Buultjens AH, Chua KYL, Baines SL, Kwong J, Gao W, Cutcher Z, Adcock S, Ballard S, Schultz MB, Tomita T, Subasinghe N, Carter GP, Pidot SJ, Franklin L, Seemann T, Gonçalves Da Silva A, Howden BP, Stinear TP. 2017. A supervised statistical learning approach for accurate Legionella pneumophila source attribution during outbreaks. Appl Environ Microbiol 83:e01482-17. doi: 10.1128/AEM.01482-17
- 47. Konstantinidis KT, Tiedje JM. 2005. Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci U S A 102:2567–2572. doi: 10.1073/pnas.0409727102
- 48. Kozak-Muiznieks NA, Morrison SS, Mercante JW, Ishaq MK, Johnson T, Caravas J, Lucas CE, Brown E, Raphael BH, Winchell JM. 2018. Comparative genome analysis reveals a complex population structure of Legionella pneumophila subspecies. Infect Genet Evol 59:172–185. doi: 10.1016/j.meegid.2018.02.008
- 49. Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S. 2018. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun 9:5114. doi: 10.1038/s41467-018-07641-9
- 50. Durieux I, Ginevra C, Attaiech L, Picq K, Juan P-A, Jarraud S, Charpentier X. 2019. Diverse conjugative elements silence natural transformation in Legionella species. Proc Natl Acad Sci U S A 116:18613–18618. doi: 10.1073/pnas.1909374116
- 51. NhuNguyen T, Ilef D, Jarraud S, Rouil L, Campese C, Che D, Haeghebaert S, Ganiayre F, Marcel F, Etienne J, Desenclos J. 2006. A community‐wide outbreak of Legionnaires disease linked to industrial cooling towers—how far can contaminated aerosols spread? J Infect Dis 193:102–111. doi: 10.1086/498575
- 52. Bacigalupe R, Lindsay D, Edwards G, Fitzgerald JR. 2017. Population genomics of Legionella longbeachae and hidden complexities of infection source attribution. Emerg Infect Dis 23:750–757. doi: 10.3201/eid2305.161165
- 53. Slow S, Anderson T, Biggs P, Kennedy M, Murdoch D, Cree S. 2018. Complete genome sequence of Legionella sainthelensi isolated from a patient with Legionnaires’ disease. Genome Announc 6:e01588-17. doi: 10.1128/genomeA.01588-17
- 54. Slow S, Anderson T, Murdoch DR, Bloomfield S, Winter D, Biggs PJ. 2022. Extensive epigenetic modification with large-scale chromosomal and plasmid recombination characterise the Legionella longbeachae serogroup 1 genome. Sci Rep 12:5810. doi: 10.1038/s41598-022-09721-9
- 55. Cazalet C, Rusniok C, Brüggemann H, Zidane N, Magnier A, Ma L, Tichit M, Jarraud S, Bouchier C, Vandenesch F, Kunst F, Etienne J, Glaser P, Buchrieser C. 2004. Evidence in the Legionella pneumophila genome for exploitation of host cell functions and high genome plasticity. Nat Genet 36:1165–1173. doi: 10.1038/ng1447
- 56. Blatch GL, Lässle M. 1999. The tetratricopeptide repeat: a structural motif mediating protein-protein interactions. Bioessays 21:932–939.
- 57. Cerveny L, Straskova A, Dankova V, Hartlova A, Ceckova M, Staud F, Stulik J. 2013. Tetratricopeptide repeat motifs in the world of bacterial pathogens: role in virulence mechanisms. Infect Immun 81:629–635. doi: 10.1128/IAI.01035-12
- 58. Newton HJ, Sansom FM, Bennett-Wood V, Hartland EL. 2006. Identification of Legionella pneumophila-specific genes by genomic subtractive hybridization with Legionella micdadei and identification of lpnE, a gene required for efficient host cell entry. Infect Immun 74:1683–1691. doi: 10.1128/IAI.74.3.1683-1691.2006
- 59. Bandyopadhyay P, Sumer EU, Jayakumar D, Liu S, Xiao H, Steinman HM. 2012. Implication of proteins containing tetratricopeptide repeats in conditional virulence phenotypes of Legionella pneumophila. J Bacteriol 194:3579–3588. doi: 10.1128/JB.00399-12
- 60. Hyndman RJ, Khandakar Y. 2008. Automatic time series forecasting: the forecast package for R. J Stat Softw 27:1–22. doi: 10.18637/jss.v027.i03
- 61. Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120. doi: 10.1093/bioinformatics/btu170 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62. Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. 2020. Using SPAdes de novo assembler. CP Bioinformatics 70:e102. doi: 10.1002/cpbi.102 [DOI] [PubMed] [Google Scholar]
- 63. Nascimento M, Sousa A, Ramirez M, Francisco AP, Carriço JA, Vaz C. 2017. PHYLOViZ 2.0: providing scalable data integration and visualization for multiple phylogenetic inference methods. Bioinformatics 33:128–129. doi: 10.1093/bioinformatics/btw582 [DOI] [PubMed] [Google Scholar]
- 64. Seemann T. 2015. SNIPPY: fast bacterial variant calling from NGS reads. https://github.com/tseemann/snippy.
- 65. Croucher NJ, Page AJ, Connor TR, Delaney AJ, Keane JA, Bentley SD, Parkhill J, Harris SR. 2015. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res 43:e15. doi: 10.1093/nar/gku1196
- 66. Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313. doi: 10.1093/bioinformatics/btu033
- 67. R Core Team. 2021. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available from: https://www.R-project.org/
- 68. Paradis E, Schliep K. 2019. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35:526–528. doi: 10.1093/bioinformatics/bty633
- 69. Yu G, Smith DK, Zhu H, Guan Y, Lam T-Y. 2017. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol Evol 8:28–36. doi: 10.1111/2041-210X.12628
- 70. Saitou N, Nei M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425. doi: 10.1093/oxfordjournals.molbev.a040454
- 71. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, et al. 2016. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44:D733–D745. doi: 10.1093/nar/gkv1189
- 72. Couvin D, Bernheim A, Toffano-Nioche C, Touchon M, Michalik J, Néron B, Rocha EPC, Vergnaud G, Gautheret D, Pourcel C. 2018. CRISPRCasFinder, an update of CRISRFinder, includes a portable version, enhanced performance and integrates search for Cas proteins. Nucleic Acids Res 46:W246–W251. doi: 10.1093/nar/gky425
- 73. Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MTG, Fookes M, Falush D, Keane JA, Parkhill J. 2015. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31:3691–3693. doi: 10.1093/bioinformatics/btv421
- 74. Chin C-S, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, Clum A, Copeland A, Huddleston J, Eichler EE, Turner SW, Korlach J. 2013. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods 10:563–569. doi: 10.1038/nmeth.2474
- 75. Stothard P, Wishart DS. 2005. Circular genome visualization and exploration using CGView. Bioinformatics 21:537–539. doi: 10.1093/bioinformatics/bti054
- 76. Darling AE, Mau B, Perna NT. 2010. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS ONE 5:e11147. doi: 10.1371/journal.pone.0011147
Associated Data
Supplementary Materials
Predictive effect of mutations in the neuA gene between ST213 and ST222.
Metadata for ST213/222 and ST1.
Average nucleotide identity for all ST213/222 isolates and ST1 isolates.
Genes only found in C217 (ST213) compared to the two PacBio genomes (ST213 and ST222) and genes only found in C175 (ST222) compared to the two ST213 PacBio genomes.
Presence/absence matrix of genes generated by Roary, with genome annotations from Prokka for all genomes.
Metadata for isolates used to generate the ST phylogeny (Fig. 2a).
Presence/absence matrix of Toronto 20052 (NZ_CP012019.1) locus tags identified in all genomes analyzed.
Additional information regarding methods and results; legends for supplemental figures and tables.
Figures S1-S4.
Data Availability Statement
Illumina data sequenced in-house or received from state partners were submitted to the NCBI Sequence Read Archive (SRA) under BioProject PRJNA861313. NCBI BioProject PRJNA931318 contains the complete, error-corrected, PacBio-generated genomes sequenced in this study. Isolates used in this analysis are available upon request. Code for this paper is available at https://github.com/jennahamlin/codeForManuscript-ST213_222.




