Journal of Clinical Microbiology. 2021 Jan 21;59(2):e00967-20. doi: 10.1128/JCM.00967-20

A Bioinformatic Pipeline for Improved Genome Analysis and Clustering of Isolates during Outbreaks of Legionnaires' Disease

Wolfgang Haas, Pascal Lapierre, Kimberlee A Musser
Editor: John P Dekker
PMCID: PMC8111141  PMID: 33239371

KEYWORDS: Legionella pneumophila, next-generation sequencing

ABSTRACT

Legionnaires’ disease, a severe lung infection caused by the bacterium Legionella pneumophila, occurs as single cases or in outbreaks that are actively tracked by public health departments. To determine the point source of an outbreak, clinical isolates need to be compared to environmental samples to find matching isolates. One confounding factor is the genome plasticity of L. pneumophila, making an exact sequence comparison by whole-genome sequencing (WGS) challenging. Here, we present a WGS analysis pipeline, LegioCluster, that is designed to circumvent this problem by automatically selecting the best matching reference genome prior to mapping and variant calling. This approach reduces the number of false-positive variant calls, maximizes the fraction of all genomes that are being compared, and naturally clusters the isolates according to their reference strain. Isolates that are too distant from any genome in the database are added to the list of candidate references, thereby creating a new cluster. Short insertions or deletions are considered in addition to single-nucleotide polymorphisms for increased discriminatory power. This manuscript describes the use of this automated and “locked down” bioinformatic pipeline deployed at the New York State Department of Health’s Wadsworth Center for investigating relatedness between clinical and environmental isolates. A similar pipeline has not been widely available for use to support these critically important public health investigations.

INTRODUCTION

Legionnaires’ disease or legionellosis is a severe, sometimes fatal, pneumonia caused by the Gram-negative coccobacillus Legionella pneumophila. The bacteria are omnipresent in fresh or brackish water and have been isolated from cooling towers, potable water, water heaters, shower heads, faucets, and hot tubs (1–3). Dispersion to human subjects usually occurs through contaminated aerosols, while aspiration or human-to-human transmission has been rarely reported (2–4). Risk factors for the disease are older age, impaired immunity, smoking, and abnormal pulmonary function (3).

Because of the environmental source of the infection, cases of Legionnaires’ disease can occur in outbreaks that involve numerous patients in a relatively short period of time. During an outbreak investigation, public health departments collect environmental and clinical samples to identify the source of the contaminated water and to stop the outbreak. This requires fast and efficient means to compare a large number of bacterial isolates to find a match between the clinical and environmental isolates. Whole-genome sequencing (WGS) is frequently used for this purpose, with the DNA sequencing reads from the isolates in question being compared to a common reference genome to determine the number of single-nucleotide polymorphisms (SNPs). A phylogenetic tree or a SNP matrix and minimum spanning tree (MST) are then used to compare all isolates with each other. Isolates with ≤5 SNPs can be considered to be identical or closely related. Although SNP-based analyses have a very high discriminatory power (5), pinpointing the source of an outbreak can be challenging when an endemic strain is widely distributed within a densely populated area. For example, an investigation of an outbreak of Legionnaires’ disease that occurred in New York City in 2015 found that clinical and various environmental isolates differed by as little as 0 to 2 SNPs (4), suggesting that added discriminatory resolution is highly advantageous.

Several SNP-based pipelines have been published to analyze outbreak isolates from various species, including L. pneumophila (5–9). Most of these pipelines use only one genome as the mapping reference for all isolates. The advantage of this approach is that all isolates can be readily compared with each other. A disadvantage is that plasmids, bacteriophages, or other mobile genetic elements (MGEs) that are present in one strain but not the other are ignored in the analyses. In species with a high propensity for recombination, such as L. pneumophila, two distinct bacterial lineages would differ not only in the number of SNPs but also in the extent of genetic rearrangements, which can lead to incorrectly mapped reads and false-positive SNP calls (5, 8). One way to alleviate this problem is to select a closely related reference genome for mapping of each isolate. Ideally, this would consider the entire genome, including MGEs, optimize the mapping, and reduce the number of false-positive SNPs (5, 10). An added advantage is that each reference would be the nucleus for a cluster of isolates, such that two isolates in different clusters would be, by definition, not closely related. If an isolate was too distinct from any of the available references, it could be added to the candidate reference pool to form a new cluster with subsequent isolates (5).

Retrospective outbreak studies generally analyze all isolates in a single batch, followed by additional data-processing steps. For example, one study described masking a SNP position in all isolates because it failed the quality control requirements in one isolate (11). However, during an actual outbreak or for overall jurisdictional surveillance for which all isolated Legionella sp. strains are sequenced, samples generally become available at irregular intervals for a period of several weeks, months, or decades (for facilities with persistent Legionella problems). The continuous reanalysis and reissuing of updated reports to epidemiologists are impractical; therefore, an alternative is to forego any postprocessing steps and to mask only low-quality SNP calls in the genome in question. The outcome is a potential undercount of SNPs for a particular pair of genomes, but this is offset by stable data and a higher resolution across all isolates in one cluster. Quality control metrics, such as coverage and N50 values, can be used to readily identify and flag for resequencing those isolates whose WGS data do not meet minimum quality standards.

Here, we introduce the Legionella Clustering (LegioCluster) pipeline, which was designed to improve SNP-based genome comparisons in three ways. (i) Short insertion/deletion (indel) events are considered in addition to SNPs to increase the discriminatory power of the analysis. In this context, an indel event is defined as a single insertion or deletion of 1 to 100 bases under the assumption that they were inserted (deleted) during a single event. A mutation event (ME) is then either a single SNP or indel event. Larger indels are considered separately by the pipeline; a high percentage of unmapped reads indicates large insertions or plasmids that are absent from the reference (see below). A large region of unmapped bases in the reference sequence indicates one or more large deletions in the query. The pipeline outputs the percentage of unmapped query reads and the number of unmapped reference bases in the final report. (ii) A reference for each isolate is automatically selected from a number of candidate genomes. Instead of using the same reference genome for mapping of all isolates, a kmer-based method is used to select the best fitting reference prior to mapping and adding the isolate to the reference’s cluster. (iii) New reference genomes are added if at least one of two conditions is met, i.e., the number of MEs exceeds a threshold value, indicating that the isolate being analyzed and the nearest reference are too distant, or there is a high percentage of unmapped reads, which suggests the presence of genetic material in the isolate that is absent from the reference. Either one of those conditions indicates that no closely related reference can be found, resulting in the addition of the new isolate's genome to the pool of potential references and the formation of a new cluster centered on this isolate. Subsequent isolates can then be mapped to the new reference and added to the corresponding cluster.

A similar workflow was described in a study by David et al. (5) that compared various WGS-based typing schemes for Legionella. Although the authors acknowledged the advantages of SNP-based methods, they ultimately favored an extended multilocus sequence typing (MLST) approach due to its ease of use and backward compatibility with simpler MLST schemes (5). While those authors presented valid reasons for selecting MLST, we prefer the SNP- or ME-based approach described here due to its greater discriminatory power. This is important for outbreak analysis in a high-incidence state such as New York, with numerous endemic strains in different metropolitan areas. An added bonus is that a SNP- or ME-based approach does not rely on a central database for type designations.

Using actual reads from the 2015 New York City outbreak and a 2018 study that investigated the long-term persistence of Legionella in health care facilities, we show that our new data analysis pipeline provides better resolution, which is especially important in regions in which strains are endemic. Being able to precisely compare clinical and environmental isolates with each other can facilitate epidemiological efforts to control future outbreaks. The pipeline described here automates the comparison and clustering of bacterial isolates and allows scientists with little training in bioinformatics to perform the data analysis. Although the workflow was initially developed for Legionella, it has since been applied to numerous other bacterial species, including Acinetobacter baumannii, Escherichia coli, Pseudomonas aeruginosa, and Staphylococcus aureus.

MATERIALS AND METHODS

The LegioCluster pipeline.

Sequencing adapters and poor-quality sequences were removed from raw Illumina reads with Trimmomatic version 0.38 (12). Mash version 1.1 (13) was used to compare the reads to the genomes of 52 representative species to detect incorrect species assignments and gross contamination. The de novo assembly of reads into contigs was performed using SPAdes version 3.12.0 (14). All contigs were subject to a check for contamination using Minikraken version 1.1 (15). Contigs that were not of the expected genus were flagged for further analysis.

The contigs were compared to a number of different L. pneumophila strains with Mash to select the best reference genome for mapping. The genome with the highest Mash score was used for mapping with BWA version 0.7.17 and SAMtools version 1.9 (16–18). Duplicate reads were flagged with Picard MarkDuplicates version 2.18.16 (http://broadinstitute.github.io/picard). In cases in which two (or more) genomes had similarly high Mash scores, reads were mapped to both genomes and the one with the highest percentage of mapped reads was selected for further analyses. FreeBayes version 1.0.2 (19), vcflib version 1.0.0_rc1 (https://github.com/vcflib/vcflib), and BCFtools mpileup version 1.9 (17) were used to call and to assess the quality of individual SNPs, as well as short indel events encompassing 1 to 100 bases. Jointly, these variations are referred to as MEs. Base positions that did not meet the quality thresholds for an ME were masked as “n” in the sequence alignment for that isolate, while unmapped bases were masked as “N.”

If the number of MEs was below 30,000 and the proportion of mapped reads was above 90.0%, then the isolate was placed into the same cluster as the reference genome. The threshold of 30,000 MEs was determined empirically for L. pneumophila based on isolates available at the Wadsworth Center from various previous outbreaks. The number represents approximately 0.895% of the L. pneumophila median genome size and correspondingly changes with the species.
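The way the ME threshold scales with genome size can be illustrated with a short calculation. This is a minimal sketch and not code from LegioCluster: the function name `me_threshold` and both example genome sizes are assumptions made for illustration (the first size is simply back-calculated from the 30,000 ME threshold), and only the 0.895% fraction is taken from the text.

```python
def me_threshold(median_genome_size_bp: int, fraction: float = 0.00895) -> int:
    """Return an ME cutoff as a fixed fraction of the median genome size."""
    return round(median_genome_size_bp * fraction)


# Hypothetical genome sizes in bp; illustrative values, not measured medians.
example_sizes = {
    "size implied by the 30,000 ME cutoff": 3_352_000,
    "a species with a 5-Mb genome": 5_000_000,
}

for label, size in example_sizes.items():
    print(f"{label}: {me_threshold(size):,} MEs")
```

Under this assumption, a species with a larger genome would receive a proportionally larger cutoff, although in practice the value would still need to be checked empirically against known outbreak isolates.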

An ME matrix was constructed (or updated) from all isolates within a cluster by comparing all reference-aligned sequences pairwise and counting MEs. Ambiguous or missing bases, masked as n or N, were ignored in each pairwise comparison. The updated ME matrix was used to generate an MST following Prim's algorithm (20).
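The matrix and MST construction can be sketched as follows. This is a simplified illustration under stated assumptions, not the LegioCluster implementation: the function names and the toy sequences are hypothetical, and differences are counted per aligned position, so a multi-base indel would first have to be collapsed into a single event to match the ME definition used here.

```python
from itertools import combinations


def count_differences(seq_a: str, seq_b: str) -> int:
    """Count differing positions, skipping any position masked as n or N."""
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a not in "nN" and b not in "nN" and a != b
    )


def me_matrix(aligned: dict) -> dict:
    """Pairwise difference counts for all isolates in one cluster."""
    matrix = {}
    for x, y in combinations(sorted(aligned), 2):
        d = count_differences(aligned[x], aligned[y])
        matrix[(x, y)] = matrix[(y, x)] = d
    return matrix


def prim_mst(names: list, matrix: dict) -> list:
    """Prim's algorithm: grow the tree from names[0] by the cheapest edge."""
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        u, v = min(
            ((u, v) for u in in_tree for v in names if v not in in_tree),
            key=lambda edge: matrix[edge],
        )
        edges.append((u, v, matrix[(u, v)]))
        in_tree.add(v)
    return edges


# Toy reference-aligned sequences (hypothetical); n/N mark masked positions.
aligned = {
    "ref":  "ACGTACGTAC",
    "iso1": "ACGTACGTAT",
    "iso2": "ACGnACGTAT",
    "iso3": "TCGTACGNAC",
}
matrix = me_matrix(aligned)
mst_edges = prim_mst(sorted(aligned), matrix)
for u, v, weight in mst_edges:
    print(f"{u} -- {v}: {weight}")
```

Because masked positions are skipped pair by pair, a low-quality base in one isolate lowers the count only for comparisons involving that isolate, which is the trade-off described in the Introduction.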

The pipeline accounted for horizontal gene transfer and recombination in Legionella by adding a query genome to the list of candidate references and creating a new cluster if at least one of the following two conditions was met: (i) The ME count was above 30,000, which was considered an indication that the query was too distant from even the nearest reference for meaningful mapping and ME determination. (ii) The percentage of unmapped reads exceeded 10%, which was interpreted as a sign that the query genome contained significant quantities of DNA that were absent from the reference genome. If the new isolate's genome assembly passed all quality control checks for new candidate references, then it was made available as a potential reference for subsequent isolates.
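The two conditions can be expressed as a single decision rule. The sketch below is illustrative only: the function and constant names are assumptions, and the handling of values that fall exactly on a threshold is not specified in the text, so the choice made here (boundary values trigger a new cluster) is also an assumption.

```python
ME_THRESHOLD = 30_000   # empirical cutoff for L. pneumophila (from the text)
MIN_MAPPED_PCT = 90.0   # equivalent to at most 10% unmapped reads (from the text)


def joins_existing_cluster(me_count: int, mapped_pct: float) -> bool:
    """True if the isolate is placed in the cluster of its mapping reference."""
    return me_count < ME_THRESHOLD and mapped_pct > MIN_MAPPED_PCT


def needs_new_cluster(me_count: int, mapped_pct: float) -> bool:
    """True if the isolate becomes a candidate reference for a new cluster."""
    return not joins_existing_cluster(me_count, mapped_pct)


# Isolate 3_J (values from Results): few MEs, but too many unmapped reads.
print(needs_new_cluster(6_373, 87.8))
# Hypothetical isolate that is close to its reference and maps well.
print(needs_new_cluster(1_806, 99.0))
```

In the pipeline, a genome flagged in this way would still have to pass the quality control checks for new candidate references before being added to the pool.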

Finally, a phylogenetic tree was generated using Parsnp version 1.2 (21) and the Newick Utilities version 1.6 (22). Parsnp uses the core genomes of all isolates within a cluster or, if a new reference was added, all candidate reference genomes.

As indicated above, reads, genome assemblies, and new references were subjected to several quality control steps that included FastQC version 0.11.8 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc), SAMtools version 1.9 (17), Qualimap version 2.2.2b (23), and Quast version 5.0.2 (24). Depending on severity, failing a quality control check resulted in either a warning to the user or an early termination of the analysis for the isolate in question.

Pipeline validation.

A custom Python script was used to introduce 55,555 unique MEs into the genome of Legionella pneumophila F4468. MEs included base substitutions (80% of all MEs), single-base indels (8% each), and indels in blocks of 10, 20, or 30 bp (2% each). Indels were surrounded by mutation-free regions to prevent two mutations from cancelling each other out. The MEs were spread over 25 FASTA files to simulate bacterial isolates containing 1, 10, 100, 1,000, or 10,000 unique MEs, relative to strain F4468. The program ART_Illumina version 2.5.8 (25) was used to generate artificial, paired Illumina reads of 250-bp length. The artificial reads were processed by the LegioCluster pipeline using either 150,000, 250,000 or all 625,000 reads. Preliminary results showed that at least 150,000 reads were required to obtain minimally acceptable genome coverage, while 625,000 reads correspond to a theoretical coverage of 100-fold.
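The general idea of seeding a genome with well-separated mutations can be sketched as follows. This is not the custom script used in the study: the function name `mutate`, the 50-bp spacing, the 80/10/10 weighting, and the toy genome are all assumptions, and block indels are omitted for brevity.

```python
import random


def mutate(genome: str, n_events: int, spacing: int = 50, seed: int = 0):
    """Introduce n_events substitutions or 1-bp indels, >= spacing bp apart."""
    rng = random.Random(seed)
    # One position per non-overlapping window keeps events from interacting.
    windows = rng.sample(range(len(genome) // spacing), n_events)
    positions = sorted((w * spacing for w in windows), reverse=True)
    seq = list(genome)
    events = []
    # Edit from the end of the sequence so that upstream coordinates stay valid.
    for pos in positions:
        kind = rng.choices(["sub", "ins", "del"], weights=[80, 10, 10])[0]
        if kind == "sub":
            seq[pos] = rng.choice([b for b in "ACGT" if b != seq[pos]])
        elif kind == "ins":
            seq.insert(pos, rng.choice("ACGT"))
        else:
            del seq[pos]
        events.append((pos, kind))
    # Return events in ascending coordinate order (original coordinates).
    return "".join(seq), events[::-1]


genome = "ACGT" * 500  # toy 2,000-bp sequence, not a real genome
mutated, events = mutate(genome, n_events=10)
print(len(genome), len(mutated), events)
```

Recording each event with its original coordinate provides the list of known MEs against which the pipeline's variant calls can later be scored.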

After all samples completed the pipeline, the VCF files created by FreeBayes were compared to the lists of known ME calls for each sample. A true-positive (TP) result is defined as an ME in the genome that was correctly identified by the pipeline. A false-positive (FP) result is an ME suggested by FreeBayes that is not based on an actual ME. A true-negative (TN) result is a base that is not mutated in the VCF file or in the input sequence, while a false-negative (FN) result is an ME in the input sequence that was not reported as such by the pipeline. The sensitivity, the proportion of actual positive results that were correctly identified, was calculated as follows: sensitivity = TP/(TP + FN). Specificity, the proportion of actual negative results that were correctly identified, was calculated as follows: specificity = TN/(TN + FP).
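The two formulas can be applied to sets of known and called MEs as in the sketch below. The representation is an assumption made for illustration: each ME is reduced to a single genome position, every remaining base counts as a potential negative, and the function name and toy numbers are hypothetical.

```python
def sensitivity_specificity(truth: set, called: set, genome_length: int):
    """Return (sensitivity, specificity) from known and called ME positions."""
    tp = len(truth & called)             # MEs introduced and reported
    fp = len(called - truth)             # reported but never introduced
    fn = len(truth - called)             # introduced but missed
    tn = genome_length - tp - fp - fn    # unmutated bases left uncalled
    return tp / (tp + fn), tn / (tn + fp)


# Toy example: 100 known MEs in a 1,000-bp genome, 3 of which are missed.
truth = set(range(0, 1000, 10))
called = truth - {0, 500, 990}
sens, spec = sensitivity_specificity(truth, called, genome_length=1000)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

Because true negatives vastly outnumber all other categories in a multi-megabase genome, specificity stays close to 1.0 unless false-positive calls are very numerous, whereas sensitivity responds directly to missed calls.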

Data availability.

Raw Illumina reads were downloaded from the SRA database at NCBI under the BioProject accession numbers PRJNA345011 (New York City outbreak) and PRJNA483184 (Health Facility study). Genomes for the Legionella pneumophila reference strains Paris (CR628336.1), Philadelphia-1 (CP013742.1), and F4468 (CP014760) were downloaded from GenBank. The LegioCluster software (version 1.0.0) is available at GitHub (https://github.com/WHaasNY/LegioCluster).

RESULTS

The LegioCluster pipeline evolved out of a data analysis pipeline used previously at the Wadsworth Center to investigate outbreaks of Legionnaires’ disease. The new pipeline was designed to improve the resolution and accuracy of WGS-based isolate comparisons and clustering. Using read files from two earlier studies (4, 26), we demonstrate how LegioCluster works and how it improves on the previous pipeline.

Pipeline validation results.

Pipeline validation was performed with various numbers of artificial reads created from simulated genome sequences that contained between 1 and 10,000 unique MEs (see Table S1 in the supplemental material). On the basis of 10,000 MEs and five independent experiments, the average sensitivity was 0.899, 0.965, and 0.971 when 150,000, 250,000, and 625,000 reads, respectively, were used. Standard deviations were 0.0027 or less. Specificity was 1.0 under all conditions. There were no false-positive ME calls after accounting for formatting differences between the list of MEs introduced into the simulated genome sequences and the VCF file created by the pipeline.

Addition of indel events to the SNP-based analysis.

While some investigations count only SNPs, the pipeline described here also considers indels to obtain a better resolution to separate two isolates. Since indels can affect several bases at once, any insertion or deletion of 1 to 100 bases was counted as one event. Larger regions are listed separately in the pipeline's report. The sum of SNPs and indel events was reported as the number of MEs.
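The event-based counting described above can be sketched for VCF-style REF/ALT pairs. This is an illustration under stated assumptions rather than the pipeline's own code: the function names and example records are hypothetical, and multi-base substitutions or complex records are placed in a separate "other" category that this sketch does not count.

```python
def classify_variant(ref: str, alt: str, max_indel: int = 100) -> str:
    """Classify one REF/ALT pair as 'snp', 'indel_event', 'large', or 'other'."""
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    size = abs(len(ref) - len(alt))
    if size == 0:
        return "other"  # e.g. multi-base substitution; outside this sketch
    return "indel_event" if size <= max_indel else "large"


def count_mes(variants) -> int:
    """Number of MEs: each SNP and each short indel counts as one event."""
    return sum(
        1 for ref, alt in variants
        if classify_variant(ref, alt) in ("snp", "indel_event")
    )


# Hypothetical records: (REF, ALT)
variants = [
    ("A", "G"),               # SNP
    ("A", "ATTG"),            # 3-bp insertion, one event
    ("ACGTACGTAC", "A"),      # 9-bp deletion, one event
    ("A", "A" + "T" * 150),   # 150-bp insertion, reported separately
]
print(count_mes(variants))
```

Counting a 9-bp deletion as one event rather than nine differences keeps a single mutational step from dominating the distance between two otherwise identical isolates.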

To demonstrate the increased resolution of this approach, we reinvestigated L. pneumophila isolates collected during an outbreak in New York City in 2015. In the initial study, a SNP-based MST was constructed that revealed individual isolates as well as three clusters of 6, 7, and 46 isolates. The isolates within these clusters were indistinguishable from each other based on SNP counts alone (4). Considering indel events in addition to SNPs allowed us to further differentiate within the three clusters, such that the 6 isolates from the East Bronx cluster were broken into two subclusters, the 7 isolates that included isolate 09-214 were grouped into three subclusters, and the 46 isolates of the South Bronx cluster were grouped into six subclusters and a new cluster, as discussed below (Fig. 1).

FIG 1.

MSTs for L. pneumophila isolates from the 2015 outbreak in New York City. In a previous analysis relying on SNP differences alone, three clusters of isolates emerged whose members were indistinguishable from each other. Adding the number of indel events to the SNP count increased the resolution and allowed the formation of subclusters. One isolate, 15-190, became the nucleus for a new cluster. Numbers in red indicate the numbers of indel events.

Automated selection of reference strains.

Variant calling requires the comparison of the isolate of interest to a reference strain. By default, only those regions of DNA that are present in both genomes are subjected to the comparison. DNA that is unique to one of the two strains, such as a plasmid, is ignored. In addition, as the evolutionary distance between strains increases, so does the number of variants and with it the potential for false-positive calls. Therefore, it is preferable to select a reference strain that is as closely matched to the isolate as possible. To accomplish this in an automated fashion, the Mash program was used to compare kmers from the isolate's genome to kmers from a number of candidate reference genomes to select the most suited reference for mapping and ME calling.
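The principle of kmer-based reference selection can be sketched with exact kmer sets. Note that this is a deliberate simplification: Mash estimates similarity from MinHash sketches rather than from complete kmer sets, and the function names, the kmer size, and the toy sequences below are assumptions made for illustration.

```python
def kmers(seq: str, k: int) -> set:
    """All overlapping kmers of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Fraction of kmers shared between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_reference(query: str, candidates: dict, k: int = 5):
    """Return the candidate with the highest kmer similarity to the query."""
    query_kmers = kmers(query, k)
    scores = {
        name: jaccard(query_kmers, kmers(seq, k))
        for name, seq in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy candidate references (hypothetical sequences, far shorter than genomes).
candidates = {
    "ref_a": "ACGTTGCAAGGCTTAACCGGATCGATCGTTAGC",
    "ref_b": "TTTTGGGGCCCCAAAATTGGCCAATGCATGCAT",
}
# The query is ref_a with a single substitution at position 10.
query = candidates["ref_a"][:10] + "T" + candidates["ref_a"][11:]
best, score = pick_reference(query, candidates)
print(best, round(score, 3))
```

With real genomes, a larger kmer size and sketch-based estimation would be needed to keep the comparison fast, which is the role Mash plays in the pipeline.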

Legionella isolates from three facilities that were analyzed in a 2018 study (26) were used to demonstrate the advantages of this approach. When all isolates were compared to a single reference (strain Philadelphia-1), the resulting MST was large and not easily interpretable. An edge label with 67,000 MEs indicated the presence of at least two subclusters (Fig. 2a). In contrast, selecting the best matching reference first and then performing the ME analysis resulted in three distinct clusters, i.e., Philadelphia-1, Paris, and 3_J (Fig. 2b). Isolate 3_J formed a new cluster, as described below.

FIG 2.

MSTs showing isolates from three health care facilities. Numbers indicate the number of MEs. The reference used for mapping and ME calling is shown in orange. (a) All isolates were mapped to reference strain Philadelphia-1 prior to variant calling. (b) Isolates were mapped to the best matching reference selected by the pipeline prior to variant calling. Isolate 3_J was initially mapped to reference strain Philadelphia-1 (dashed line) but subsequently became the nucleus for a new cluster. The name of the isolate that was most recently added to the MST is shown in a box at the top of the MST, while the reference strain for each cluster is shown in an orange box; all other isolates within the cluster are shown as oval nodes.

Selecting a better matching reference not only allowed for better visualization and naming of clusters but also decreased the number of MEs for almost all pairwise comparisons of isolates within the same cluster. For example, when the genomes of isolates 2_C and 2_D were aligned to reference strain Paris, they were separated by 1,806 and 1,807 MEs, respectively, from the reference strain and by only 1 ME from each other. In contrast, when strain Philadelphia-1 was used as the common reference, more than 80,000 MEs separated the two strains from the reference and 101 MEs separated the two isolates from each other. The 100 additional MEs were located in regions where the mapping of reads indicated large-scale genetic differences between the query and reference genomes, indicating that these 100 MEs are false-positive results (data not shown).

Addition of new references.

In order to account for isolates that are too distant from all reference genomes in the database or that contain large quantities of unmapped reads, the pipeline is able to add new genomes to the pool of candidate references so that they are available as references for subsequent isolates. Figure 2b shows an example in which isolate 3_J was mapped to reference Philadelphia-1, resulting in 6,373 MEs. However, only 87.8% of the 3_J reads could be mapped to the Philadelphia-1 genome, suggesting that the isolate possesses unique DNA that is absent from the reference. Therefore, the 3_J genome was placed in the pool of candidate references, and subsequent isolates (3_K through 3_N) were mapped to this new reference. A follow-up analysis revealed that strain Philadelphia-1 and the isolates in the 3_J cluster contained unique DNA fragments (data not shown), including a 52-kb plasmid that was present in 3_J but missing from Philadelphia-1.

Solving potential contamination issues.

One problem with WGS is that its high sensitivity also picks up foreign DNA that is introduced through a mixed bacterial sample, contaminated equipment or reagents, or other sources. In most cases, low-level contamination results in short contigs with low coverage that can be easily removed in silico.

However, higher levels of contamination can pose a problem. As shown in Table 1, a sample from the New York City outbreak, isolate 15-190, was heavily contaminated with Pseudomonas DNA. Mash, which had been used to confirm the correct species, indicated an unusually large number of Pseudomonas reads. In addition, the estimated genome length was unusually high. Since only 82% of the reads could be mapped to the reference strain, this sample was automatically designated a new candidate reference strain, placing this isolate into its own cluster and separating it from the others (Fig. 1).

TABLE 1.

Comparison between a typical sample (15-191) and a sample that was heavily contaminated (15-190)

| Parameter | Isolate 15-191 | Isolate 15-190 |
|---|---|---|
| Mash winner: species and no. of matching kmers/total no. of kmers | Legionella pneumophila, 203/400 | Legionella pneumophila, 86/400 |
| Mash runner-up: species and no. of matching kmers/total no. of kmers | Legionella micdadei, 4/400 | Pseudomonas aeruginosa, 17/400 |
| Genome length (bp) | 3,401,715 | 3,758,008 |
| Mapped reads (%) | 99.99 | 81.77 |
| Comment | Added to existing cluster | New cluster |
| Kraken/BLASTn finding | No contamination | 350 kb from Pseudomonas |

A follow-up analysis with Kraken found that several contigs were more similar to Pseudomonas pseudoalcaligenes than to L. pneumophila, confirming the initial Mash results. This example demonstrates how a contaminated sample that managed to pass all quality checks was nevertheless separated from the other isolates by virtue of designating it a new reference and placing it in a new cluster. Had it been truly a new strain, other isolates might have been clustered with it, as shown for isolate 3_J.

DISCUSSION

WGS has replaced other methods as the tool of choice for outbreak investigations and surveillance because it provides much higher resolution and results are easily transferrable from one laboratory to another (3, 27). However, MGEs can cause problems in SNP- or ME-based analyses because genetic material that is present in one isolate but not another will not be part of the pairwise comparison. Recombination events can also introduce numerous MEs at once that are not subject to the molecular clock hypothesis (28). Even reshuffling of the genome can lead to the incorrect mapping of reads, leading to false-positive MEs. A seemingly simple solution to these problems is the selection of a reference strain that is as similar to the isolate under investigation as possible. Isolates with the same reference fall into the same cluster and are more likely to have undergone the same recombination events, to contain the same plasmids, and to be separated by only a few MEs. Isolates that are too distant from the reference, either by virtue of the number of MEs or because there are too many unmapped reads, fall into a different cluster. Reporting that an outbreak was caused by a strain that is similar to a known reference strain is more intuitive and flexible than a whole-genome MLST scheme that must be curated in a central repository (5, 8). Since public health laboratories primarily investigate local or statewide outbreaks and surveillance, the advantages of the method described here outweigh any benefits to having a central repository. As new lineages of outbreak strains emerge, the genomes of these new cluster-defining reference strains can be easily added to the database and shared among public health laboratories.

Here, we describe a data analysis pipeline, LegioCluster, that has been optimized for WGS data from bacterial species that are well known for horizontal gene transfer and recombination. Although the pipeline was initially developed to work with L. pneumophila data, it is species agnostic and can be easily expanded to work with any bacterial species that are of interest to a public health or clinical laboratory. The pipeline was validated by introducing between 1 and 10,000 MEs in silico into the genome sequence of strain F4468, generating artificial Illumina reads, processing those reads with the pipeline, and determining the sensitivity and specificity based on the resulting ME counts (see Table S1 in the supplemental material). The average sensitivity was 0.899, 0.965, and 0.971 when 150,000, 250,000 and 625,000 reads, respectively, were used. This indicates that, while more reads generally improve data quality, the added benefit diminishes at higher numbers of reads and might not warrant the extra cost, time, and computing demands. Specificity was 1.0 under all conditions. These sensitivity and specificity values are on par with or slightly better than those reported for the SNVPhyl pipeline (29). It should be noted that FreeBayes outputs all candidate MEs and leaves it up to the user to apply quality filters to separate plausible mutations from the rest. Changing some or all quality filtering steps could increase sensitivity while at the same time increasing the potential for false-positive ME calls. While a visual inspection of the mapped reads is impractical (and unnecessary) in the case of hundreds or thousands of ME calls, it can be useful to remove any doubts about the validity of individual ME calls.

A key feature of this pipeline is that it selects, from various candidates, the reference genome that is best suited for mapping and then calls SNPs as well as short indel events. Indel events, defined here as the insertion or deletion of 1 to 100 bases, were considered in addition to SNPs to increase the discriminatory power of the analyses. An investigation of an outbreak of Legionnaires’ disease in New York City in 2015 showed that a SNP-based approach alone was unable to differentiate between several isolates, despite the high resolution of WGS analyses (3, 4). Adding the number of indel events to the SNP count provided the extra information needed to distinguish isolates that were otherwise considered to be identical (Fig. 1). While a single indel event might not appear significant, it can affect analyses in an outbreak investigation that involves an endemic strain that is widely distributed in a densely populated area (4). A high-resolution fingerprint of an outbreak isolate is important not only from a public health point of view but also from a legal perspective. Numerous cases related to Legionnaires’ disease are brought to court. Attributing a clinical isolate (case) to a specific source beyond a reasonable doubt requires methods that can distinguish even between closely related isolates.

A potential disadvantage of the inclusion of indels is that it can increase the number of false-positive variants, especially when the reference is distantly related and reads are incorrectly mapped due to previous genome rearrangements (5). However, this problem can be alleviated in two ways. (i) Counting events, rather than individual bases, not only reflects more accurately the acquisition of these variants, but also limits their overall contribution to the number of MEs. (ii) The selection of a closely matched reference also significantly reduces the potential for false-positive variant calls. The isolates shown in Fig. 2 were indirectly compared with each other through a common reference strain, which was either the Philadelphia-1 strain or the Paris strain. Using the 2_C and 2_D isolate pair as an example, the more distantly related reference Philadelphia-1 suggested 100 additional ME calls between the two isolates, compared to the more closely related reference strain Paris, which revealed only 1 ME. The difference in the ME count can be explained as the result of mapping errors following recombination events and differences in read coverage between two isolates (8). Variants due to incorrectly mapped reads can pass quality control filters if the coverage is sufficiently high, while the same position would be masked as ambiguous and ignored if the coverage was too low. Therefore, two isolates with otherwise identical sequences might have different numbers of ME calls because of differences in their coverage. The example in Fig. 2 shows that, by selecting a better matching reference, the propensity for false-positive ME calls can be greatly reduced.

A common practice has been to disregard SNPs that are located too close to each other because they were presumably acquired through horizontal gene transfer and therefore are not subject to the molecular clock hypothesis (11, 28). One problem with this approach is that it eliminates potentially important information. The other problem is that it requires masking of the same position in all isolates, which is easy to do in a retrospective study when all isolates are available. During an actual outbreak, however, when samples become available only over a period of weeks or months, each new batch of isolates would require the reanalysis of all isolates, which is clearly not practicable. Selecting an optimal reference genome from a database that can accommodate new references eliminates the need for masking since isolates with identical or similar genomes will be more likely to have undergone the same horizontal gene transfer events. The pipeline generates a plot of the distribution of ME calls relative to the reference genome to highlight ME hot spots that indicate possible recombination events. It is important to note that the threshold values that determine when an isolate becomes part of a new cluster need to be chosen with care. Generating too many clusters can pose the risk that two isolates in different clusters are more similar to each other than to isolates in the same cluster, while having too few clusters sacrifices resolution.

The high resolution of WGS analysis also means that contaminating DNA becomes more obvious. The pipeline uses several checks to alert the user to genomic material of the wrong species. The first check is done with Mash, which compares the reads to a number of genomes from known species and terminates the pipeline if the expected species cannot be confirmed. In the case of isolate 15-190 (Table 1), the degree of contamination was not sufficient to trigger a Mash-induced termination. However, the large amount of foreign DNA that remained unmapped resulted in the isolate being placed into its own new cluster, thereby effectively quarantining it from the remaining isolates. Contigs smaller than 1 kb and with coverage of less than 7.5-fold were removed from the isolate's genome sequence since our previous experience showed that those were frequently due to low-level contamination. In addition, the origin of each contig was confirmed with Kraken to alert the user to the presence of larger contamination issues.
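The contig filter described above can be sketched as follows. The sketch assumes SPAdes-style contig headers of the form `NODE_1_length_250000_cov_35.2`; the function names are hypothetical. Because the sentence can be read as requiring both criteria or either one, the combination is exposed as the `require_both` parameter, with the literal reading (both) as the default.

```python
import re
from typing import Iterable, List

# Matches the length and coverage fields of a SPAdes-style contig header.
_HEADER = re.compile(r"length_(\d+)_cov_([\d.]+)")


def is_low_confidence(header: str,
                      min_length: int = 1000,
                      min_coverage: float = 7.5,
                      require_both: bool = True) -> bool:
    """Return True if a contig is short and/or thinly covered."""
    match = _HEADER.search(header)
    if match is None:
        raise ValueError(f"unexpected contig header: {header!r}")
    too_short = int(match.group(1)) < min_length
    too_thin = float(match.group(2)) < min_coverage
    if require_both:
        return too_short and too_thin
    return too_short or too_thin


def keep_contigs(headers: Iterable[str], **kwargs) -> List[str]:
    """Return the headers of contigs that pass the filter."""
    return [h for h in headers if not is_low_confidence(h, **kwargs)]


contigs = [
    "NODE_1_length_250000_cov_35.2",
    "NODE_57_length_800_cov_2.1",     # short and thinly covered
    "NODE_58_length_600_cov_120.0",   # short but deeply covered
    "NODE_59_length_5000_cov_3.0",    # long but thinly covered
]
print(keep_contigs(contigs))
print(keep_contigs(contigs, require_both=False))
```

Under the default reading only `NODE_57` is removed; with `require_both=False` only `NODE_1` is kept.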

While working with genome sequences from Enterobacteriaceae, we noted that some plasmids can interfere with accurate analysis. The Kraken results can falsely indicate a contamination issue if a plasmid has been isolated from multiple species and a species other than that of the isolate is listed first in the Kraken output. In such cases, a BLAST search would usually show that the contig in question appears in various species. Another potential problem is due to high-copy-number plasmids that yield contigs with very high coverage. This can result in a large percentage of unmapped reads if the plasmid is absent from the reference. In these cases, follow-up studies are indicated to determine whether the isolate should form a new cluster.

Although the LegioCluster pipeline automatically selects the best fitting reference genome for mapping and ME calling or adds a new candidate reference as needed, there are situations in which the number of MEs alone does not tell the complete story. As noted above, Freebayes, which is used for SNP and indel calling, was never intended to call very large indels. Therefore, the number of MEs considers only indel events that are smaller than 100 bp. For this reason, the percentage of mapped reads and the percentage of mapped bases should also be considered to determine how much the isolate differs from the reference. If either metric is significantly below 100%, then programs like Mauve (30) can be used to align two or more genomes to look for DNA regions that are unique to one strain, as shown for the 3_J cluster and strain Philadelphia-1 (Fig. 2b).
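The two checks described above can be sketched in a few lines. The 100-bp limit on indel events comes from the text; the 95% cutoff used to decide what counts as "significantly below 100%" is purely an illustrative assumption, and the function names are hypothetical rather than taken from the pipeline.

```python
from typing import Iterable, Tuple


def count_mutation_events(variants: Iterable[Tuple[str, str]],
                          max_indel_len: int = 100) -> int:
    """Count SNPs and indel events shorter than max_indel_len bp.

    Each variant is a (ref_allele, alt_allele) pair; an indel counts
    as a single event regardless of its length.
    """
    events = 0
    for ref, alt in variants:
        indel_len = abs(len(ref) - len(alt))  # 0 for a SNP
        if indel_len < max_indel_len:
            events += 1
    return events


def needs_alignment_follow_up(pct_mapped_reads: float,
                              pct_mapped_bases: float,
                              min_pct: float = 95.0) -> bool:
    """Flag an isolate if either mapping metric falls below min_pct.

    min_pct is an assumed cutoff for illustration only.
    """
    return pct_mapped_reads < min_pct or pct_mapped_bases < min_pct


calls = [("A", "G"), ("ACGT", "A"), ("A", "A" + "T" * 150)]
print(count_mutation_events(calls))            # 2 (the 150-bp insertion is excluded)
print(needs_alignment_follow_up(99.2, 98.7))   # False
print(needs_alignment_follow_up(99.0, 91.5))   # True
```

An isolate flagged in this way would be a candidate for a whole-genome alignment, for example with Mauve, to look for regions that are absent from the reference.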

The LegioCluster pipeline described here is currently being used at the New York State Department of Health to investigate isolates from outbreaks and routine surveillance of Legionella. A similar automated and “locked down” pipeline has not been widely available for use. This pipeline has been improved from an earlier available pipeline and has been designed to obtain the most accurate SNP and indel event data from a set of sequencing reads, especially for challenging bacterial genomes such as Legionella spp. Using two previous investigations as examples, we show here how the LegioCluster pipeline can improve the resolution of WGS-based analysis, reduce the number of false-positive ME calls, and result in clusters that take previous horizontal gene transfer events into account. Since new bacterial species can be easily added to the existing pipeline by providing the appropriate reference genome sequence, other pathogens have been tested as well. The pipeline has been used to investigate isolates from over a dozen different Gram-positive and Gram-negative bacterial species.

Supplementary Material

Supplemental file 1
JCM.00967-20-s0001.xlsx (10.3KB, xlsx)

ACKNOWLEDGMENTS

We thank Kay Nieselt for early input into the development of the pipeline, as well as the Wadsworth Center Advanced Genomic Technologies core for support for this testing. We also acknowledge the members of the New York State Department of Health and the Wadsworth Center who have contributed to all aspects of Legionella investigation and testing in New York.

The code development was partially supported by Public Health Emergency Preparedness grant U9OTP216988, funded by the Centers for Disease Control and Prevention.

Footnotes

Supplemental material is available online only.

REFERENCES

1. Gomez-Valero L, Rusniok C, Buchrieser C. 2009. Legionella pneumophila: population genetics, phylogeny and genomics. Infect Genet Evol 9:727–739. doi: 10.1016/j.meegid.2009.05.004.
2. Raphael BH, Baker DJ, Nazarian E, Lapierre P, Bopp D, Kozak-Muiznieks NA, Morrison SS, Lucas CE, Mercante JW, Musser KA, Winchell JM. 2016. Genomic resolution of outbreak-associated Legionella pneumophila serogroup 1 isolates from New York State. Appl Environ Microbiol 82:3582–3590. doi: 10.1128/AEM.00362-16.
3. Weiss D, Boyd C, Rakeman JL, Greene SK, Fitzhenry R, McProud T, Musser K, Huang L, Kornblum J, Nazarian EJ, Fine AD, Braunstein SL, Kass D, Landman K, Lapierre P, Hughes S, Tran A, Taylor J, Baker D, Jones L, Kornstein L, Liu B, Perez R, Lucero DE, Peterson E, Benowitz I, Lee KF, Ngai S, Stripling M, Varma JK, South Bronx Legionnaires’ Disease Investigation Team. 2017. A large community outbreak of Legionnaires' disease associated with a cooling tower in New York City, 2015. Public Health Rep 132:241–250. doi: 10.1177/0033354916689620.
4. Lapierre P, Nazarian E, Zhu Y, Wroblewski D, Saylors A, Passaretti T, Hughes S, Tran A, Lin Y, Kornblum J, Morrison SS, Mercante JW, Fitzhenry R, Weiss D, Raphael BH, Varma JK, Zucker HA, Rakeman JL, Musser KA. 2017. Legionnaires' disease outbreak caused by endemic strain of Legionella pneumophila, New York, New York, USA, 2015. Emerg Infect Dis 23:1784–1791. doi: 10.3201/eid2311.170308.
5. David S, Mentasti M, Tewolde R, Aslett M, Harris SR, Afshar B, Underwood A, Fry NK, Parkhill J, Harrison TG. 2016. Evaluation of an optimal epidemiological typing scheme for Legionella pneumophila with whole-genome sequence data using validation guidelines. J Clin Microbiol 54:2135–2148. doi: 10.1128/JCM.00432-16.
6. Ashton PM, Peters T, Ameh L, McAleer R, Petrie S, Nair S, Muscat I, de Pinna E, Dallman T. 2015. Whole genome sequencing for the retrospective investigation of an outbreak of Salmonella Typhimurium DT 8. PLoS Curr 7:ecurrents.outbreaks.2c05a47d292f376afc5a6fcdd8a7a3b6. doi: 10.1371/currents.outbreaks.2c05a47d292f376afc5a6fcdd8a7a3b6.
7. Cunningham SA, Chia N, Jeraldo PR, Quest DJ, Johnson JA, Boxrud DJ, Taylor AJ, Chen J, Jenkins GD, Drucker TM, Nelson H, Patel R. 2017. Comparison of whole-genome sequencing methods for analysis of three methicillin-resistant Staphylococcus aureus outbreaks. J Clin Microbiol 55:1946–1953. doi: 10.1128/JCM.00029-17.
8. Saltykova A, Wuyts V, Mattheus W, Bertrand S, Roosens NHC, Marchal K, De Keersmaecker SCJ. 2018. Comparison of SNP-based subtyping workflows for bacterial isolates using WGS data, applied to Salmonella enterica serotype Typhimurium and serotype 1,4,[5],12:i. PLoS One 13:e0192504. doi: 10.1371/journal.pone.0192504.
9. Wilson MR, Brown E, Keys C, Strain E, Luo Y, Muruvanda T, Grim C, Beaubrun JJG, Jarvis K, Ewing L, Gopinath G, Hanes D, Allard MW, Musser S. 2016. Whole genome DNA sequence analysis of Salmonella subspecies enterica serotype Tennessee obtained from related peanut butter foodborne outbreaks. PLoS One 11:e0146929. doi: 10.1371/journal.pone.0146929.
10. Usongo V, Berry C, Yousfi K, Doualla-Bell F, Labbé G, Johnson R, Fournier E, Nadon C, Goodridge L, Bekal S. 2018. Impact of the choice of reference genome on the ability of the core genome SNV methodology to distinguish strains of Salmonella enterica serovar Heidelberg. PLoS One 13:e0192233. doi: 10.1371/journal.pone.0192233.
11. Kaas RS, Leekitcharoenphon P, Aarestrup FM, Lund O. 2014. Solving the problem of comparing whole bacterial genomes across different sequencing platforms. PLoS One 9:e104984. doi: 10.1371/journal.pone.0104984.
12. Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120. doi: 10.1093/bioinformatics/btu170.
13. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. 2016. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17:132. doi: 10.1186/s13059-016-0997-x.
14. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AA, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19:455–477. doi: 10.1089/cmb.2012.0021.
15. Wood DE, Salzberg SL. 2014. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15:R46. doi: 10.1186/gb-2014-15-3-r46.
16. Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760. doi: 10.1093/bioinformatics/btp324.
17. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079. doi: 10.1093/bioinformatics/btp352.
18. Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595. doi: 10.1093/bioinformatics/btp698.
19. Garrison E, Marth G. 2012. Haplotype-based variant detection from short-read sequencing. arXiv 1207.3907.
20. Prim RC. 1957. Shortest connection networks and some generalizations. Bell Syst Tech J 36:1389–1401. doi: 10.1002/j.1538-7305.1957.tb01515.x.
21. Treangen TJ, Ondov BD, Koren S, Phillippy AM. 2014. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biol 15:524. doi: 10.1186/s13059-014-0524-x.
22. Junier T, Zdobnov EM. 2010. The Newick Utilities: high-throughput phylogenetic tree processing in the UNIX shell. Bioinformatics 26:1669–1670. doi: 10.1093/bioinformatics/btq243.
23. García-Alcalde F, Okonechnikov K, Carbonell J, Cruz LM, Götz S, Tarazona S, Dopazo J, Meyer TF, Conesa A. 2012. Qualimap: evaluating next-generation sequencing alignment data. Bioinformatics 28:2678–2679. doi: 10.1093/bioinformatics/bts503.
24. Gurevich A, Saveliev V, Vyahhi N, Tesler G. 2013. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29:1072–1075. doi: 10.1093/bioinformatics/btt086.
25. Huang W, Li L, Myers JR, Marth GT. 2012. ART: a next-generation sequencing read simulator. Bioinformatics 28:593–594. doi: 10.1093/bioinformatics/btr708.
26. Wells M, Lasek-Nesselquist E, Schoonmaker-Bopp D, Baker D, Thompson L, Wroblewski D, Nazarian E, Lapierre P, Musser KA. 2018. Insights into the long-term persistence of Legionella in facilities from whole-genome sequencing. Infect Genet Evol 65:200–209. doi: 10.1016/j.meegid.2018.07.040.
27. Salipante SJ, SenGupta DJ, Cummings LA, Land TA, Hoogestraat DR, Cookson BT. 2015. Application of whole-genome sequencing for bacterial strain typing in molecular epidemiology. J Clin Microbiol 53:1072–1079. doi: 10.1128/JCM.03385-14.
28. Kumar S. 2005. Molecular clocks: four decades of evolution. Nat Rev Genet 6:654–662. doi: 10.1038/nrg1659.
29. Petkau A, Mabon P, Sieffert C, Knox NC, Cabral J, Iskander M, Iskander M, Weedmark K, Zaheer R, Katz LS, Nadon C, Reimer A, Taboada E, Beiko RG, Hsiao W, Brinkman F, Graham M, Van Domselaar G. 2017. SNVPhyl: a single nucleotide variant phylogenomics pipeline for microbial genomic epidemiology. Microb Genom 3:e000116. doi: 10.1099/mgen.0.000116.
30. Darling ACE, Mau B, Blattner FR, Perna NT. 2004. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 14:1394–1403. doi: 10.1101/gr.2289704.

Data Availability Statement

Raw Illumina reads were downloaded from the SRA database at NCBI under the BioProject accession numbers PRJNA345011 (New York City outbreak) and PRJNA483184 (Health Facility study). Genomes for the Legionella pneumophila reference strains Paris (CR628336.1), Philadelphia-1 (CP013742.1), and F4469 (CP014760) were downloaded from GenBank. The LegioCluster software (version 1.0.0) is available at GitHub (https://github.com/WHaasNY/LegioCluster).


Articles from Journal of Clinical Microbiology are provided here courtesy of American Society for Microbiology (ASM)