With more than one million bacterial genome sequences uploaded to public databases in the last 25 years, genomics has become a powerful tool for studying bacterial biology. Here, we review recent approaches that leverage large numbers of whole genome sequences to decipher the spread and pathogenesis of bacterial infectious diseases.
It has been 25 years since the first genome of a free-living organism, H. influenzae Rd, was sequenced in its entirety [1]. With the significant reduction of sequencing costs, there are currently 1,067,277 bacterial genome sequences on the Sequence Read Archive (SRA) (https://www.ncbi.nlm.nih.gov/sra/) [2]. However, the archive is highly biased and incomplete. For instance, more than 72% of these genomes are from only 10 species, and only 41% of them have been assembled and submitted to GenBank [3] (Table 1; accessed 01/27/2020). The dramatic increase in the number of genomes has already spurred the development of bioinformatic tools that can process huge amounts of data, and, most importantly, extract salient biological information. Of specific interest has been the utility of these massive genomic databases for understanding the spread of infectious diseases. Here we review the approaches and methods that have emerged to use whole genome sequencing (WGS) in understanding pathogenesis, outbreaks, and transmission of bacterial infections. Phylogenetic trees still occupy a central place in defining how genomes are related, but a growing set of comparative methodologies, both tree-based and tree-free, allows sophisticated genome comparison and functional prediction (see Box 1 for a non-exhaustive list of WGS tools). We also highlight some recent instances when WGS techniques have addressed real-world problems in infection spread and pathogenesis. Although comparative approaches have been used to study all kinds of infectious agents, including viruses, parasites, protozoa, and fungi, we will focus here on bacteria.
Table 1.
Bacterial Species with the most genomes in the SRA.
Organism | SRA | GenBank |
Salmonella enterica | 289,439 | 186,968 |
Escherichia coli | 131,451 | 22,635 |
Streptococcus pneumoniae | 75,553 | 21,481 |
Mycobacterium tuberculosis | 64,006 | 6,695 |
Staphylococcus aureus | 63,372 | 12,413 |
Campylobacter jejuni | 43,202 | 29,839 |
Listeria monocytogenes | 36,209 | 27,136 |
Streptococcus pyogenes | 27,336 | 2,717 |
Klebsiella pneumoniae | 25,294 | 9,756 |
Neisseria meningitidis | 21,557 | 1,974 |
Neisseria gonorrhoeae | 19,509 | 657 |
Enterococcus faecium | 15,950 | 1,871 |
Campylobacter coli | 14,206 | 11,935 |
Streptococcus agalactiae | 13,131 | 1,253 |
Clostridioides difficile | 13,092 | 2,472 |
Shigella sonnei | 12,333 | 1,822 |
Pseudomonas aeruginosa | 10,931 | 5,314 |
Campylobacter sp. | 8,075 | 150 |
Acinetobacter baumannii | 7,740 | 4,197 |
Shigella flexneri | 7,564 | 659 |
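The 72% figure quoted in the text can be checked directly against Table 1. The short sketch below copies the counts for the ten most-sequenced species from the table and the SRA total from the text; the function names are ours and purely illustrative.

```python
# (SRA, GenBank) counts for the ten most-sequenced species, copied from Table 1.
TOP10 = {
    "Salmonella enterica": (289_439, 186_968),
    "Escherichia coli": (131_451, 22_635),
    "Streptococcus pneumoniae": (75_553, 21_481),
    "Mycobacterium tuberculosis": (64_006, 6_695),
    "Staphylococcus aureus": (63_372, 12_413),
    "Campylobacter jejuni": (43_202, 29_839),
    "Listeria monocytogenes": (36_209, 27_136),
    "Streptococcus pyogenes": (27_336, 2_717),
    "Klebsiella pneumoniae": (25_294, 9_756),
    "Neisseria meningitidis": (21_557, 1_974),
}

# Total bacterial genome sequences on the SRA, as stated in the text.
TOTAL_SRA = 1_067_277


def top10_sra_share(counts=TOP10, total=TOTAL_SRA):
    """Fraction of all SRA bacterial genomes contributed by the top ten species."""
    return sum(sra for sra, _ in counts.values()) / total


def assembled_fraction(species, counts=TOP10):
    """GenBank assemblies divided by SRA records for one species in the table."""
    sra, genbank = counts[species]
    return genbank / sra


if __name__ == "__main__":
    print(f"Top-10 share of SRA: {top10_sra_share():.1%}")
    for name in TOP10:
        print(f"{name}: {assembled_fraction(name):.1%} assembled")
```

The per-species ratio makes the unevenness of assembly visible: some species have most of their SRA records assembled in GenBank, while others have only a small fraction.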
Box 1. Bioinformatic Techniques Used in Comparative Genomics.
Reference mapping and variant calling, annotation and visualization
BWA [4], Minimap2 [5], SAMtools [6], FreeBayes [7], GATK [8], Snippy [9], SnpEff [10], IGV [11]
Phylogenetic Inference
BEAST [12,13], RAxML [14], PhyML [15], MrBayes [16], TNT [17]
Recombination Detection
Tree Visualization
GraPhlAn [21], Figtree [22], SplitsTree [23], Dendroscope [24]
Integrated Data Online Visualization
iTOL [25], panX [26], Phandango [27], PATRIC [28], Microreact [29], Nextstrain [30]
Pangenome, GWAS and Ancestral Reconstruction
Roary [36], bugwas [37], Scoary [38], pyseer [39], Mesquite [40]
WhatsGNU [41]
Detection of Selection
Phage identification
Antimicrobial Resistance & Virulence
ResFinder [46], CARD [47], Mykrobe [48], ABRicate [49], VFDB [50]
Plasmid Analysis
Graph Genomes
Sketching for Pairwise Distances
Origin, Spread & Detection of Epidemics and Outbreaks
Perhaps the most straightforward application of WGSs has been in detecting the origin and spread of clonal outbreaks. Indeed, WGS techniques have emerged as a new gold standard that offers superior resolution and power for molecular epidemiology.
A recent study by Copin et al. used high-resolution WGS to investigate an apparent outbreak of MRSA in an Orthodox Jewish community in Brooklyn, New York [58]. Molecular typing showed that the isolates belonged to the widespread, epidemic community-associated (CA)-MRSA USA300 clone, but could not distinguish these isolates from the ubiquitous background of this circulating strain. WGS phylogenetic comparison with other USA300 isolates from northern Manhattan and the Bronx showed that 93% of the isolates from patients living in the Orthodox community in Brooklyn formed their own clade within USA300, strongly supporting the hypothesis of a new disseminating subclone (USA300-BKV) and a potential emerging public health threat [58].
In a nosocomial carbapenem-resistant K. pneumoniae (CRKP) outbreak in 2011 at the U.S. National Institutes of Health Clinical Center, Snitkin et al. [59] used WGS to investigate the spread among 17 patients beginning in the 3 weeks after the discharge of the index patient. A combined genomic and epidemiological approach linked the outbreak to three independent transmission events from the index patient [59]. A more recent study investigating the 2008 US regional CRKP outbreak, which affected 26 health care facilities in 4 adjacent counties in Indiana and Illinois, showed the important role that interfacility patient sharing played in dissemination. WGS-based phylogenetic analysis enabled differentiation of intra- and interfacility transmission events, showing that one of the facilities had three independent importation events with two subsequent intrafacility transmission events, rather than one importation and four intrafacility events [60].
Another recent study underlined the importance of environmental sampling in identifying the source of a community outbreak of S. enterica associated with a restaurant buffet, in which WGS testing of raw food, fresh water, and food suppliers did not identify a clear source [61]. Thorough environmental sampling, however, showed that isolates from swabs of the sewer system had the same genomic profile as the outbreak isolates and grouped in the same phylogenetic clade. This study went on to identify an ineffective drainage system that acted as a bacterial reservoir for contaminated bio-aerosols. When the drainage system was remediated, the outbreak ended [61].
Phylogenetic methods also offer well-established techniques for inferring the geographic origins of outbreaks. A particularly clear example is the elucidation of the introduction of a V. cholerae outbreak in Haiti by Nepalese U.N. aid workers, where WGS-based phylogenetic analysis of isolates from the Haitian outbreak suggested that they were more closely related to isolates from Nepal than to other Western Hemisphere V. cholerae isolates [62,63]. On the flip side, phylogenetic analysis of USA300 isolates from South America, which had been thought to be an extension of the North American epidemic of CA-MRSA, showed that isolates from the two regions diverged prior to the current epidemic and represent two separate, parallel epidemics [64].
When there is not enough phylogenetic signal or robust enough sampling to establish the origin or direction of epidemic spread, other population genetic methods may offer alternatives to phylogeny for establishing the origins of epidemics. For instance, the concept of range expansion was recently used to infer the founding location of the USA300 clone in the US. This technique relies on the pattern produced from the multiple founder effects at the edge of an expanding pathogen front, which creates a pattern where diversity is lowest farthest away from the origin. By searching for the geographic location that maximized the slope of the diversity gradient, Challagundla et al. identified Pennsylvania as the most probable origin of USA300 [65].
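The logic of the range-expansion approach can be illustrated with a toy sketch. This is not the method or code of Challagundla et al.; it assumes planar coordinates, a single diversity value per sampling site, and interprets "maximizing the slope of the diversity gradient" as choosing the candidate origin from which diversity declines most steeply with distance (the most negative least-squares slope). All names and data are hypothetical.

```python
import math


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all distances are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def infer_origin(sites, candidates):
    """Return (name, slope) of the candidate with the steepest diversity decline.

    sites: list of ((x, y), diversity) tuples for sampled locations.
    candidates: dict mapping a candidate origin name to its (x, y) position.
    """
    diversities = [diversity for _, diversity in sites]
    best = None
    for name, location in candidates.items():
        distances = [math.dist(location, position) for position, _ in sites]
        slope = ols_slope(distances, diversities)
        # Assumption: the true origin shows the most negative slope.
        if best is None or slope < best[1]:
            best = (name, slope)
    return best


if __name__ == "__main__":
    # Hypothetical transect: diversity falls by 0.1 per unit distance from x = 0.
    toy_sites = [((float(x), 0.0), 1.0 - 0.1 * x) for x in range(5)]
    toy_candidates = {"west": (0.0, 0.0), "middle": (2.0, 0.0), "east": (4.0, 0.0)}
    print(infer_origin(toy_sites, toy_candidates))
```

On real data, geographic distances would be great-circle rather than planar, and diversity estimates carry sampling error, so the regression would need to be evaluated over a grid of candidate locations with appropriate uncertainty estimates.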
While many outbreak WGS studies are retrospective out of necessity, WGS could offer a prospective strategy for detecting new emerging clones [66–68]. Real-time, prospective WGS of Listeria monocytogenes isolates collected from patients, food, and food processing environments has recently been used to uncover the origins of listeriosis outbreaks [66]. The Listeria whole genome sequencing project leveraged raw sequence data and metadata collected by multiple collaborating agencies (CDC, FDA, USDA-FSIS, amongst others) [66], and used whole genome MLST (wgMLST) phylogenies in several outbreak investigations [69–73]. In one outbreak, despite having different pulsed-field gel electrophoresis (PFGE) patterns, wgMLST of patient isolates showed they were highly similar and traceable to isolates from an ice cream producer. Overall, the initiative showed that the number of clusters detected and number of outbreaks “solved” increased by 1.5 and 4.5 times, respectively, by the second year of using WGS compared to pre-WGS technologies [66].
WGS analysis can also allow for tracking of epidemics at the most granular level, that is, in specific transmission events between patients. Differences shared by the donor and recipient, but not present in other circulating strains, provide strong evidence for transmission, so it is critical that the rates of evolution are fast enough to provide new genetic changes that “mark” donor/recipient pairs and that are discoverable using phylogenies or other network approaches. Such transmission networks have been used in the tuberculosis field [74], among others [59,60], to determine the direction of transmission.
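The idea of variants that "mark" a donor/recipient pair can be sketched in a few lines. The example below assumes each isolate has already been reduced to a set of variant calls against a common reference; the isolate names and variant labels are hypothetical, and the function is an illustration rather than any published transmission-inference method.

```python
def pair_specific_variants(variants, donor, recipient):
    """Variants shared by donor and recipient but absent from every other isolate.

    variants: dict mapping isolate name -> set of variant labels.
    """
    shared = variants[donor] & variants[recipient]
    background = set().union(
        *(calls for name, calls in variants.items() if name not in (donor, recipient))
    )
    return shared - background


if __name__ == "__main__":
    # Hypothetical variant calls relative to a common reference genome.
    toy_variants = {
        "patient_A": {"pos1021:C>T", "pos5540:G>A", "pos9002:A>G"},
        "patient_B": {"pos1021:C>T", "pos5540:G>A", "pos7310:T>C"},
        "background_1": {"pos1021:C>T"},
        "background_2": {"pos1021:C>T", "pos3300:G>T"},
    }
    print(pair_specific_variants(toy_variants, "patient_A", "patient_B"))
```

A non-empty result supports, but does not prove, a link between the pair; the strength of the inference depends on how thoroughly the circulating background has been sampled.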
Patient-to-patient transmission has recently become controversial in the nontuberculous mycobacteria field, in which WGS reports of M. abscessus infection in cystic fibrosis patients suggested episodes of patient-to-patient transmission and worldwide dissemination of a single clone [75–78]. Subsequent reports have challenged the importance of person-to-person spread using WGS data [79–81]. While this controversy is unresolved, the analytical and theoretical hurdles have been instructive. First, it seems unlikely that there is a single cut-off value (e.g., a number of SNP differences) that can clearly identify which isolates are the same and which are different. Second, comprehensive sampling of the environment and other patients, perhaps in other locations, is critical to “ruling in” transmission. The more genomes that do not cluster tightly with the putative transmitted isolate genomes, the more robust the study.
Another key parameter in studies of transmission is the genomic diversity and heterogeneity of the pathogen in each host, and more than one isolate per case may need to be sequenced to avoid reconstruction of inaccurate events [82,83]. For instance, in the aforementioned study by Snitkin et al., WGS of isolates from different body sites of the index patient was valuable for reconstructing three independent transmission events to 17 colonized patients [59]. Another case in point is a recent study [84] that re-examined isolates from a 2011–2012 TB outbreak in the Canadian Arctic [85]. The initial study had identified two subgroups of isolates that could be differentiated from each other by a single polymorphism, pointing to two distinct source patients [85]. Deeper sequencing of mixed communities (sweeps of multiple colonies from media plates) from a single patient sample identified that one source patient harbored genomes with both polymorphisms, leading to the possibility that this individual was a super-spreader, who likely transmitted to a third of the patients during the outbreak [84]. Another recent study used WGS to establish a direct link between probiotic use and Lactobacillus bacteremia in ICU patients. In this study, genomic heterogeneity of L. rhamnosus in blood isolates reflected the genomic heterogeneity in the probiotic capsules [86].
Inferring Function & Pathogenesis: Targeted Searches, Ancestral Reconstruction, and GWAS
The above techniques focus on defining relatedness, but WGS becomes even more powerful when we can predict biological function. For instance, WGS can characterize the antibiotic “resistome” of a strain by detecting known genes or variants associated with drug resistance. This is especially helpful in pathogens that are difficult to grow in a timely manner, which has been demonstrated in same-day tuberculosis diagnosis and antibiotic susceptibility prediction from culture-free respiratory samples [87]. Likewise, a “virulome”, or the set of all genes or variants encoding known virulence factors, might be used to determine the pathogenic potential of an organism [49,50]. For instance, isolates of Salmonella serovars from different sources (animal, human and environment) that share the same virulence genes might suggest the capacity of the animal and environmental isolates to cause infection in humans when transmitted [88]. However, preliminary studies of virulomes suggest that reliable interpretations may be complicated by diverse pathogen virulence profiles [89], and at least one study examining S. aureus bacteremia has shown no correlation between the virulome and clinical outcomes [90]. More clinical, longitudinal studies are needed to address the utility of this approach [91].
In a less directed way, WGSs can be used to highlight specific genes that may have contributed to the evolution of a disease. One goal is to find genes or genetic changes that are associated with critical evolutionary events, which is most commonly done within the framework of ancestral reconstruction on a phylogenetic tree [64]. Genomic features that were acquired on a branch that represents an emerging epidemic, and were maintained in many of the descendant genomes, are strong candidate loci that could encode critical biological traits that made an outbreak strain more fit. For an outbreak, fitness might be multifactorial, including enhanced transmission, persistence, virulence, or immune evasion, so it is worthwhile considering genes with multiple functions. For instance, in the case of the clone USA300-BKV, an inactivating SNP in the transcriptional repressor of the pyrimidine nucleotide biosynthetic operon (pyrR) was likely important for commensal metabolic fitness [58]. In the parallel USA300 epidemics in North and South America, independent acquisition of copper detoxification loci may have led to increased survival upon copper challenge in the environment and in macrophages [92–94].
A key advantage of using WGSs in identifying genes involved in the emergence of new disease-causing strains is the ability to detect new additions to the genome. Horizontal gene transfer and the acquisition of mobile genetic elements appear to play a critical role. In the USA300-BKV clade, a prophage variant of ϕ11 was present in almost half of the isolates, and the presence of this phage was shown to enhance virulence in a murine skin model [58]. A recent study that investigated 109 isolates of outbreak-associated Clostridium perfringens from England and Wales over a 7-year period showed that a specific enterotoxin-producing clade (CPE) was responsible for 9 different food-poisoning outbreaks [95]. Surprisingly, although most C. perfringens enterotoxin had been thought to be chromosomally encoded, 83% of the outbreak strains carried enterotoxin-encoding (cpe) plasmids that had previously been thought to be relatively uncommon. Interestingly, the presence of the plasmid in phylogenetically distinct strains may indicate horizontal transfer by conjugation [95].
Another approach to ascertain function from WGSs is to use genome-wide association studies (GWAS). GWAS allows genetic features to be identified that are associated with some phenotype or clinical outcome. Sheppard et al. recently used GWAS in Campylobacter isolates to identify a genetic region for vitamin B5 biosynthesis that likely represents an adaptation to a diet of grasses in cattle [96]. Another large study by Levy et al. showed that genes for proteins involved in carbohydrate metabolism and transport are enriched in plant-associated bacteria compared to non-plant-associated genomes, and that a novel T6SS effector operon, involved in direct bacterial competition, was associated with the phytopathogenic bacteria of the genus Acidovorax [97]. In the S. aureus field, GWAS has been used to identify genomic loci associated with poor outcomes in bacteremia, but, importantly, the associations only held in some subdivisions (clonal complexes) of the species, suggesting that specific genetic backgrounds have a large impact on pathogenicity, and arguing that phylogeny is critical for interpreting GWAS [98]. Indeed, a problem that has dogged GWAS in bacterial infectious disease is that phylogenetic relatedness of strains can strongly confound statistical inference of associations; however, several techniques have recently become available that account for underlying phylogeny [99,100].
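To make the confounding problem concrete, the sketch below runs a naive gene-presence association test separately within each clonal complex, using a one-sided Fisher's exact test written with the standard library. It is a simplified illustration under our own assumptions (binary gene presence, binary outcome, hypothetical data) and is not how tools such as bugwas, Scoary, or pyseer correct for population structure.

```python
from collections import defaultdict
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact p-value for enrichment of the gene among cases.

    Table layout: a = cases with gene, b = cases without gene,
                  c = controls with gene, d = controls without gene.
    """
    cases = a + b
    carriers = a + c
    total = a + b + c + d
    denominator = comb(total, cases)
    upper = min(cases, carriers)
    numerator = sum(
        comb(carriers, k) * comb(total - carriers, cases - k)
        for k in range(a, upper + 1)
    )
    return numerator / denominator


def stratified_pvalues(isolates):
    """Run the test separately within each clonal complex.

    isolates: iterable of (clonal_complex, has_gene, poor_outcome) tuples.
    Returns a dict mapping clonal complex -> one-sided p-value.
    """
    tables = defaultdict(lambda: [0, 0, 0, 0])  # [a, b, c, d] per stratum
    for clonal_complex, has_gene, poor_outcome in isolates:
        index = (0 if has_gene else 1) if poor_outcome else (2 if has_gene else 3)
        tables[clonal_complex][index] += 1
    return {cc: fisher_one_sided(*table) for cc, table in tables.items()}


if __name__ == "__main__":
    # Hypothetical isolates: the gene tracks outcome in CC_A but not in CC_B.
    toy = (
        [("CC_A", True, True)] * 3
        + [("CC_A", False, False)] * 3
        + [("CC_B", True, True), ("CC_B", True, False),
           ("CC_B", False, True), ("CC_B", False, False)]
    )
    print(stratified_pvalues(toy))
```

Stratifying by clonal complex is only a crude guard against lineage effects; it illustrates why an association can appear in one genetic background and vanish in another.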
In-host Adaptation
Another way to detect functional properties important for pathogenesis or in-host persistence is to identify adaptive changes that happen in each host. Evolution in situ provides strong evidence for the involvement of specific genes in host adaptation, especially when seen repeatedly in different patients. A well-studied example of in-host adaptation is the development of “mucoidy”, or overproduction of alginate that impacts biofilm formation and antibiotic susceptibilities in strains of P. aeruginosa in cystic fibrosis [101]. While the development of mucoidy in different patients was described well before WGSs became widely available, whole genomes have clearly shown that this is an evolutionary step that often occurs by mutation and not by strain replacement. WGSs have also elucidated other parallel, predictable changes such as the loss of surface appendages and key regulatory networks [102]. A recent study by Riquelme et al. recapitulated many of these same observations, and also identified metabolic reprogramming as a major adaptational change for long-term P. aeruginosa infection [103]. In-host evolution has also been used to identify key genetic changes in other organisms, such as S. aureus, that often colonize for long periods of time before causing more acute infections [104,105].
An exciting new area of investigation afforded by WGSs is obtaining a better understanding of the “arms race” between the pathogen and the host, by using an integrative genomics approach that sequences both the host and the pathogen [106,107]. A recent study used an experimental host–pathogen model of the nematode Caenorhabditis elegans and its pathogen Bacillus thuringiensis to understand co-adaptation. Interestingly, the study showed that the phenotypic co-adaptation was explained by complex modifications in both the host and the pathogen [108]. A recent joint GWAS study that sequenced both human and pneumococcal genomes suggested that 70% of invasiveness could be accounted for by bacterial genomic variation, but there seemed to be no effect on severity. Human genetic variations, on the other hand, explained half of the variation in meningitis susceptibility and a third of meningitis severity [109].
Data Gained and Data Lost: Scalability and Database-Driven Technologies
Because of the sheer volume of data, and the fact that our comparative algorithms do not scale well, many of our analytical techniques require data reduction and loss. For instance, most of the phylogenetic examples above use a comparative reference composed of either core genes or a single reference genome. Any sequence data not in the reference is lost. This missing data decreases the discriminatory power for WGS analyses, and may lead to inaccurate statements about relatedness or critical gene/SNP content. Using multiple references might help gauge the sensitivity of the analytical outcomes to reference choice, but the choice of which references to try is problematic, and also does not scale well.
Ideally, we would have ways to leverage all of the data and make full-scale comparisons between all of the genomes in the database. One approach is to reduce sequences to representative “sketches”, which could be used for efficiently calculating whole genome nucleotide similarity between sequences [56]. Tools such as Mash can very quickly identify the closest genomes from large databases and cluster large numbers of genomes [110,111]. However, by reducing the data, critical data about the basis of that similarity is lost. One promising new approach for retaining individual genomic data is using a graph structure representation of the genetic variations from a population of genomes such as in the tool variation graph (vg) [54]. The use of variation graphs is still in its infancy, but a recent tool “Sequence Tube Map” by Beyer et al. makes the visualization of variation graphs easier and more intuitive [55]. Pangenomic techniques such as PIRATE [112], MetaPGN [113], and Roary [36] may also offer ways to more fully estimate WGS diversity in a computationally tractable way.
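The sketching idea can be illustrated with a minimal bottom-k MinHash in the spirit of Mash. This is a simplified sketch under stated assumptions, not the Mash implementation: it uses a BLAKE2b hash from the standard library, ignores reverse complements (canonical k-mers), and uses toy values for k and the sketch size. The distance conversion follows the formula published for Mash, D = -(1/k) ln(2j / (1 + j)).

```python
import hashlib
import math


def kmer_set(sequence, k):
    """All distinct k-mers in a sequence (forward strand only, an assumption)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def hash_kmer(kmer):
    """Deterministic 64-bit hash of a k-mer."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(sequence, k=5, size=50):
    """Bottom-k sketch: the `size` smallest k-mer hashes, sorted ascending."""
    return sorted(hash_kmer(kmer) for kmer in kmer_set(sequence, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size=50):
    """Estimate Jaccard similarity from the bottom-`size` hashes of the union."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for value in merged if value in shared) / len(merged)


def mash_distance(jaccard, k):
    """Convert a Jaccard estimate into a Mash-style distance."""
    if jaccard == 0:
        return 1.0
    return abs(-math.log(2 * jaccard / (1 + jaccard)) / k)


if __name__ == "__main__":
    a = "ACGTACGTGG"
    b = "ACGTACGTCC"
    j = jaccard_estimate(sketch(a, k=3), sketch(b, k=3))
    print(j, mash_distance(j, k=3))
```

The sketch retains only a fixed number of hashes per genome, which is what makes all-against-all comparison tractable, and also why the identity of the shared and unshared sequence is no longer recoverable from it.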
Another source of data loss comes from the need to exclude genomes from phylogenetic analyses, because computational cost rises steeply with increasing numbers of taxa. Including publicly available genomes is generally a good practice because it provides context, enhances reproducibility, and may also fill in gaps in temporal data. However, the choice of which genomes to use is not trivial, especially in well-sampled groups.
One way to access the full potential information of WGS databases is to make the database the object of query in comparative analyses. Database-driven methods like BLAST [114] have been absolutely indispensable, but become cumbersome when comparing whole genomes. We recently used a data compression approach to create a comparative tool that leverages all the diversity in the database. The tool, “WhatsGNU” (https://github.com/ahmedmagds/WhatsGNU), uses an exact-match, proteomic compression technique to remove redundant sequences, keeping one copy of each protein allele while preserving the genomes’ identifiers and all associated metadata (such as geographical location and clonal complex). This approach allows very rapid comparison of newly sequenced genomes to the compressed database to identify novel protein sequences and relate these to metadata. Recently we used this tool to detect novelty in skin-adapted S. aureus genomes [115].
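The exact-match compression idea can be sketched with a dictionary keyed by protein sequence. The example below is a conceptual illustration only and does not reproduce the WhatsGNU code, file formats, or scoring; the function names, genome identifiers, and toy sequences are hypothetical.

```python
from collections import defaultdict


def build_compressed_db(genomes):
    """Collapse identical protein sequences, keeping the genomes that carry them.

    genomes: dict mapping genome identifier -> iterable of protein sequences.
    Returns a dict mapping each distinct protein sequence -> set of genome ids.
    """
    database = defaultdict(set)
    for genome_id, proteins in genomes.items():
        for protein in proteins:
            database[protein].add(genome_id)
    return dict(database)


def query_proteome(database, proteins, metadata=None):
    """For each query protein, report exact-match count and linked metadata.

    A count of zero flags a protein allele not yet seen in the database.
    """
    metadata = metadata or {}
    report = {}
    for protein in proteins:
        hits = database.get(protein, set())
        labels = sorted({metadata[g] for g in hits if g in metadata})
        report[protein] = {"matches": len(hits), "metadata": labels}
    return report


if __name__ == "__main__":
    # Hypothetical toy proteomes and clonal-complex metadata.
    toy_genomes = {
        "g1": ["MKTAYIA", "MSLNQRW"],
        "g2": ["MKTAYIA", "MSLNQRV"],
        "g3": ["MKTAYIA", "MSLNQRW"],
    }
    toy_metadata = {"g1": "CC8", "g2": "CC5", "g3": "CC8"}
    db = build_compressed_db(toy_genomes)
    print(query_proteome(db, ["MKTAYIA", "MSLNQRW", "MNOVELX"], toy_metadata))
```

Because every allele is stored once, the database grows with allelic diversity rather than with the number of genomes, and each lookup is a single hash-table access.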
Database-driven techniques are, of course, affected by accessibility, composition, and the quality of the database itself (Box 2). Techniques are needed that can assess bias, error, and measure complexity. For instance, simple collectors’ curves of pan-genomes were instrumental in establishing the “open pan-genome concept” in which continuous horizontal gene transfer adds to the evolution of a species. Likewise, panallelome approaches such as WhatsGNU can evaluate the accumulation of new allelic variants as new genomes are sequenced (Figure 1).
Box 2. Current limitations and potential solutions.
Inconsistent annotations and inaccurate metadata
Specialized species databases such as Staphopia [116], Enterobase [117] and Pseudomonas Genome Database [118] help considerably in curation and quality control, but they take large amounts of dedicated time and funding.
Sampling bias
More than 50% of the 10,000+ assembled S. aureus genomes currently available in NCBI are from clonal complexes 5 and 8. To overcome this problem, unbiased sampling techniques are needed with a continuous assessment of the unsampled sequence diversity (e.g., Figure 1).
Data storage
Generally, data reduction [56] and compression approaches [41] along with graph genomes [54] will help in mitigating storage problems. In addition, increased adoption of cloud-computing and storage will help smaller labs with limited infrastructure.
Data shareability
Data sharing is a crucial pillar in pathogen surveillance, and it is also critical in speeding up outbreak response times [119–121]. Two of the issues facing data sharing are concern over loss of control of the data and the need to safeguard Protected Health Information, both of which delay access to data. One potential solution would be decentralized technologies [122,123], such as private or permissioned blockchains, which would allow temporarily granted access and anonymity for patient data.
A framework such as the BioCompute Object [124] that has all the tools’ versions, computational parameters, dependencies, usage and commands for a bioinformatic pipeline should be included as supplementary methods in all publications. In addition, whenever possible, multiple parallel bioinformatic pipelines should be used for the same analyses.
Figure 1.
A collector’s curve expresses the number of exact matches (unique alleles) as a function of the number of genomes sequenced. Subsets of 1000, 10000, 50000, 75000, 100000, 125000, 150000, 175000 and 200000 genomes were randomly selected from the 216,642 S. enterica genomes available on Enterobase [117]. The random sampling step was done three times independently with replacement. Error bars are shown in green. Note that although the slope of the curve becomes less steep as more genomes are added, the curve does not plateau, reflecting unsampled diversity and/or the continuing generation of allelic diversity through evolution.
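A collector's curve of this kind can be computed with a short resampling loop. The sketch below is an illustration under our own assumptions, not the code used to produce Figure 1: each genome is represented as a set of allele identifiers, genomes within a subset are drawn without replacement, and each replicate is drawn independently of the others. All names and data are hypothetical.

```python
import random


def collectors_curve(genomes, sizes, replicates=3, seed=0):
    """Count unique alleles in random subsets of genomes.

    genomes: dict mapping genome identifier -> set of allele identifiers.
    sizes: iterable of subset sizes to evaluate.
    Returns a dict mapping subset size -> list of unique-allele counts,
    one per replicate.
    """
    rng = random.Random(seed)
    genome_ids = sorted(genomes)
    curve = {}
    for size in sizes:
        counts = []
        for _ in range(replicates):
            subset = rng.sample(genome_ids, size)
            alleles = set()
            for genome_id in subset:
                alleles.update(genomes[genome_id])
            counts.append(len(alleles))
        curve[size] = counts
    return curve


if __name__ == "__main__":
    # Hypothetical genomes: one shared allele plus one private allele each.
    toy = {f"genome_{i}": {"core_allele", f"private_{i}"} for i in range(20)}
    for size, counts in collectors_curve(toy, sizes=[1, 5, 10, 20]).items():
        mean = sum(counts) / len(counts)
        print(size, counts, f"mean={mean:.1f}")
```

The spread of the replicate counts at each subset size provides the error bars, and a curve that keeps rising at the largest subset size indicates that new alleles are still being discovered.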
Conclusions and Future Directions
The use of WGS has caused a leap forward in understanding the spread of bacterial infections with enhanced resolution for tracing transmission and spread. Technologies and infrastructure that can rapidly, and prospectively, sequence whole genomes in clinical settings will change molecular epidemiology, and will likely have a direct impact on treatment, prevention, and other interventions. A powerful aspect of WGSs is that they can be used to make predictions about function. As we go forward, it will be critical to find ways to efficiently test the biological hypotheses generated by WGS analyses at the bench or in the field, moving these studies closer to fundamental mechanisms with possible interventional targets.
In the bioinformatic sphere, WGSs have brought a new appreciation for the problems associated with enormous amounts of data, and it has become clear that our current tools may not be sufficient to extract all of the pertinent information from these ever-expanding data. This will only be exacerbated by new datasets with parallel, integrative genomics (host/pathogen or host/microbiome/pathogen), or with spatial or temporal components. We need to develop new bioinformatic tools that scale well to take advantage of the enormous numbers of genomes being produced, allowing for better and more rapid inference of biologically and epidemiologically important information.
72% of the available 1,067,277 bacterial genomes on NCBI are from only 10 species.
WGS is a new gold standard for detection of clonal outbreaks and transmission events.
WGS ancestral reconstruction and GWAS allow unbiased functional prediction.
WGS of longitudinal infection can be used to detect host adaptations.
The data loss common in WGS methods may be tackled with database-driven approaches.
