Abstract
In bacteria, functionally related genes comprising metabolic pathways and protein complexes are frequently encoded in operons and are widely conserved across phylogenetically diverse species. The evolution of these operon-encoded processes is affected by diverse mechanisms such as gene duplication, loss, rearrangement, and horizontal transfer. These mechanisms can result in functional diversification, increasing the potential for the evolution of novel biological pathways and enabling pre-existing pathways to adapt to the requirements of particular environments. Despite the fundamental role that these mechanisms play in bacterial environmental adaptation, a systematic approach for studying the evolution of operon organization is lacking. Herein, we present a novel method to study the evolution of operons based on phylogenetic clustering of operon-encoded protein families and genomic-proximity network visualizations of operon architectures. We applied this approach to study the evolution of the synthase-dependent exopolysaccharide (EPS) biosynthetic systems: cellulose, acetylated cellulose, poly-β-1,6-N-acetyl-D-glucosamine (PNAG), Pel, and alginate. These polymers have important roles in biofilm formation and antibiotic tolerance, and act as virulence factors in opportunistic pathogens. Our approach revealed the complex evolutionary landscape of EPS machineries and enabled operons to be classified into evolutionarily distinct lineages. Cellulose operons show phylum-specific operon lineages resulting from gene loss, rearrangement, and the acquisition of accessory loci, as well as whole-operon duplications arising through horizontal gene transfer. Our evolution-based classification also distinguishes between PNAG production in Gram-negative and Gram-positive bacteria on the basis of structural and functional evolution of the acetylation modification domains shared by the PgaB and IcaB loci, respectively. 
We also predict several pel-like operon lineages in Gram-positive bacteria and demonstrate in our companion paper (Whitfield et al PLoS Pathogens, in press) that Bacillus cereus produces a Pel-dependent biofilm that is regulated by cyclic-3’,5’-dimeric guanosine monophosphate (c-di-GMP).
Author summary
In bacterial genomes, biological processes are frequently encoded by neighbouring co-transcribed genes, termed operons. In addition, operon-associated genes often belong to distinct evolutionary families with diverse biological functions. Studying the evolution of bacterial operons provides valuable insight into the biological significance of genes involved in environmental adaptation. To date, no systematic approach has been devised to examine both the complex evolutionary relationships of operon-encoded genes and the evolution of operon organization as a whole. To address this challenge, we developed an integrative method to study operon evolution by combining phylogenetic tree-based clustering and genomic-context networks. We applied this method to perform the first systematic survey of all known synthase-dependent exopolysaccharide biosynthetic machineries, demonstrating the generalizability of our approach for operons of diverse size, protein family composition, and species distribution. Our method identified distinct biofilm operon clades across phylogenetically diverse bacteria that result from gene rearrangement, duplication, loss, fusion, and horizontal gene transfer. We found different evolutionary trajectories for Gram-negative and Gram-positive PNAG biofilm production machineries, and in a companion paper (Whitfield et al PLoS Pathogens, in press) present experimental validation that the Pel polysaccharide is produced by a Gram-positive bacterium.
Introduction
The generation of novel genomes through next-generation sequencing is creating a wealth of opportunities for understanding the evolution of biological systems. A key challenge is the development of robust and systematic approaches that allow genes to be classified into functional categories and that are also capable of inferring evolutionary relationships. In bacterial genomes, functionally related genes corresponding to metabolic pathways or protein complexes are often encoded by neighbouring co-regulated and co-transcribed loci, termed an operon. Computational prediction of operons based on the conservation of short intergenic distances between homologous genes across phylogenetically diverse bacteria has frequently been used to predict the biological roles of neighbouring uncharacterized genes [1–6]. Such approaches are valuable for computational inference of gene function in experimentally uncharacterized organisms and facilitate comparative genomics of adaptive traits across phylogenetically diverse bacteria.
Analyzing patterns of sequence divergence within each gene yields insights into species-specific functionalities. However, genes in an operon do not function in isolation but typically form parts of higher-order, biological modules (e.g. protein complexes or metabolic pathways). Consequently, analysing evolutionary events in an operonic context provides additional opportunities to better infer functional relationships. For example, while sequence divergence has the potential to impact the function of a single gene, evolutionary events that alter operon structure (e.g. rearrangements, duplications, gains and losses) have the potential to dramatically alter the overall function of the operon [7,8].
Due to the lack of a systematic framework, very few studies have attempted to examine the influence of evolutionary events on operon structure [9,10]. Phylogenetic-tree based classification of 197 ATP binding motif sequences associated with operon-encoded bacterial ATP-binding cassette (ABC) transporters was successful in resolving two evolutionarily distinct transporter clades associated with import and export functions [11]. Gene duplications have been shown to play an important role in driving protein superfamily expansion and are positively correlated with bacterial genome size [12]. Duplications have been found to be associated with biological processes associated with environmental adaptation of species clusters [13], such as outer membrane polysaccharide export proteins involved in capsule biosynthesis [14,15], and amplification of beta-lactamase enzymes associated with increased antibiotic resistance [16]. The study of co-localized “gene blocks” across bacteria has also shown that gene duplication, loss, and rearrangement play important roles in shaping the large-scale organization of bacterial genomes [10]. Key to these analyses is the use of a rigorous and systematic approach for assigning genes into evolutionarily related ‘families’ that are likely to share similar functions. However, the inference of biological function based on sequence similarities of genes or proteins is often complicated by functional divergence arising through recent gene duplication events. A variety of metrics have been employed for determining the relatedness of genes and their protein products from which groups (i.e. clustering) can be defined. These metrics include: evolutionary distances derived through the construction of phylogenetic trees [17–19]; global protein sequence similarities [20–22]; and shared sequence features such as conserved amino acids at specific sites or shared amino acid subsequences, which define motifs or structural domains [23,24]. 
The aim of these approaches is to automatically resolve large protein families comprising potentially thousands of genes into a smaller number of clusters defining evolutionarily related subfamilies with similar biological roles.
An additional challenge faced by clustering methodologies is defining which set of clusters result in an optimal partitioning of the underlying data. To help guide such partitioning, a variety of cluster validation approaches have been devised. These are broadly divided into two categories: external-validation and internal-validation, based on whether previous information is available for the data being clustered [25]. For example, methods developed for classifying orthologous genes (i.e. those related by common ancestry) and paralogous genes (i.e. those emerging from duplication after a speciation event) rely on internal-validation tests. In one such approach, internal branch lengths between one-to-many homologous gene relationships are compared between two species [26]. In an alternative approach, clustering is employed to define “triangles” of proteins with significant sequence similarity occurring between three distinct species [20,27]. However, to distinguish the finer-scale evolutionary relationships occurring within an orthologous group or gene family, phylogenetic tree construction is required. Such methods have typically focused on well-characterized biological systems, e.g. homologs of the bacterial flagellar and type III secretion system subunits [28] and diverse systems associated with the type IV filament superfamily [29], which have utilized an external-validation approach for defining functionally distinct phylogenetic clades.
Here we build on these methods and present a framework for the systematic classification and analysis of diverse gene families in the context of operons. Focusing on synthase-dependent exopolysaccharide (EPS) biosynthetic machineries, we use our framework to explore how gene divergence in combination with duplication, loss, and rearrangement events have shaped the evolution of EPS operons, and may have influenced the biofilm producing capabilities of evolutionarily diverse bacteria.
EPS are an important component of bacterial biofilms that not only ensure survival in response to limited nutrient availability, but are also involved in antibiotic tolerance, immune evasion and serve as virulence factors in many clinically relevant pathogens [30–32]. Distinct mechanisms have been identified in the production of bacterial EPS, including the well-characterized Wzx/Wzy and ABC transporter-dependent pathways [33], and synthase-dependent systems [34].
Typically, Gram-negative synthase-dependent EPS systems are organized as discrete operons comprised of genes encoding: 1) an inner membrane associated polysaccharide synthase; 2) a regulatory domain or co-polymerase subunit responsible for binding the intracellular signaling molecule cyclic-3’,5’-dimeric guanosine monophosphate (c-di-GMP); 3) periplasmic polysaccharide modification enzymes; and 4) a periplasmic tetratricopeptide repeat (TPR) domain coupled with an outer membrane pore [34]. This operonic organization allows bacteria to acquire complete EPS functionality through discrete lateral gene transfer events and may act as a key driver in niche adaptation [35]. To date five synthase-dependent EPS have been identified: cellulose, acetylated cellulose, poly-β-1,6-N-acetyl-D-glucosamine (PNAG), alginate and the Pel polysaccharide. While much interest has focused on the molecular basis of biofilm formation, these systems have been characterized for only a relatively limited set of bacterial species. Consequently, little is known concerning how these systems have evolved. Of interest is how mechanisms such as gene divergence, duplication, acquisition, loss, and rearrangement of EPS operons have contributed to bacterial adaptation to diverse environments, and from a human health perspective, contributed to a pathogen’s ability to infect and cause disease. While a previous survey of cellulose EPS machineries has been reported [36], a comprehensive systematic analysis of all EPS machineries is lacking.
In this study, we describe a phylogenetic tree-based clustering method for defining protein sequence subfamilies and apply it to study the evolutionary relationships of operons. This method was employed for the systematic classification of EPS operons predicted from a survey of over a thousand bacterial genomes. Applying a graphical visualization approach, we demonstrate that phylogenetic clustering enables the resolution of discrete EPS operon clades which differ in their organization from experimentally characterized operons, providing valuable insights toward further understanding the roles of gene duplication, rearrangement, and loss/absence in the evolution of biofilm production between phylogenetically diverse species from distinct environmental niches. For example, we demonstrate the biological implications of operon evolution that has been shaped by horizontal gene transfer (HGT) and subsequent divergence, for two cellulose operon clades among Proteobacteria which correspond to the production of cellulose polymers with different structural organizations. Furthermore, we note that most of our operon predictions are novel and demonstrate the value of applying computational predictions to guide the discovery of EPS production in previously uncharacterized species. We highlight an example for Pel production, which was initially identified and characterized in Pseudomonas aeruginosa [37] and other Gram-negative bacteria. Our approach identified several pel-like operons in some Bacillus spp. and other Gram-positive bacteria, which appeared to be regulated by the intracellular signaling molecule c-di-GMP. In our companion paper (Whitfield et al PLoS Pathogens, in press) we experimentally validate these findings by demonstrating the production of Pel by the Gram-positive Bacillus cereus ATCC 10987 and its regulation by c-di-GMP.
Results
A systematic survey of bacterial EPS operons reveals EPS systems across bacteria of diverse lifestyles and environmental niches
A schematic overview of the pipeline we used for classifying bacterial operons is provided in Fig 1A. The process begins with a systematic survey of all five previously characterized synthase-dependent EPS systems (cellulose, acetylated cellulose, PNAG, Pel, and alginate) (S1 & S2 Tables) through an iterative hidden Markov model (HMM)-based search strategy and subsequent genomic-proximity-based reconstruction of operons across 1733 complete reference and representative bacterial genomes (downloaded April 20, 2015; see Methods). We identified 407 cellulose, 321 PNAG, 146 Pel, 64 alginate, and 4 acetylated cellulose EPS “operons”, defined as comprising at least: 1) a polysaccharide synthase subunit; and 2) one additional locus involved in EPS modification or transport as defined previously [30] (S3 Table). These could be allocated to 367, 288, 140, 60 and 4 different bacterial species, respectively (Fig 1). We note that for all previously characterized EPS-producing species included in this set of fully sequenced genomes, we successfully detected an operon corresponding to the type of EPS produced (S4 Table). Furthermore, for experimentally characterized species lacking a fully sequenced genome, we also identified several closely related strains possessing an EPS operon identical to that of the characterized strain, providing a valuable resource for further experimental validation. PNAG was significantly enriched in pathogen genomes (161/288, 56%; Chi-squared test p-value = 2.05e-09). Conversely, Pel (84/140, 60%; Chi-squared test p-value ~ 1.83e-4), alginate (39/60, 65%; Chi-squared test p-value ~ 3.5e-2) and cellulose (187/367, 51%; Chi-squared test p-value = 2.05e-14) were significantly enriched in non-pathogen genomes (Fig 1C and S1 Fig). Interestingly, both cellulose and PNAG operons were significantly associated with genomes with host-associated lifestyles (Chi-squared p-values ~ 1.05e-9 and ~1.48e-12, respectively). 
From a phylogenetic perspective, each EPS system was well represented by Proteobacteria, with cellulose, PNAG and Pel additionally featuring operons from Bacilli and Clostridia, which to our knowledge have not been previously reported (Fig 1D). Further, we note that Pel operons exhibited the greatest diversity of bacterial families (Shannon index of bacterial families = 2.74), with representation in Thermotogae, Actinobacteria and Rubrobacteria, among others. While most genomes contain only a single synthase-dependent EPS system, we observed many instances of co-occurrence (Fig 1E), with cellulose and PNAG systems being the most common combination (83 genomes), followed by alginate and Pel (20 genomes). Notably, all species possessing three systems were Pseudomonas spp., e.g. Pseudomonas protegens strains Pf-5 and CHA0 (alginate, Pel and PNAG), and Pseudomonas fluorescens SBW25 and Pseudomonas sp. TKP (acetylated cellulose, alginate, and PNAG).
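The two summary statistics used in this survey, the Shannon index and the Chi-squared enrichment test, can be illustrated with a minimal sketch. It assumes a natural-log Shannon index and a Pearson Chi-squared test on a 2x2 table without continuity correction; the text does not state which variants were used. The function names are hypothetical, and the counts for genomes lacking PNAG are invented placeholders, since the background counts are not given here. Only the 161 of 288 figure comes from the text.

```python
import math
from collections import Counter


def shannon_index(labels):
    """Shannon diversity H = -sum(p_i * ln p_i) over category frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def chi_squared_2x2(a, b, c, d):
    """Pearson Chi-squared test for the table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom, with no
    continuity correction (an assumption of this sketch).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the upper tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Toy family labels (hypothetical), one per predicted operon.
    families = ["Pseudomonadaceae"] * 3 + ["Bacillaceae"] * 2 + ["Thermotogaceae"]
    print("Shannon index:", round(shannon_index(families), 3))

    # Rows: species with / without a PNAG operon.
    # Columns: pathogen / non-pathogen.
    # 161 and 127 (= 288 - 161) follow the text; 400 and 900 are placeholders.
    stat, p = chi_squared_2x2(161, 127, 400, 900)
    print("Chi-squared:", round(stat, 2), "p-value:", p)
```

With real data, the second row of the table would hold the pathogen and non-pathogen counts among surveyed species that lack the EPS system in question.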
Evolution of EPS operons is driven by gene duplication, loss and rearrangements
The processes underlying EPS operon evolution across diverse bacterial phyla are poorly understood. We examined how operon organization is influenced by the following evolutionary events that are likely to affect EPS production capabilities among bacteria: 1) single locus or whole operon duplications, which could lead to dosage effects and alter the level of EPS modification or export; 2) locus losses, that may indicate a reduction or loss in EPS production or modification, or may suggest supplementation of the lost function with a novel gene; 3) operon rearrangements which may affect the regulation of EPS production through the order of expression of individual EPS system components; and, 4) gene-fusions, resulting in enhanced co-expression of interacting subunits.
For each set of predicted EPS operons, the number of operon evolutionary events was defined relative to the locus composition and order of experimentally characterized Gram-negative reference operons [30,38–41]. In contrast to previous studies of operon evolution [28,29], we use the term evolutionary events to refer to key changes that define distinct organizations between evolutionarily distinct operon clades, and not to divergence of operons from an ancestral state. With the exception of acetylated cellulose, locus losses were found to be the most frequent event (~46% of predicted operons lacked one or more reference loci), and occurred with the greatest frequency for Pel, which exhibited an average loss of 2.6 loci per operon (S5 Table). Among all EPS systems, the majority of locus losses were associated with the outer membrane pore-encoding loci (441/993, 44% of all locus loss events identified) among Gram-positive species (S5 Table), consistent with the lack of an outer membrane bilayer in Gram-positive cell envelope architectures. Operon rearrangements were the next most frequent evolutionary events (~39%), largely associated with cellulose operons [36] (327/407, 80%). Focusing on duplication events, within-operon locus duplications tended to be more common than whole-operon duplications (2 or more core EPS loci identified ≥ 1 kb apart), with the exception of cellulose operons (29 whole-operon duplications compared to 24 locus duplications). All duplicated operons were found to be separated by at least 10 kb, suggesting they may have been acquired through HGT rather than tandem duplication of a pre-existing operon [42,43].
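The comparison of a predicted operon against a reference locus order can be sketched as follows. This is a simplified illustration rather than the pipeline used in the study: the function name is hypothetical, only losses, within-operon duplications and rearrangements are scored, and a fully reversed order is treated as unrearranged on the assumption that an operon may be read from either strand.

```python
from collections import Counter


def operon_events(reference, observed):
    """Score an observed locus order against a reference locus order.

    reference: list of locus names in the reference order.
    observed:  list of locus names in genomic order for one predicted operon.
    Returns a dict with lost loci, duplicated loci, and a rearrangement flag.
    """
    ref_set = set(reference)
    counts = Counter(locus for locus in observed if locus in ref_set)
    losses = [locus for locus in reference if counts[locus] == 0]
    duplications = [locus for locus in reference if counts[locus] > 1]

    # Order of first occurrence of each reference locus in the observed operon.
    seen = []
    for locus in observed:
        if locus in ref_set and locus not in seen:
            seen.append(locus)
    expected = [locus for locus in reference if locus in seen]

    # Assumption: the reverse order reflects strand orientation, not rearrangement.
    rearranged = seen != expected and seen != expected[::-1]
    return {"losses": losses, "duplications": duplications, "rearranged": rearranged}


if __name__ == "__main__":
    reference = ["bcsA", "bcsB", "bcsZ", "bcsC"]  # canonical order, bcsABZC
    print(operon_events(reference, ["bcsA", "bcsB", "bcsC", "bcsZ"]))
    print(operon_events(reference, ["bcsA", "bcsB"]))
```

Whole-operon duplications would need genomic coordinates in addition to locus order, for example by checking whether two sets of core loci in one genome lie at least a chosen distance apart.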
Systematic phylogenetic distance-based clustering of EPS operon loci and genomic-proximity networks identifies evolutionarily distinct operon clades
To better understand how these evolutionary events may have altered operon function, we next devised an agnostic, systematic classification strategy to cluster each family of EPS operon loci on the basis of phylogenetic distance (Fig 2A; see Methods). In brief, for each EPS operon locus, multiple sequence alignments were generated and used to construct phylogenetic trees. From these trees, we defined sets of clusters through an iterative scan of the tree structure that captures an increasing sequence distance between family members, starting at the leaves and ending at the root. During this scan, sequences are grouped into a cluster if they share a common node (i.e. are within a specified evolutionary distance). To define the optimal set of clusters for each locus, we then applied three cluster quality scoring schemes (Q1, Q2 and Q3) based on the following metrics: proportion of sequences clustered (to maximize the number of sequences clustered); average silhouette score (to minimize the occurrence of clusters containing highly divergent sequences); and the Dunn index (to maximize the separation of closely related sequences from divergent sequences). For each scoring scheme, we defined the optimal pattern of clustering based on the evolutionary distance (expected number of substitutions per site) derived from a maximum-likelihood constructed phylogenetic tree (see Methods for more details) that maximizes the quality score. Comparisons across scoring schemes (see below) for cellulose operon loci identified Q2 as providing the most informative sets of clusters. Applying this scoring scheme to all EPS loci revealed the average number of sequence clusters generated correlated with the total number of operons predicted for each type of EPS system (Fig 2B), which further corresponded to the underlying differences in species distributions of EPS systems (Fig 1D). 
For example, the cellulose system was predicted to have the largest average number of sequence clusters overall (30 clusters) and also had the greatest species diversity (Shannon index 2.16; S2 Fig) compared to all other systems. Furthermore, for each EPS system the variability of the number of sequence clusters predicted per locus (Fig 2C) suggests differing degrees of locus evolution that are likely to be the result of different structural and functional constraints. For example, a higher degree of conservation would be expected for glycosyl transferase (GT) subunits to maintain efficient co-ordination between polymerization and inner membrane transport of EPS, while increased variability of periplasmic modification enzymes suggests that only a subset of highly conserved motifs are required to carry out polysaccharide modification reactions.
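The three quality metrics named above (proportion of sequences clustered, average silhouette score, and Dunn index) can be illustrated on a toy example. The sketch below is not the authors' implementation: it substitutes single-linkage grouping on a pairwise distance matrix for the leaf-to-root tree scan described in the text, and all function names, distances and cutoffs are hypothetical.

```python
def threshold_clusters(dist, cutoff):
    """Single-linkage clusters: join items whenever a pairwise distance <= cutoff."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= cutoff:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def proportion_clustered(clusters, n):
    """Fraction of sequences placed in a cluster with at least one other member."""
    return sum(len(c) for c in clusters if len(c) > 1) / n


def mean_silhouette(dist, clusters):
    """Average silhouette; singletons score 0 by convention."""
    if len(clusters) < 2:
        return 0.0
    scores = []
    for c in clusters:
        for i in c:
            if len(c) == 1:
                scores.append(0.0)
                continue
            a = sum(dist[i][j] for j in c if j != i) / (len(c) - 1)
            b = min(sum(dist[i][j] for j in o) / len(o) for o in clusters if o is not c)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def dunn_index(dist, clusters):
    """Smallest between-cluster distance divided by largest within-cluster distance."""
    if len(clusters) < 2:
        return 0.0
    diameter = max(dist[i][j] for c in clusters for i in c for j in c)
    separation = min(dist[i][j] for x, c in enumerate(clusters)
                     for d in clusters[x + 1:] for i in c for j in d)
    return separation / diameter if diameter > 0 else float("inf")


if __name__ == "__main__":
    # Toy distances (substitutions per site): two tight groups, far apart.
    dist = [
        [0.0, 0.1, 0.1, 1.0, 1.0],
        [0.1, 0.0, 0.1, 1.0, 1.0],
        [0.1, 0.1, 0.0, 1.0, 1.0],
        [1.0, 1.0, 1.0, 0.0, 0.2],
        [1.0, 1.0, 1.0, 0.2, 0.0],
    ]
    for cutoff in (0.15, 0.5, 1.0):
        clusters = threshold_clusters(dist, cutoff)
        print(cutoff, clusters,
              proportion_clustered(clusters, len(dist)),
              round(mean_silhouette(dist, clusters), 3),
              dunn_index(dist, clusters))
```

Scanning a range of cutoffs and keeping the one that maximizes a chosen score mirrors, in a simplified form, the selection of an optimal evolutionary distance per scoring scheme.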
To compare patterns of clusters identified by each scoring scheme, we applied the three scoring schemes to each set of genomically neighbouring protein sequences assigned by HMM searches to the cellulose EPS machinery. Here we focused on four core cellulose genes: bcsA, bcsB, bcsZ, and bcsC, encoding the inner membrane GT, co-polymerase, periplasmic hydrolase and outer membrane export pore, respectively [36]. While previous studies have identified additional genes associated with the production of cellulose biofilms (e.g. bcsD, bcsE, bcsF, bcsG and bcsQ; [36]), to simplify our analyses we limited our query to bcsA, bcsB, bcsC, and bcsZ based on the observation that these are the most abundant genes when comparing all identified cellulose biosynthesis operons. Although this may bias the diversity of operons identified, given that a functional cellulose biosynthesis locus should contain the synthase genes bcsA and bcsB, we expect our analysis to result in few false negative predictions. From the resulting clusters we generated operon genomic-proximity networks (Fig 2D). These networks provide a visual display of the conservation of individual loci, together with their respective genomic proximity, to yield patterns of sequence divergence associated with the emergence of distinct forms of operon organization. In the absence of any clustering (Q0), the network trivially resolves into individual operons featuring up to four loci. Applying the Q1 scoring scheme to each locus, the network reveals a variable number of clusters across operon loci, with each cluster generally comprising sequences belonging to the same bacterial genus. Application of the Q2 scoring scheme results in the generation of clusters of increased size, encompassing species featuring distinct operon organizations and compositions. 
For example, two distinct lineages of alpha-proteobacterial cellulose operons can be easily distinguished, one of which is more closely related in sequence and composition to gamma-proteobacterial operons, and a second which lacks two loci and appears evolutionarily divergent from gamma-proteobacterial operons [25]. However, these distinctions were more difficult to resolve using the Q3 scoring scheme due to clustering of highly divergent sequences. Given the trade-off between clustering highly divergent sequences (Q3) with the depiction of individual operons (Q1), we applied the Q2 scoring scheme to generate clusters for all EPS loci (S6 Table).
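A genomic-proximity network of this kind can be represented as an edge list linking locus clusters that sit next to each other in at least one genome. The sketch below is only one plausible encoding, not the representation used in the study. Nodes are hypothetical (locus family, cluster id) pairs, each edge stores the intergenic gaps observed across operons, and the coordinates are invented for illustration.

```python
from collections import defaultdict


def proximity_network(operons):
    """Build an undirected edge list from predicted operons.

    operons: list of operons; each operon is a list of
             (locus_family, cluster_id, start, end) tuples.
    Returns a dict mapping a sorted pair of nodes to the list of
    intergenic gaps (in bp) observed between them.
    """
    edges = defaultdict(list)
    for genes in operons:
        ordered = sorted(genes, key=lambda g: g[2])  # sort by start coordinate
        for left, right in zip(ordered, ordered[1:]):
            u = (left[0], left[1])
            v = (right[0], right[1])
            gap = max(0, right[2] - left[3])  # overlapping genes count as gap 0
            edges[tuple(sorted((u, v)))].append(gap)
    return dict(edges)


if __name__ == "__main__":
    # Hypothetical coordinates for two operons that share the same clusters.
    operons = [
        [("bcsA", 1, 100, 2700), ("bcsB", 1, 2710, 5000)],
        [("bcsB", 1, 50, 2300), ("bcsA", 1, 2350, 5000)],
    ]
    for (u, v), gaps in proximity_network(operons).items():
        print(u, v, "operons:", len(gaps), "mean gap:", sum(gaps) / len(gaps))
```

The number of gaps recorded per edge gives the count of operons supporting an adjacency, and the mean gap gives a simple measure of genomic proximity that could be drawn as edge length or weight.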
Using this locus-specific phylogenetic clustering approach, we were able to devise a classification scheme to define EPS locus clades based on the average evolutionary distance of a group of clustered locus sequences to a reference operon sequence (S3 Table). For example, the cellulose polysaccharide synthase locus, bcsA, from Escherichia coli is assigned to clade 1, while divergent alpha-proteobacterial species including Rhodobacter sphaeroides are assigned to clade 2. We further resolved operons into distinct groups based on the genomic co-occurrence patterns of locus clades; e.g. for the cellulose operon (bcsABZC) we identify clade combinations of 1:1:1:1, 1:2:2:2 and 1:3:5:3, which correspond to operons identified in Escherichia spp. and other closely related enterobacteria, Klebsiella spp., and Burkholderia spp., respectively.
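The clade combinations described above can be computed as a simple signature string per operon. In this minimal sketch the function names and the "-" placeholder for an absent locus are assumptions. The three example signatures follow the text, whereas the operon labels and the fourth operon are purely illustrative.

```python
from collections import defaultdict


def clade_signature(locus_clades, locus_order):
    """Join per-locus clade numbers in reference order, e.g. '1:2:2:2'.

    Loci missing from the operon are written as '-' (an assumption of this sketch).
    """
    return ":".join(str(locus_clades.get(locus, "-")) for locus in locus_order)


def group_by_signature(operons, locus_order):
    """Group operon names by their clade signature."""
    groups = defaultdict(list)
    for name, locus_clades in operons.items():
        groups[clade_signature(locus_clades, locus_order)].append(name)
    return dict(groups)


if __name__ == "__main__":
    order = ["bcsA", "bcsB", "bcsZ", "bcsC"]  # bcsABZC
    operons = {
        "Escherichia": {"bcsA": 1, "bcsB": 1, "bcsZ": 1, "bcsC": 1},
        "Klebsiella": {"bcsA": 1, "bcsB": 2, "bcsZ": 2, "bcsC": 2},
        "Burkholderia": {"bcsA": 1, "bcsB": 3, "bcsZ": 5, "bcsC": 3},
        "hypothetical_partial": {"bcsA": 2, "bcsB": 4},  # illustrative only
    }
    for signature, names in group_by_signature(operons, order).items():
        print(signature, names)
```

Operons sharing a signature fall into the same group, so each distinct signature corresponds to one candidate operon clade.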
Phylogenetic clustering and genomic proximity networks reveal evolutionary events driving EPS operon divergence
Having generated clustering patterns for each EPS locus, we next used the sets of cellulose operon-associated loci, bcsA, bcsB, bcsZ, and bcsC, to examine how these patterns might inform our understanding of the evolution of this EPS system. Relative to BcsA, the three other subunits (BcsB, BcsZ and BcsC) display greater sequence diversity, as indicated by a larger number of sequence clusters (Fig 3). Detailed structure-function studies of the BcsA-BcsB inner membrane cellulose synthase complex, outlined below, illustrate how these findings are consistent with their known functional roles. Further inspection of the cellulose operon network identifies a number of sub-networks comprised of taxon-specific locus clusters associated with distinct patterns of operon organization, as illustrated through the following examples: 1) a subnetwork comprised of loci from several beta-proteobacteria, represented here by Burkholderia cenocepacia and Pandoraea pnomenusa (Fig 3(i)), which feature a rearrangement of the bcsA locus and novel locus gains (also supported by inspection of corresponding GenBank genomic annotations), as indicated by a genomic distance of > 0.1 kb between bcsA and the neighbouring locus bcsC; 2) a subnetwork composed of loci from several species of the alpha-proteobacterial genus Zymomonas, which shows rearrangement of bcsZ and/or the loss of bcsB or bcsZ (Fig 3(ii)), where further inspection reveals that such losses were due to gene fusion events; 3) a subnetwork composed of loci from a separate group of alpha-proteobacteria, which reveals a diverse set of bcsB loci that additionally lack the bcsC outer membrane pore (Fig 3(iii)); 4) a subnetwork of loci from a group of gamma-proteobacteria, which reveals instances of HGT and divergence (Fig 4A). 
In this latter example, our network identifies two distinct clades of operons, sharing a common group of bcsA loci but featuring two evolutionarily divergent sets of bcsB, bcsZ and bcsC loci, which co-occur in several genomes separated by intergenic distances greater than 10 kb. Detailed investigation of the operonic arrangements of species possessing single copies of either of these clades of operons reveals two distinct locus organizations: the first represents the canonical cellulose locus order (clade A1), bcsABZC, found among E. coli and Salmonella enterica strains, while the second represents a non-canonical locus ordering (clade B1), bcsABCZ, in which the periplasmic glycoside hydrolase, BcsZ, has undergone a rearrangement. This clade is found among Dickeya, Erwinia and Pantoea spp. (Fig 4B). Of note, we found that several species (e.g. Enterobacter and Klebsiella spp.) possess both operon clades. The clades have been suggested to have originated by HGT [19], a hypothesis further supported by our phylogenetic clustering assignments (Fig 4C). Furthermore, we identified two additional divergent BcsB sequences associated with a novel organization of operon clade B1 that includes several loci with other roles in cellulose production (designated operon clade B2; Fig 4D). The divergence of BcsB sequences associated with clade B2 was also found to distinguish bacterial genomes possessing multiple cellulose operons of distinct evolutionary lineages: Proteus mirabilis (2 cellulose operons: Clades A1 and B2) and Enterobacter spp. (3 cellulose operons: Clades A1, B1 and B3) (Fig 4E). 
Additional sequence database searches revealed that the non-core loci associated with operon clades B2 and B3 include loci functionally homologous to: the cellulose accessory protein D (AxcesD), which has been characterized as increasing the efficiency of cellulose production in the Acetobacter xylinus cellulose synthase complex [44]; GalU, a uridine triphosphate (UTP)-glucose-1-phosphate uridylyltransferase involved in cellulose precursor biosynthesis; and an additional uncharacterized locus predicted to possess both PAS_9 and GGDEF signaling domains, indicating a potential adaptation in Proteus and Enterobacter spp. to produce varied forms of cellulose in response to different environmental stimuli [45].
Genomic-proximity networks of pel operons reveal a novel pel locus in the Gram-positive bacterium Bacillus cereus that is regulated by c-di-GMP
Examination of the genomic-proximity networks of pel loci also reveals novel operon organizations across phylogenetically divergent bacteria (Fig 5). As with the cellulose loci bcsA and bcsZ, we identify examples of operon rearrangements involving the pelB (outer membrane transport pore and TPR domain) and pelA (periplasmic modification hydrolase) loci (Fig 5(ii), 5(iiib) and 5(iv)) across several species associated with diverse environments. Again, consistent with our findings for cellulose, we noted locus losses and acquisitions. Although it has not been demonstrated that the pel operon forms a trans-envelope biosynthetic complex, the ordering of operon loci has been shown to play an important role in the assembly of macromolecular complexes [46] and in optimizing biosynthetic pathways [47], suggesting that there exists a functional coupling between Pel outer membrane transport and periplasmic modification [48]. However, the effects of these rearrangement events on Pel production remain to be experimentally investigated.
We also observed a high degree of overall conservation among components that are known to play key roles in Pel biogenesis, such as the putative polysaccharide synthase (PelF), putative inner membrane protein (PelG), hydrolase/deacetylase (PelA) and cyclic-di-GMP receptor (PelD) [32]. In contrast, a greater degree of divergence can be seen among the inner (PelE) and outer membrane (PelB, PelC) transport-associated loci. In these loci, there appears to be a consistent pattern of clustering across bacterial phyla, suggesting co-evolution of potentially physically interacting components; however, no evidence of interaction has been shown to date.
Our genomic proximity network revealed two distinct clades in several Gram-positive species (Fig 5(v)). Of the synthase-dependent EPS operons known to date, only PNAG production has been genetically and structurally characterized in Gram-positive bacteria, namely staphylococci [49]. Operons reconstructed from initial HMM searches identified putative pel operons in several Gram-positive bacteria, comprised of the GT-encoding PelF and the PelG putative transport protein (Fig 5). To determine whether these were bona fide pel operons with additional loci, iterative HMM searches were performed including additional protein sequences from predicted pel operons. These searches revealed additional loci, including a homolog of PelD (S3 Fig). C-di-GMP signaling in Gram-positive bacteria is less well characterized [50], and this finding suggests a role for this second messenger in regulating biofilm formation in these species. In our companion paper, we have experimentally validated our predictions by showing that single gene deletions within the predicted B. cereus pel operon result in a loss of EPS production, and that PelD regulates EPS production through binding of c-di-GMP (Whitfield et al PLoS Pathogens, in press). This work provides the first and crucial piece of evidence suggesting that divergent Gram-positive pel operons, particularly those belonging to the same phylogenetic clade as B. cereus, likely possess the ability to produce a Pel-dependent biofilm. However, it is also possible that significantly divergent operon loci may not result in the production of EPS of identical composition to that characterized in Gram-negative bacteria, as in the case of PNAG modification between Gram-negative (pga) and Gram-positive (ica) operons [51].
Genomic-proximity networks of PNAG uncover locus loss and duplication events in pathogenic and environmental bacteria
To examine how locus duplication, loss, and rearrangement events have contributed to the evolution of PNAG operons across bacterial phyla, selected examples of pga operon clusters were identified and compared (S4 Fig). For example, within a group of enterobacteria possessing related pgaD loci, there exist a number of closely related pathogenic enterobacteria that have lost pgaA (E. coli ETEC H10407), as well as pgaB (Shigella flexneri 5 str. 8401). These losses suggest these taxa may no longer be able to produce PNAG (S4(i.a) Fig). Interestingly, we also observed a lack of pga operons among pathogenic Salmonella spp. genomes surveyed in this study. Previous work has shown that loss of PNAG production is associated with adaptation of Salmonella spp. to an intracellular pathogenic lifestyle [52]. Although PNAG production in E. coli H10407 has not been examined, our findings are also consistent with the adaptation of E. coli H10407 from a commensal to a pathogenic lifestyle [53], where enterotoxins and colonization factors serve a crucial role for attachment to the intestinal epithelium and enhanced toxicity [54,55]. Furthermore, it has been previously shown that biofilm formation in S. flexneri impairs invasive ability and virulence [56], which suggests that the loss of PNAG production in S. flexneri [57] is the result of adaptation to an intracellular mode of infection. These results shed light on the relationship between biofilm production capability and adaptation of enteric bacteria toward a pathogenic lifestyle.
Based on the divergence of pgaB loci, we also identified pga operon clades corresponding to partial and whole operon duplications in aquatic bacteria, including a partial duplication of the pga operon specific to the important pathogen Acinetobacter baumannii and a whole operon duplication in Methylotenera versatilis 301 (S4(ii) Fig). Also, in environmental bacteria we discovered a novel pga organization resulting from rearrangement of pgaC, and a lack of pgaB and pgaD loci, which may have been too divergent to detect from initial HMM searches (S4(iii) Fig). Although our HMM models only used Gram-negative pga operon protein sequences, we also identified a number of Gram-positive pga operons consisting of pgaB and pgaC (S4(i.b) and S4(i.c) Fig). Upon closer inspection these loci were found to correspond to the Staphylococcus polysaccharide intercellular adhesin (PIA) loci icaB and icaA, respectively. This suggests a potential common evolutionary origin of synthase-dependent PNAG production between Gram-positive and Gram-negative organisms.
A clade of pga operons was also identified possessing varying numbers of divergent pgaC loci resulting from repeated tandem duplication events (S4(v) Fig). Despite lacking a detectable pgaA locus, a possible role of these gene clusters in EPS production was investigated. One member of this operon clade, Thauera sp. MZ1T, inhabits a wide range of environments, and is an abundant producer of EPS responsible for viscous bulking in activated sludge wastewater treatment processes [58]. Furthermore, a recent mutagenesis study [59] demonstrated that biofilm-formation defective Thauera mutants could be rescued by the complementation of the predicted pgaB deacetylase locus identified in the present study. Together, these findings suggest that the divergence of deacetylase and duplication of GT related loci in PNAG biosynthesis have resulted in the emergence of a distinct operon lineage.
Genomic proximity networks of alginate uncover distinct operon clades in Pseudomonas spp. and atypical operon architectures in environmental bacteria
Although the majority of alginate operons were predicted largely among Pseudomonas spp. genomes (S5 Fig), phylogenetic clustering and genomic-proximity network reconstruction revealed an array of events influencing alginate operon evolution. For example, two distinct alginate operon clades were identified among Pseudomonas spp., defined by whole operon duplication and rearrangement of alginate polysaccharide modification loci (S5(i) and S5(ii) Fig). Also identified were divergent, “atypical”, alginate operons (S5(iii) Fig) comprising extensive rearrangements and also losses of functionally related subsets of alginate loci, e.g. outer membrane transport loci (algKE), and polysaccharide modification machinery (algGXLIJF). Closer examination of the alginate genomic-proximity network also indicated a greater number of clusters for alg44 and algX loci, which were reflective of increased divergence among distinct alginate operon clades. Given that both loci play related roles in the regulation, polymer-modification, and assembly of the alginate EPS secretion machinery [60], these results provide an avenue for future research toward elucidating how species may modify alginate production to adapt to diverse environmental niches.
Genomic proximity networks of acetylated cellulose operons reveal duplication of co-polymerase subunits and sequence homology of loci with alginate acetylation machinery
From the genome sequences surveyed, only four species were identified as possessing acetylated cellulose operons, comprising two distinct operon clusters with differing operon constitutions among three Pseudomonas spp. and Bordetella avium 197N (S6 Fig). Contrary to cellulose phylogenetic clusters, the polysaccharide synthase, wssB, was divided into distinct Gamma- and Betaproteobacterial clusters. We also found a distinct phylogenetic cluster identifying a unique tandem duplication of wssC in Bordetella avium 197N, which was not observed among orthologous cellulose bcsB co-polymerase loci (S6(ii) Fig). This observation might suggest a divergent mechanism of action of cellulose inner membrane transport. As we previously observed (Fig 1E), 3 out of 4 of the predicted acetylated cellulose operons were also found to co-occur with alginate operons. Additional HMM searches identified significant sequence similarity between acetylated cellulose wssBCDE operon sequences and those previously identified as bcsABZC, as well as between the acetylated cellulose acetylation machinery and its functional homologs in alginate operons (WssH–AlgI; WssI–AlgJ/AlgX). Taken together, these findings suggest that acetylated cellulose production has likely evolved through the duplication and operonic acquisition of the alginate acetylation machinery loci.
Sequence variability of phylogenetic clusters reveals different degrees of structural conservation of cellulose biosynthesis machinery
With the availability of a crystal structure for the BcsA-BcsB inner membrane complex responsible for cellulose biosynthesis and transport [61], we examined the potential structural and functional consequences of the sequence variability of the BcsA and BcsB phylogenetic clusters highlighted above (Fig 3). In brief, we generated multiple sequence alignments of eight BcsA and BcsB sequences summarizing the evolutionary diversity of cellulose operon clades identified in Fig 3. Residue conservation information from these alignments was subsequently mapped onto the structure of the BcsA-BcsB complex (PDB ID: 4HG6 [62]; S7 Fig). The results of the following analysis are also consistent when including all predicted BcsA and BcsB sequences. We identified a high degree of sequence conservation among BcsA sequences corresponding to the GT domain responsible for cellulose polymerization. Conserved residues mapped specifically to a cleft in the GT domain where a uridine diphosphate (UDP) carrier moiety is bound and oriented through a conserved QxxRW motif to enable polymerization of glucose monomers of the growing cellulose chain [61]. Conversely, the PilZ domain of BcsA, involved in regulation of the GT function in response to c-di-GMP levels, shows low conservation overall, except for the subset of residues required for c-di-GMP binding. Further, the periplasmic region of BcsB shows low sequence conservation overall, aside from a number of highly conserved residues in the carbohydrate binding and ferredoxin domains. These residues include one (L193 of the Rhodobacter sphaeroides ATCC 17025 reference sequence) representing a putative cellulose binding residue oriented in close proximity to the growing cellulose chain near the exit of the BcsA inner membrane translocation channel.
From phylogenetic sequence clustering, structurally relevant conservation features of the cellulose synthase complex were identified which should facilitate further investigation of cellulose EPS production across phylogenetically diverse species. For example, c-di-GMP binding residues of the PilZ domain of BcsA vary in conservation across phylogenetic clusters, which could impact the ability of the protein to bind the nucleotide. This may in turn limit access of activated glucose monomers to the GT domain, thus altering the rate of cellulose polymerization. Insertion/deletion events are also observed across BcsB phylogenetic clusters that may facilitate the recruitment of additional periplasmic processing proteins [63], or macromolecular assembly of the BcsA-BcsB complex [64]. This might result in differences in the higher-ordered structuring of cellulose microfibres as a consequence of adaptation to diverse environmental niches. These results demonstrate how the application of our phylogenetic clustering methodology can be extended to provide biologically informative insights into the function of components of EPS secretion machineries.
Phylogenetic clustering elucidates the structural and functional divergence of the pgaB locus, revealing the evolution of PNAG production across Gram-negative and Gram-positive bacteria
PNAG production is found across phylogenetically diverse species and is carried out by the pgaABCD operon of Gram-negative bacteria [39] and the icaADBC operon of Gram-positive bacteria [65]. Although the functional and immunological properties of PNAG produced by the pga and ica loci appear to be similar [66], there are important differences between the roles of pga and ica operon loci [67]. Common to both operons is the presence of an integral membrane GT locus, pgaC and icaA, which are both members of the GT-2 family and share sequence homology [67]. In addition, non-homologous loci encoding integral membrane proteins, pgaD and icaD, are also present and required for the full function of their respective GTs [66,68]. In Gram-negative bacteria, PNAG production is regulated through physical interactions between PgaD and PgaC which are stabilized by the allosteric binding of c-di-GMP [56]. In Staphylococci, PNAG production does not depend on c-di-GMP and is likely regulated by an alternate signaling pathway [69]. Deacetylation of PNAG is carried out by pgaB and icaB loci and has been shown to play a crucial role in biofilm formation and immune evasion [51,70]. pgaB also possesses an additional C-terminal glycoside hydrolase domain which cleaves the PNAG polymer following its partial deacetylation [71], although the mechanism by which these activities are coordinated and the biological role of the hydrolase activity are unknown. Unique to pga operons is a locus encoding an outer membrane export pore, pgaA [72] and, in ica operons, an additional integral membrane protein, icaC, which has been proposed to be involved in PNAG O-succinylation [67]. Using Gram-negative pga loci as seed sequences for the reconstruction of synthase-dependent PNAG operons, we were also able to identify Gram-positive ica operons based on significant sequence similarities to pgaB and pgaC loci.
Our phylogenetic clustering approach also revealed that pgaC/icaA sequences clustered into a single clade, while pgaB/icaB were associated with distinct sequence clades (S4 Fig). To explore the evolution of Gram-negative and Gram-positive pga and ica operons, we generated multiple sequence alignments for representative sequences of 18 PgaB clades. Our phylogenetic clustering results confirm previous observations [67] that the glycoside hydrolase domain is exclusively associated with Gram-negative pga operons (PgaB_G1) and is absent in a clade of Staphylococcus Gram-positive ica sequences (PgaB_G3; S8A Fig). We also identified additional Gram-positive icaB clades among non-Staphylococcus spp., e.g. Bacillus and Lactococcus (S1 Table), which possess operons lacking the icaC locus [67]. Interestingly, we also identified a number of divergent Gram-negative pgaB clades resembling icaB clade sequences. Members of these clades lacked the canonical C-terminal glycoside hydrolase domain, and were distinguished by possessing N-terminal fusions, primarily of GT domains. Furthermore, these pgaB clades are associated with operons lacking a detectable pgaA outer membrane pore locus and pgaD (S4(v) Fig). Although PNAG production in these species has not been experimentally confirmed, these findings suggest that if the polymer is produced, it may be regulated through a novel mechanism, that glycoside hydrolase activity might not be essential for PNAG export across all Gram-negative species, and that other modes of export may exist. The N-terminal fusion of GT with PgaB de-acetylase domains would also suggest that the de-acetylase activity of PgaB in these organisms may be associated with the periplasmic face of the inner membrane, in contrast to dual domain PgaB clades where the protein is predicted to function at the periplasmic face of the outer membrane [72].
In addition to these novel domain fusion events, PgaB phylogenetic clustering enabled us to resolve distinct events affecting the evolution of the deacetylase domain across different operon clades. Using the E. coli K12 MG1655 sequence of the largest PgaB clade (PgaB_G1) as a reference, multiple sequence alignments against other representative PgaB clade sequences identified several regions of insertion/deletion events (S8A Fig). When these regions were mapped to the published crystal structure of PgaB (PDB ID: 4F9D [73]), they were found to correspond to distinct structural elements surrounding the conserved deacetylase core (S8B and S8C Fig). We assigned insertion/deletion regions a number according to their order of appearance in the multiple sequence alignment of PgaB deacetylase domains, and divided them into two categories (S8D Fig). The first two indel regions, 1 and 2, resided in the N-terminal region of the reference E. coli sequence, and corresponded to beta-strands flanking the conserved active site residues involved in deacetylation, His55, Asp114, and Asp115. Region 1 was associated with Gram-positive icaB and comprised insertions of ~10 aa in Staphylococcus aureus VC40 (PgaB_G3), as well as Bacillus infantis NRL B-14911 (PgaB_G7), Lactobacillus plantarum 16 (PgaB_G9), and Leptospirillum ferriphilum ML-04 (PgaB_G11). Structural characterization of Ammonifex degensii IcaB (PgaB_G3) identified residues overlapping with Region 1 as encoding a hydrophobic loop responsible for membrane localization in this species [74]. Region 2 was found to be exclusive to Gram-negative pgaB loci and comprised a much larger insert of ~77 aa in Geobacter metallireducens GS-15 (PgaB_G2), Crinalium epipsammum PCC 9333 (PgaB_G5), and Colwellia psychrerythraea 34H (PgaB_G6). The functional role of this insertion is unknown.
The last three insertion/deletion regions, 3–5, occurred in a region oriented away from the deacetylase active site, and correspond to two beta-turn motifs and an alpha-helix cap, respectively. To further elucidate the biological importance of the identified PgaB indel regions, we examined regions 3 and 5 in the context of Gram-negative PNAG modification. In the E. coli K12 MG1655 PgaB_G1 sequence, region 3 encompasses a beta-turn with an elongated loop, which is spatially proximal to a disordered loop and alpha helix (pos. 367–392) on the N-terminal region of the PgaB glycoside hydrolase domain. Region 3 also encodes a histidine (E. coli PgaB H189) which is part of the nickel binding pocket of Gram-negative PgaB deacetylases. Both regions contain polar and electrostatically charged residues which are highly conserved across PgaB_G1 sequences (S8E Fig). Region 5 corresponds to an 8 amino acid elongation of an alpha-helix (pos. 219–226), which also appears to provide an additional point of contact between the deacetylase and hydrolase domains. Although region 5 is also shared with icaB associated sequences (PgaB_G3), region 3 appears only in other dual deacetylase-hydrolase Gram-positive pgaB sequences identified in the sporulating bacteria Lachnoclostridium phytofermentans ISDg and Kitasatospora setae KM-5043. Although initial PFAM searches failed to identify the additional Gram-positive C-terminal domains, subsequent BLAST searches revealed them to be homologous to glycoside hydrolases. In region 4 a unique 29 amino acid insertion was also identified in Lachnoclostridium phytofermentans ISDg (PgaB_G16), which may play a compensatory role for the absence of 9 aa in region 3. These insertion regions suggest an overall functional importance in ensuring stability between each domain and could play a role in coordinating their activities.
These findings, in combination with our identification of ica-like operon organizations among environmental Gram-negative species (S4(v) Fig), suggest that Gram-negative pga operons may share a common evolutionary origin with Gram-positive ica operons. Recent research is providing growing evidence for the emergence of the diderm Gram-negative architecture from sporulating monoderm Gram-positives [75], which provides a plausible evolutionary context for the insertion/deletion events observed among pgaB/icaB deacetylase domains. Through the loss of inner membrane localization [74] (Region 1), the compensatory gain of an N-terminal palmitoylation site [66], along with a C-terminal fusion of a hydrolase domain (Regions 3–5), an ancestral deacetylase locus may have been adapted to regulate the export of PNAG [66] at the outer membrane in the Gram-negative pga operon lineage.
Discussion
In this work we describe a novel and generalizable approach for the systematic classification and presentation of bacterial protein families in the context of their host operon. Protein families are defined as sets of homologs (groups of related sequences having a common evolutionary ancestor) sharing a particular set of sequence motifs or structural domains that can be utilized to determine their biological roles. For example, the PFAM database utilizes curated sets of protein family sequences in the generation of profile HMMs [76]. A key challenge that complicates the definition of these relationships is the occurrence of evolutionary events such as duplication, gene fusion, and HGT. In attempts to account for such events, a variety of computational approaches have been developed for refining functional assignments. These operate either by graphical clustering of pair-wise protein sequence similarities (e.g. COG [27], OrthoMCL [19] and EggNOG [20]), or through the generation of hierarchical evolutionary relationships and construction of phylogenetic trees (e.g. TreeFAM [77] and TreeCL [78]). However, these methods are limited in their ability to provide further resolution of sequence diversity within a family that might otherwise offer additional insights into evolutionary events that allow taxa to adapt to specific environments.
Agnostic approaches to define sub-clusters of evolutionarily related protein families have ranged from phylogenetic tree reconstructions [79] to hierarchical clustering of pairwise global sequence alignments [80]. Here we present an extension of previous efforts and introduce a novel systematic approach for defining protein sub-family relationships through the clustering of phylogenetic trees. Key to this approach is defining a scoring function that allows a phylogenetic tree to be resolved into optimal clusters that best capture the similarities between cluster members, as well as the dissimilarities between clusters. Combining two clustering quality metrics (Silhouette and Dunn index) and proportion of sequences clustered, we demonstrate that our approach classifies a diverse array of operon-associated protein families into taxonomically consistent and functionally informative sub-clusters. Genomic-proximity networks were also constructed to provide an intuitive means of utilizing phylogenetic clusters to examine diverse mechanisms of operon evolution across taxonomically diverse bacterial genomes. Genomic-proximity networks have previously been utilized for inferring functional relationships [81], understanding mechanisms underlying bacterial genomic organization into functionally related gene clusters [82], and transcriptional regulation of bacterial operons [83]. In this study we extend the application of genomic-proximity networks as a tool for the systematic exploration of operon evolution resulting from locus divergence, loss, duplication, and rearrangement events.
To demonstrate the effectiveness of our approach, we applied our methods to classify the synthase-dependent bacterial EPS operon machineries for 5 different polymers: cellulose, acetylated cellulose, alginate, Pel and PNAG. There has been only one previous attempt to classify synthase-dependent EPS operons and this focused specifically on the cellulose system [36]. In that study, cellulose operons were categorized into four major types, based on the presence or absence of experimentally validated accessory loci involved in cellulose production. Here, we based our analysis on the four core operon loci, bcsABZC, deemed essential for cellulose production. Cellulose operon clades identified in this study showed little consistency with the previously defined four major cellulose operon types [36], suggesting that the conservation of accessory loci is more variable across bacterial species compared to loci encoding core EPS functionalities. However, one operon type was identified in this analysis, representing the loss of the BcsC outer membrane transporter identified among a subset of alpha-proteobacterial genomes, which include several known cellulose producing species [62,84] suggesting a novel mechanism of cellulose export (Fig 3(iii)) [36]. We also found that the loss of BcsC has resulted in an increased divergence of BcsB loci in these genomes, which highlights the key role of BcsB as an intermediary between cellulose biogenesis and periplasmic transport (S7 Fig).
In general, inner membrane components involved in EPS polymerization were found to be relatively conserved across all systems examined, while periplasmic and outer membrane components showed a relatively greater degree of divergence, which is likely to have important functional implications. For example, in the cellulose and Pel operon networks (Figs 3 and 5 and S5 Table), rearrangement events involving the periplasmic glycosyl hydrolase (BcsZ) and glycosyl hydrolase/deacetylase (PelA) were found to be a defining feature of several operon clades. It is interesting to note that these rearrangements have resulted in a change in the ordering of bcsZ and pelA relative to their respective outer membrane transport pore loci, which highlights the important role of polysaccharide modification in both the biogenesis and regulation of extracellular EPS transport [48,85,86]. Similarly, the rearrangement of alginate modification machinery loci (algIJF) was observed as a distinguishing feature of Pseudomonas spp. operon clades. These findings suggest that rearrangement and locus ordering may serve as an important means of regulating EPS production by modifying the timing of translation of modification enzymes, which could affect the assembly of EPS complexes or the structural properties of EPS produced [47,64,87].
Furthermore, identifying operon clades through a phylogenetic approach elucidated numerous instances of cellulose whole operon duplications arising from HGT of two evolutionarily distinct operon clades (Fig 4). Such large-scale duplications, if they are functional, may either serve as a dosage response to given environmental stressors, as observed in the duplication of bacterial multiple-drug transporter operons [88], or could be under the regulation of different environmental stimuli. Interestingly, representative species of the two cellulose operon lineages identified in HGT events, e.g. the plant and human pathogens, D. dadantii and S. enterica, respectively, are known to produce structurally distinct forms of cellulose with different properties and roles in pathogenesis [89,90]. In addition, we found that BcsB divergence accompanied the rearrangement or horizontal transfer of these operons, which further suggests that BcsB may play a key role in the fine-tuning of cellulose production by coordinating the export of growing cellulose polymers through the periplasm. Furthermore, our analyses of acetylated cellulose, alginate and PNAG operons suggest a dynamic evolutionary scenario for the evolution of EPS biofilm production through the acquisition of novel polysaccharide modification loci. The limited number of acetylated cellulose operons identified, their frequent co-occurrence in alginate possessing species, and significant sequence similarities between acetylation machinery loci suggest that the cellulose acetylation machinery is likely to have originated from previously existing alginate operons in Pseudomonas spp. The divergence of Gram-positive and Gram-negative PNAG operon lineages appears to have resulted from the fusion of glycosyl hydrolase and deacetylase domains in Gram-negative pgaB loci.
A further key finding from this study was the identification of homologous pel operons in the genomes of several Gram-positive bacteria. With the additional identification of homologs of PelD through iterative HMM searches, our analyses have uncovered a novel example of c-di-GMP regulation of biofilm machinery in Gram-positive bacteria. In the accompanying paper we experimentally validate that a predicted pel-like operon in B. cereus ATCC 10987 is responsible for biofilm production and is regulated by the binding of c-di-GMP to PelD (Whitfield et al PLoS Pathogens, in press).
Together, this work demonstrates a novel integrative approach combining phylogenomics and genomic-context approaches to systematically explore the adaptive implications of sequence divergence of protein families associated with operon-encoded EPS secretion machineries. Further extension of this work holds great potential as a general approach for elucidating how bacterial operon-encoded biological pathways and complexes have contributed to bacterial adaptation to and survival in diverse environmental niches and lifestyles.
Methods
Sources of data
Sequences corresponding to experimentally characterized EPS operon loci were obtained from the National Center for Biotechnology Information (NCBI) reference sequence database [91] (S3 Table). Fully sequenced genomes and associated protein sequences were obtained for 1758 bacteria from the NCBI (retrieved April 20th 2015) (S7 Table). In this set of genomes the major phyla represented are largely Gram-negative Proteobacteria (47%), followed by Gram-positive Firmicutes (~20%) and Actinobacteria (~10%). For each bacterial strain predicted to possess an EPS operon, metadata corresponding to niche (host-associated or environmental) and lifestyle (pathogenic or non-pathogenic) were collated from literature searches (S8 Table; also available for download from https://github.com/ParkinsonLab/eps_biofilms).
Prediction of EPS operons
To identify putative EPS operons, we applied an iterative HMM-based sequence similarity profiling strategy. For each set of EPS loci, we first constructed an HMM from previously characterized EPS producing bacteria (S3 Table); alignments were constructed using MUSCLE v.3.8.1551 [92], with default settings, from which HMMs were built using HMMER v.3.1b2 [93], with default settings. As the number of characterized EPS producing species varies greatly by system, each HMM was then used to identify additional EPS loci within the set of 1733 bacterial genomes and build an expanded HMM for downstream operon prediction. First, a protein-BLAST search of reference sequence EPS protein coding sequences was performed against the set of reference+representative completely sequenced bacterial genomes downloaded from NCBI (E-value threshold of ≤ 1e-5). Next, these putative EPS sequences were subjected to a second round of all-vs-all protein-BLAST searches. These results were then processed using an in-house Perl script to select non-redundant sequences (percentage identity of < 97%) in an incremental fashion: starting with the reference EPS sequence used in the first step, its highest scoring non-redundant match was selected and subsequently used to identify the next non-redundant sequence with the highest scoring match, and this was repeated until a pre-defined number of sequences had been selected. To capture a consistent degree of locus sequence diversity and HMM sensitivity for each EPS system, we selected 20 sequences to represent each locus, with the expectation that this would provide an adequate lower bound on sampling potential amino acid substitutions for sites undergoing random mutation, i.e. not under functional constraints. Using these new sets of HMMs, sets of EPS loci for the reconstruction of EPS operons (see below) were predicted through sequence similarity searches of the 1733 genomes using HMMER, with default settings.
Significant sequence matches were defined as those with E-values ≤ 1e-5. HMMs as well as the table detailing lifestyles and environmental niches of bacteria used in this study (S8 Table) have been made available online (https://github.com/ParkinsonLab/eps_biofilms).
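The incremental selection of non-redundant sequences described above can be illustrated with a minimal Python sketch. This is not the in-house Perl script used in the study: the function names, the nested-dictionary layout of the BLAST results, and the decision to test each candidate against every sequence already selected are assumptions made for illustration, and the scores and identities in the example are invented.

```python
def pair_identity(hits, a, b):
    """Percent identity between a and b; 0.0 if BLAST reported no hit (assumption)."""
    if b in hits.get(a, {}):
        return hits[a][b][1]
    if a in hits.get(b, {}):
        return hits[b][a][1]
    return 0.0


def select_representatives(reference, hits, n_select=20, max_identity=97.0):
    """Greedy chain selection starting from the reference sequence.

    hits[query][subject] = (bit_score, percent_identity) from all-vs-all BLAST.
    A candidate is treated as non-redundant when its identity to every
    sequence already selected is below max_identity (assumed interpretation).
    """
    selected = [reference]
    current = reference
    while len(selected) < n_select:
        candidates = [
            (score, subject)
            for subject, (score, _identity) in hits.get(current, {}).items()
            if subject not in selected
            and all(pair_identity(hits, subject, s) < max_identity for s in selected)
        ]
        if not candidates:
            break  # no further non-redundant matches for the current sequence
        _, best = max(candidates)  # highest-scoring non-redundant match
        selected.append(best)
        current = best  # the new sequence seeds the next step
    return selected


# Invented example: "a" is a near-duplicate of the reference and is skipped.
example_hits = {
    "ref": {"a": (500, 99.0), "b": (400, 80.0), "c": (300, 60.0)},
    "b": {"ref": (400, 80.0), "a": (410, 81.0), "c": (350, 96.0), "d": (200, 50.0)},
    "c": {"ref": (300, 60.0), "b": (350, 96.0), "d": (250, 55.0)},
}
chosen = select_representatives("ref", example_hits, n_select=3)
```

In this sketch the chain stops early if the current sequence has no remaining non-redundant matches, so fewer than `n_select` sequences may be returned.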
To reconstruct putative EPS operons from the sets of loci retrieved from our searches, we first retrieved start and stop positions for each locus from its RefSeq entry. We then defined putative operons using the following three rules: first, only loci that occur within a distance of twice the size of a reference EPS operon to other loci are considered; second, intergenic distances of individual loci must be ≤ 5 kb; third, putative operons must consist of at least one locus encoding a putative polysaccharide synthase, together with at least one other locus. To detect previously undiscovered loci that may have been missed in the first rounds of HMM searches, predicted loci of reconstructed operons were used to generate expanded locus-specific HMMs and were subjected to an additional round of HMM searches. This process was performed using custom Perl scripts and results in a list of predicted EPS operons identified in each of the 1733 genomes.
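As an illustration of these reconstruction rules, the following Python sketch groups loci into candidate operons. It is a simplified stand-in for the custom Perl scripts, not a reproduction of them. The `Locus` record, the function name, and the reading of the first rule as a cap on the total span of a candidate operon are assumptions; the coordinates and the reference operon length in the example are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    name: str      # locus assignment from the HMM search, e.g. "pgaC"
    replicon: str  # chromosome or plasmid identifier
    start: int
    stop: int


def reconstruct_operons(loci, synthases, ref_operon_len, max_gap=5000):
    """Chain neighbouring loci into putative operons.

    Assumed interpretation of the three rules:
      1. the operon span may not exceed twice the reference operon length;
      2. consecutive loci are at most max_gap bp apart;
      3. an operon needs a synthase locus plus at least one other locus.
    """
    groups, current = [], []
    for locus in sorted(loci, key=lambda l: (l.replicon, l.start)):
        if current:
            same_replicon = locus.replicon == current[-1].replicon
            gap = locus.start - current[-1].stop
            span = locus.stop - current[0].start
            if same_replicon and gap <= max_gap and span <= 2 * ref_operon_len:
                current.append(locus)
                continue
            groups.append(current)
        current = [locus]
    if current:
        groups.append(current)
    return [
        g for g in groups
        if len(g) >= 2 and any(l.name in synthases for l in g)
    ]


# Invented coordinates: one four-gene cluster and one isolated synthase hit.
example_loci = [
    Locus("pgaA", "chr1", 1000, 3400),
    Locus("pgaB", "chr1", 3450, 5450),
    Locus("pgaC", "chr1", 5500, 6800),
    Locus("pgaD", "chr1", 6810, 7200),
    Locus("pgaC", "chr1", 500000, 501300),
]
operons = reconstruct_operons(example_loci, synthases={"pgaC"}, ref_operon_len=6500)
operon_names = [[l.name for l in op] for op in operons]
```

The isolated synthase hit is discarded under the third rule because it has no neighbouring locus.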
Classification of evolutionary events
For each EPS system (cellulose, acetylated cellulose, PNAG, Pel, and alginate), the locus assignments of each reconstructed operon were compared to a defined reference EPS operon composition and locus ordering (S5 Table) and classified into the following evolutionary events: 1) locus losses: the total number of reference loci missing or not detected by HMM searches; 2) locus duplications: the number of distinct loci appearing as multiple significant hits to the same HMM < 10 kb apart; 3) locus fusions: the number of loci that were significant hits to two or more reference EPS locus HMMs; 4) operon rearrangements: the number of predicted operons with locus ordering (accounting for transcriptional direction) different from the reference operon; 5) operon duplications: the number of predicted operons (as defined above) present in the same genome ≥ 10 kb apart.
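A simplified Python sketch of how a single predicted operon might be compared with its reference is given below. It covers only losses, duplications, and rearrangements; fusions and whole-operon duplications are omitted. The function name and input layout are hypothetical. The sketch also assumes that the input loci already lie within one predicted operon (so the 10 kb criterion is not checked), that locus names are listed in the direction of transcription, and that ordering is compared only over loci shared with the reference.

```python
from collections import Counter


def classify_operon(predicted, reference):
    """Compare one predicted operon against the reference locus order.

    predicted: locus names in genomic order along the direction of transcription.
    reference: locus names in the reference operon order.
    """
    counts = Counter(predicted)
    losses = [name for name in reference if counts[name] == 0]
    duplications = [name for name in reference if counts[name] > 1]

    # Order of first occurrence of each reference locus in the prediction.
    observed_order = []
    for name in predicted:
        if name in reference and name not in observed_order:
            observed_order.append(name)
    expected_order = [name for name in reference if name in observed_order]

    return {
        "losses": losses,
        "duplications": duplications,
        "rearranged": observed_order != expected_order,
    }


reference_pga = ["pgaA", "pgaB", "pgaC", "pgaD"]
tandem = classify_operon(["pgaB", "pgaC", "pgaC", "pgaC"], reference_pga)
shuffled = classify_operon(["pgaC", "pgaA", "pgaB", "pgaD"], reference_pga)
```

Here `tandem` mimics an operon with repeated pgaC copies and no detectable pgaA or pgaD, while `shuffled` retains all four loci in a different order.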
Classification of EPS loci
Systematic classification of each EPS operon family starts with first merging closely related sequences using CD-HIT v.4.6.3 [94] with default settings (using global sequence identity threshold 0.9; word length 5) to generate a non-redundant set of sequences for each family. Multiple sequence alignments (MSAs) were then generated using MUSCLE and trimmed using trimal v.1.2rev59 [95] (using -automated1 setting). The resulting alignment was then used to construct a consensus phylogenetic tree using PhyML v.3 [96], with default parameters (LG substitution model, with 1000 bootstrap replicates). For each consensus tree, pairwise evolutionary distances (defined as the number of expected average number of amino-acid substitutions per site) for all locus protein sequences were extracted using a custom perl script. These evolutionary distances were subsequently used to iteratively generate sets of clusters, with proteins sharing an evolutionary distance less than a defined cutoff (starting at 0 and incrementing by 0.01) placed in the same cluster. This results in the generation of increasingly coarse clusters of sequences with increasing sequence dissimilarity, such that in the final step all sequences are assigned to a single cluster. At this stage, for all possible clusterings three metrics are calculated and summed together to calculate a clustering quality score: (1) proportion of sequences clustered (p) number of sequences clustered / total number of sequences); (2) the average silhouette score (s_avg) [97]:
For each sequence, i, its silhouette score, s(i), is defined as:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where a(i) is the average evolutionary distance (expected number of substitutions per site) of i to all other sequences in its own cluster, and b(i) is the lowest average evolutionary distance of i to any other cluster of which i is not a member; and (3) the Dunn index (DI) [98]. For a set of m clusters, the Dunn index, DI, is defined as:

$$DI = \frac{\min_{1 \le i < j \le m} \delta(C_i, C_j)}{\max_{1 \le c \le m} \Delta_c}$$

where δ(C_i, C_j) is the evolutionary distance between clusters i and j and Δ_c is the size of cluster c. Note that a higher s(i) indicates that a sequence is well matched to other members of its cluster and not well matched to neighbouring clusters. Furthermore, a higher DI indicates clusters that are compact (smaller cluster sizes) and well differentiated (larger inter-cluster distances). Thus, the evolutionary distance cutoff that maximizes p + s_avg + DI is chosen as the optimal phylogenetic clustering for a given set of EPS locus sequences.
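The cutoff sweep and quality score can be sketched in Python as follows. This is an illustrative reading of the procedure rather than the authors' code, and it rests on several assumptions that the text does not specify: sequences are grouped by single linkage (connected components of pairs closer than the cutoff); "sequences clustered" means sequences in clusters of two or more members; singletons receive a silhouette score of 0; inter-cluster distance is the minimum pairwise distance and cluster size is the maximum within-cluster pairwise distance; and the silhouette and Dunn terms are set to 0 when they are undefined. The function names and the `dist` dictionary layout are hypothetical.

```python
from itertools import combinations


def d(dist, x, y):
    """Look up the pairwise evolutionary distance between x and y."""
    return dist[frozenset((x, y))]


def cluster_at_cutoff(names, dist, cutoff):
    """Single-linkage grouping: link every pair with distance < cutoff."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for x, y in combinations(names, 2):
        if d(dist, x, y) < cutoff:
            parent[find(x)] = find(y)
    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return [sorted(g) for g in groups.values()]


def quality(clusters, dist):
    """Return p + s_avg + DI for one clustering."""
    total = sum(len(c) for c in clusters)
    if total == 0:
        return 0.0
    p = sum(len(c) for c in clusters if len(c) > 1) / total
    if len(clusters) < 2:
        return p  # silhouette and Dunn index undefined: treated as 0
    sil = []
    for c in clusters:
        for i in c:
            if len(c) == 1:
                sil.append(0.0)  # singleton convention
                continue
            a = sum(d(dist, i, j) for j in c if j != i) / (len(c) - 1)
            b = min(sum(d(dist, i, j) for j in o) / len(o)
                    for o in clusters if o is not c)
            sil.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    s_avg = sum(sil) / total
    size = max((d(dist, x, y) for c in clusters
                for x, y in combinations(c, 2)), default=0.0)
    sep = min(d(dist, x, y) for c1, c2 in combinations(clusters, 2)
              for x in c1 for y in c2)
    dunn = sep / size if size > 0 else 0.0
    return p + s_avg + dunn


def best_cutoff(names, dist, step=0.01):
    """Sweep cutoffs 0, step, 2*step, ... until one cluster remains."""
    best, k = None, 0
    while True:
        cutoff = k * step
        clusters = cluster_at_cutoff(names, dist, cutoff)
        score = quality(clusters, dist)
        if best is None or score > best[0]:
            best = (score, cutoff, clusters)
        if len(clusters) <= 1:
            return best
        k += 1
```

Because ties keep the first cutoff encountered, the sketch reports the smallest cutoff that attains the maximum score. In this sketch the Dunn index is unbounded above, whereas p lies in [0, 1] and s_avg in [-1, 1].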
In these analyses we have chosen not to incorporate bootstrap support parameters. Bootstrap values have previously been used in phylogenetic-based clustering approaches (particularly for epidemiological studies of pathogen transmission and evolution [99,100]) and provide an important metric in assessing the reliability of phylogenetic tree topologies [101]. In contrast, we define clusters in a topology-agnostic manner by clustering sequences based only on pairwise evolutionary distances. This provides a readily automated approach for defining evolutionary relationships, particularly for sets of sequences that exhibit diverse phylogenetic distributions that are subject to varying evolutionary selection pressures and feature a variety of sequence lengths.
Construction of EPS operon genomic-proximity networks
To visualize the evolutionary and genomic-organization relationships of predicted EPS operons, genomic-proximity networks were generated in which each node represents an individual EPS locus cluster (as defined above), and an edge connecting a pair of nodes represents the average genomic distance (in base pairs) between the loci of the two clusters when they occur in the same genome. Further, nodes are represented as pie charts indicating the phylogenetic distribution of each EPS locus, as defined by the NCBI taxonomic classification scheme. Networks were visualized using Cytoscape (version 3.5) [102].
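The node and edge tables behind such a network could be assembled along the following lines. This Python sketch is illustrative only: the input tuple layout, the function name `build_proximity_network`, and the use of a single coordinate per locus (so that distance is the absolute coordinate difference, ignoring genome circularity and multiple replicons) are assumptions, not details given in the text.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_proximity_network(loci):
    """Build node and edge attributes for a genomic-proximity network.

    loci: iterable of (genome, phylum, cluster, position_bp) tuples,
          one per EPS locus (hypothetical input layout).
    Returns (node_taxa, edges) where node_taxa maps each cluster to a
    Counter of phyla (pie-chart data) and edges maps a sorted cluster
    pair to the average distance in bp across genomes containing both.
    """
    by_genome = defaultdict(list)
    node_taxa = defaultdict(Counter)
    for genome, phylum, cluster, pos in loci:
        by_genome[genome].append((cluster, pos))
        node_taxa[cluster][phylum] += 1

    totals = defaultdict(lambda: [0, 0])  # edge -> [summed distance, count]
    for members in by_genome.values():
        for (c1, p1), (c2, p2) in combinations(members, 2):
            if c1 == c2:
                continue  # skip pairs within the same cluster
            edge = tuple(sorted((c1, c2)))
            totals[edge][0] += abs(p1 - p2)
            totals[edge][1] += 1

    edges = {edge: total / n for edge, (total, n) in totals.items()}
    return dict(node_taxa), edges


# Illustrative input: two genomes sharing the same two locus clusters.
loci = [
    ("genome1", "Proteobacteria", "clusterA_1", 1000),
    ("genome1", "Proteobacteria", "clusterB_1", 4000),
    ("genome2", "Firmicutes", "clusterA_1", 500),
    ("genome2", "Firmicutes", "clusterB_1", 2500),
]
nodes, edges = build_proximity_network(loci)
print(edges)
```

The resulting dictionaries can be written out as delimited node and edge tables for import into a network viewer.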
Supporting information
Data Availability
All relevant data are within the manuscript and its Supporting Information files.
Funding Statement
JP and CB-T were supported by grants from the Natural Sciences and Engineering Research Council (NSERC; RGPIN-2014-06664 & RGPIN-2019-06852; http://www.nserc-crsng.gc.ca/) and the National Institutes of Health (NIH; R21AI126466; https://www.nih.gov/). This work was also supported in part by grants from the Canadian Institutes of Health Research (CIHR; http://www.cihr-irsc.gc.ca/) (MOP 43998 and FDN154327 to PLH). PLH is a recipient of a Canada Research Chair (http://www.chairs-chaires.gc.ca/home-accueil-eng.aspx). GBW and LSM have been supported by graduate scholarships from NSERC. GBW has been supported by a graduate scholarship from Cystic Fibrosis Canada (https://www.cysticfibrosis.ca/). LSM has been supported by graduate scholarships from the Ontario Graduate Scholarship Program (https://osap.gov.on.ca/OSAPPortal/en/A-ZListofAid/PRDR019245.html), and The Hospital for Sick Children Foundation Student Scholarship Program (http://www.sickkids.ca/Research/StudentandFellowResources/RTC/Training-Programs/restracomp/index.html). Computing resources were provided by the SciNet HPC Consortium; SciNet is funded by: the Canada Foundation for Innovation (https://www.innovation.ca/) under the auspices of Compute Canada (https://www.computecanada.ca/); Ontario Research Fund–Research Excellence (https://www.ontario.ca/page/ontario-research-fund) and the University of Toronto (https://www.utoronto.ca/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D. Detecting protein function and protein-protein interactions from genome sequences. Science. 1999. July 30;285(5428):751–3. 10.1126/science.285.5428.751
- 2. Mao X, Ma Q, Zhou C, Chen X, Zhang H, Yang J, et al. DOOR 2.0: presenting operons and their functions through dynamic and integrated views. Nucleic Acids Res. 2014. January;42(D1):D654–9.
- 3. Ermolaeva MD, White O, Salzberg SL. Prediction of operons in microbial genomes. Nucleic Acids Res. 2001. March 1;29(5):1216–21. 10.1093/nar/29.5.1216
- 4. Korbel JO, Jensen LJ, von Mering C, Bork P. Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs. Nat Biotechnol. 2004. July 1;22(7):911–7. 10.1038/nbt988
- 5. Moreno-Hagelsieb G. Operons Across Prokaryotes: Genomic Analyses and Predictions 300+ Genomes Later. Curr Genomics. 2006. May 1;7(3):163–70.
- 6. Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N. The use of gene clusters to infer functional coupling. Proc Natl Acad Sci U S A. 1999. March 16;96(6):2896–901. 10.1073/pnas.96.6.2896
- 7. Serres MH, Kerr ARW, McCormack TJ, Riley M. Evolution by leaps: gene duplication in bacteria. Biol Direct. 2009. November 23;4(1):46.
- 8. Wapinski I, Pfeffer A, Friedman N, Regev A. Natural history and evolutionary principles of gene duplication in fungi. Nature. 2007. September 6;449(7158):54–61. 10.1038/nature06107
- 9. Ling X, He X, Xin D. Detecting gene clusters under evolutionary constraint in a large number of genomes. Bioinformatics. 2009. March 1;25(5):571–7. 10.1093/bioinformatics/btp027
- 10. Ream DC, Bankapur AR, Friedberg I. An event-driven approach for studying gene block evolution in bacteria. Bioinformatics. 2015. July 1;31(13):2075–83. 10.1093/bioinformatics/btv128
- 11. Saurin W, Hofnung M, Dassa E. Getting in or out: early segregation between importers and exporters in the evolution of ATP-binding cassette (ABC) transporters. J Mol Evol. 1999. January;48(1):22–41. 10.1007/pl00006442
- 12. Ranea JAG, Buchan DWA, Thornton JM, Orengo CA. Evolution of protein superfamilies and bacterial genome size. J Mol Biol. 2004. February 27;336(4):871–87. 10.1016/j.jmb.2003.12.044
- 13. Bratlie MS, Johansen J, Sherman BT, Huang DW, Lempicki RA, Drabløs F. Gene duplications in prokaryotes can be associated with environmental adaptation. BMC Genomics. 2010. October 20;11(1):588.
- 14. Pereira SB, Mota R, Santos CL, De Philippis R, Tamagnini P. Assembly and export of extracellular polymeric substances (EPS) in cyanobacteria. A phylogenomic approach In: Advances in Botanical Research. Academic Press Inc; 2013. p. 235–79.
- 15. Cuthbertson L, Mainprize IL, Naismith JH, Whitfield C. Pivotal Roles of the Outer Membrane Polysaccharide Export and Polysaccharide Copolymerase Protein Families in Export of Extracellular Polysaccharides in Gram-Negative Bacteria. Microbiol Mol Biol Rev. 2009. March 1;73(1):155–77. 10.1128/MMBR.00024-08
- 16. Sun S, Berg OG, Roth JR, Andersson DI. Contribution of gene amplification to evolution of increased antibiotic resistance in Salmonella typhimurium. Genetics. 2009. August;182(4):1183–95. 10.1534/genetics.109.103028
- 17. Eisen JA. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 1998. March;8(3):163–7. 10.1101/gr.8.3.163
- 18. Zmasek CM, Eddy SR. A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics. 2001. September;17(9):821–8. 10.1093/bioinformatics/17.9.821
- 19. Chen F, Mackey AJ, Stoeckert CJ, Roos DS. OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res. 2006. January 1;34(Database issue):D363–8. 10.1093/nar/gkj123
- 20. Huerta-Cepas J, Szklarczyk D, Forslund K, Cook H, Heller D, Walter MC, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 2016. January 4;44(D1):D286–93. 10.1093/nar/gkv1248
- 21. Lazareva-Ulitsky B, Diemer K, Thomas PD. On the quality of tree-based protein classification. Bioinformatics. 2005. May 1;21(9):1876–90. 10.1093/bioinformatics/bti244
- 22. Brown DP, Krishnamurthy N, Sjölander K. Automated protein subfamily identification and classification. PLoS Comput Biol. 2007. August;3(8):e160 10.1371/journal.pcbi.0030160
- 23. Costa EP, Vens C, Blockeel H. Top-down clustering for protein subfamily identification. Evol Bioinform Online. 2013;9:185–202. 10.4137/EBO.S11609
- 24. Kelil A, Wang S, Brzezinski R, Fleury A. CLUSS: clustering of protein sequences based on a new similarity measure. BMC Bioinformatics. 2007. August 4;8(1):286.
- 25. Handl J, Knowles J, Kell DB. Computational cluster validation in post-genomic data analysis. Bioinformatics. 2005. August 1;21(15):3201–12. 10.1093/bioinformatics/bti517
- 26. Altenhoff AM, Škunca N, Glover N, Train C-M, Sueki A, Piližota I, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res. 2015. January 28;43(Database issue):D240–9. 10.1093/nar/gku1158
- 27. Tatusov RL, Galperin MY, Natale DA, Koonin E V. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000. January 1;28(1):33–6. 10.1093/nar/28.1.33
- 28. Abby SS, Rocha EPC. The Non-Flagellar Type III Secretion System Evolved from the Bacterial Flagellum and Diversified into Host-Cell Adapted Systems. Achtman M, editor. PLoS Genet. 2012. September 27;8(9):e1002983 10.1371/journal.pgen.1002983
- 29. Denise R, Abby SS, Rocha EPC. Diversification of the type IV filament superfamily into machines for adhesion, protein secretion, DNA uptake, and motility. PLoS Biol. 2019. July 1;17(7).
- 30. Whitney JC, Howell PL. Synthase-dependent exopolysaccharide secretion in Gram-negative bacteria. Trends Microbiol. 2013. February;21(2):63–72. 10.1016/j.tim.2012.10.001
- 31. Castiblanco LF, Sundin GW. Cellulose production, activated by cyclic di-GMP through BcsA and BcsZ, is a virulence factor and an essential determinant of the three-dimensional architectures of biofilms formed by Erwinia amylovora Ea1189. Mol Plant Pathol. 2018. January;19(1):90–103. 10.1111/mpp.12501
- 32. Franklin MJ, Nivens DE, Weadge JT, Howell PL. Biosynthesis of the Pseudomonas aeruginosa Extracellular Polysaccharides, Alginate, Pel, and Psl. Front Microbiol. 2011;2:167 10.3389/fmicb.2011.00167
- 33. Ates O. Systems Biology of Microbial Exopolysaccharides Production. Front Bioeng Biotechnol. 2015;3:200 10.3389/fbioe.2015.00200
- 34. Low KE, Howell PL. Gram-negative synthase-dependent exopolysaccharide biosynthetic machines. Curr Opin Struct Biol. 2018. December 1;53:32–44. 10.1016/j.sbi.2018.05.001
- 35. Lawrence J. Selfish operons: the evolutionary impact of gene clustering in prokaryotes and eukaryotes. Curr Opin Genet Dev. 1999. December;9(6):642–8. 10.1016/s0959-437x(99)00025-8
- 36. Römling U, Galperin MY. Bacterial cellulose biosynthesis: diversity of operons, subunits, products, and functions. Trends Microbiol. 2015. September;23(9):545–57. 10.1016/j.tim.2015.05.005
- 37. Friedman L, Kolter R. Genes involved in matrix formation in Pseudomonas aeruginosa PA14 biofilms. Mol Microbiol. 2004. February 16;51(3):675–90. 10.1046/j.1365-2958.2003.03877.x
- 38. Ross P, Weinhouse H, Aloni Y, Michaeli D, Weinberger-Ohana P, Mayer R, et al. Regulation of cellulose synthesis in Acetobacter xylinum by cyclic diguanylic acid. Nature. 1987. January;325(6101):279–81. 10.1038/325279a0
- 39. Wang X, Preston JF, Romeo T. The pgaABCD locus of Escherichia coli promotes the synthesis of a polysaccharide adhesin required for biofilm formation. J Bacteriol. 2004. May;186(9):2724–34. 10.1128/JB.186.9.2724-2734.2004
- 40. Vasseur P, Vallet-Gely I, Soscia C, Genin S, Filloux A. The pel genes of the Pseudomonas aeruginosa PAK strain are involved at early and late stages of biofilm formation. Microbiology. 2005. March 1;151(Pt 3):985–97. 10.1099/mic.0.27410-0
- 41. Spiers AJ, Bohannon J, Gehrig SM, Rainey PB. Biofilm formation at the air-liquid interface by the Pseudomonas fluorescens SBW25 wrinkly spreader requires an acetylated form of cellulose. Mol Microbiol. 2003. October 14;50(1):15–27. 10.1046/j.1365-2958.2003.03670.x
- 42. Omelchenko M V, Makarova KS, Wolf YI, Rogozin IB, Koonin E V. Evolution of mosaic operons by horizontal gene transfer and gene displacement in situ. Genome Biol. 2003;4(9):R55 10.1186/gb-2003-4-9-r55
- 43. Koonin E V., Makarova KS, Aravind L. Horizontal Gene Transfer in Prokaryotes: Quantification and Classification. Annu Rev Microbiol. 2001. October;55(1):709–42.
- 44. Hu S-Q, Gao Y-G, Tajima K, Sunagawa N, Zhou Y, Kawano S, et al. Structure of bacterial cellulose synthase subunit D octamer with four inner passageways. Proc Natl Acad Sci U S A. 2010. October 19;107(42):17957–61. 10.1073/pnas.1000601107
- 45. Ji K, Wang W, Zeng B, Chen S, Zhao Q, Chen Y, et al. Bacterial cellulose synthesis mechanism of facultative anaerobe Enterobacter sp. FY-07. Sci Rep. 2016. February 25;6(1):21863.
- 46. Wells JN, Bergendahl LT, Marsh JA. Operon Gene Order Is Optimized for Ordered Protein Complex Assembly. Cell Rep. 2016. February 2;14(4):679–85. 10.1016/j.celrep.2015.12.085
- 47. Zaslaver A, Mayo A, Ronen M, Alon U. Optimal gene partition into operons correlates with gene functional order. Phys Biol. 2006. September 18;3(3):183–9. 10.1088/1478-3975/3/3/003
- 48. Marmont LS, Whitfield GB, Rich JD, Yip P, Giesbrecht LB, Stremick CA, et al. PelA and PelB proteins form a modification and secretion complex essential for Pel polysaccharide-dependent biofilm formation in Pseudomonas aeruginosa. J Biol Chem. 2017. November 24;292(47):19411–22. 10.1074/jbc.M117.812842
- 49. Cue D, Lei MG, Lee CY. Genetic regulation of the intercellular adhesion locus in staphylococci. Front Cell Infect Microbiol. 2012;2:38 10.3389/fcimb.2012.00038
- 50. Purcell EB, Tamayo R. Cyclic diguanylate signaling in Gram-positive bacteria. Shen A, editor. FEMS Microbiol Rev. 2016. September;40(5):753–73. 10.1093/femsre/fuw013
- 51. Whitfield GB, Marmont LS, Howell PL. Enzymatic modifications of exopolysaccharides enhance bacterial persistence. Front Microbiol. 2015. May 15;6:471 10.3389/fmicb.2015.00471
- 52. Echeverz M, García B, Sabalza A, Valle J, Gabaldón T, Solano C, et al. Lack of the PGA exopolysaccharide in Salmonella as an adaptive trait for survival in the host. Casadesús J, editor. PLoS Genet. 2017. May 24;13(5):e1006816 10.1371/journal.pgen.1006816
- 53. Crossman LC, Chaudhuri RR, Beatson SA, Wells TJ, Desvaux M, Cunningham AF, et al. A commensal gone bad: complete genome sequence of the prototypical enterotoxigenic Escherichia coli strain H10407. J Bacteriol. 2010. November;192(21):5822–31. 10.1128/JB.00710-10
- 54. Ofek I, Zafriri D, Goldhar J, Eisenstein BI. Inability of toxin inhibitors to neutralize enhanced toxicity caused by bacteria adherent to tissue culture cells. Infect Immun. 1990. November;58(11):3737–42.
- 55. Madhavan TPV, Sakellaris H. Colonization factors of enterotoxigenic Escherichia coli. Adv Appl Microbiol. 2015;90:155–97. 10.1016/bs.aambs.2014.09.003
- 56. Xu D, Zhang W, Zhang B, Liao C, Shao Y. Characterization of a biofilm-forming Shigella flexneri phenotype due to deficiency in Hep biosynthesis. PeerJ. 2016. July 14;4:e2178 10.7717/peerj.2178
- 57. Sims GE, Kim S-H. Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs). Proc Natl Acad Sci U S A. 2011. May 17;108(20):8329–34. 10.1073/pnas.1105168108
- 58. Jiang K, Sanseverino J, Chauhan A, Lucas S, Copeland A, Lapidus A, et al. Complete genome sequence of Thauera aminoaromatica strain MZ1T. Stand Genomic Sci. 2012. July 30;6(3):325–35. 10.4056/sigs.2696029
- 59. Prombutara P, Allen MS. Flocculation-Related Gene Identification by Whole-Genome Sequencing of Thauera aminoaromatica MZ1T Floc-Defective Mutants. Stams AJM, editor. Appl Environ Microbiol. 2015. December 28;82(6):1646–52. 10.1128/AEM.02917-15
- 60. Fata Moradali M, Donati I, Sims IM, Ghods S, Rehm BHA. Alginate Polymerization and Modification Are Linked in Pseudomonas aeruginosa. MBio. 2015. May 12;6(3):e00453–15. 10.1128/mBio.00453-15
- 61. Morgan JLW, McNamara JT, Fischer M, Rich J, Chen H-M, Withers SG, et al. Observing cellulose biosynthesis and membrane translocation in crystallo. Nature. 2016. March 17;531(7594):329–34. 10.1038/nature16966
- 62. Morgan JLW, Strumillo J, Zimmer J. Crystallographic snapshot of cellulose synthesis and membrane translocation. Nature. 2013. January 10;493(7431):181–6. 10.1038/nature11744
- 63. Du J, Vepachedu V, Cho SH, Kumar M, Nixon BT. Structure of the Cellulose Synthase Complex of Gluconacetobacter hansenii at 23.4 Å Resolution. Lai H-C, editor. PLoS One. 2016. May 23;11(5):e0155886 10.1371/journal.pone.0155886
- 64. Krasteva PV, Bernal-Bayard J, Travier L, Martin FA, Kaminski P-A, Karimova G, et al. Insights into the structure and assembly of a bacterial cellulose secretion system. Nat Commun. 2017. December 12;8(1):2065 10.1038/s41467-017-01523-2
- 65. Heilmann C, Schweitzer O, Gerke C, Vanittanakom N, Mack D, Götz F. Molecular basis of intercellular adhesion in the biofilm-forming Staphylococcus epidermidis. Mol Microbiol. 1996. June;20(5):1083–91. 10.1111/j.1365-2958.1996.tb02548.x
- 66. Itoh Y, Rice JD, Goller C, Pannuri A, Taylor J, Meisner J, et al. Roles of pgaABCD genes in synthesis, modification, and export of the Escherichia coli biofilm adhesin poly-beta-1,6-N-acetyl-D-glucosamine. J Bacteriol. 2008. May 15;190(10):3670–80. 10.1128/JB.01920-07
- 67. Atkin KE, MacDonald SJ, Brentnall AS, Potts JR, Thomas GH. A different path: Revealing the function of staphylococcal proteins in biofilm formation. FEBS Lett. 2014. May 21;588(10):1869–72. 10.1016/j.febslet.2014.04.002
- 68. Gerke C, Kraft A, Süßmuth R, Schweitzer O, Götz F. Characterization of the N -Acetylglucosaminyltransferase Activity Involved in the Biosynthesis of the Staphylococcus epidermidis Polysaccharide Intercellular Adhesin. J Biol Chem. 1998. July 17;273(29):18586–93. 10.1074/jbc.273.29.18586
- 69. Holland LM, O’Donnell ST, Ryjenkov DA, Gomelsky L, Slater SR, Fey PD, et al. A staphylococcal GGDEF domain protein regulates biofilm formation independently of cyclic dimeric GMP. J Bacteriol. 2008. August 1;190(15):5178–89. 10.1128/JB.00375-08
- 70. Vuong C, Kocianova S, Voyich JM, Yao Y, Fischer ER, DeLeo FR, et al. A crucial role for exopolysaccharide modification in bacterial biofilm formation, immune evasion, and virulence. J Biol Chem. 2004. December 24;279(52):54881–6. 10.1074/jbc.M411374200
- 71. Little DJ, Pfoh R, Le Mauff F, Bamford NC, Notte C, Baker P, et al. PgaB orthologues contain a glycoside hydrolase domain that cleaves deacetylated poly-β(1,6) -N-acetylglucosamine and can disrupt bacterial biofilms. Skurnik D, editor. PLOS Pathog. 2018. April 23;14(4):e1006998 10.1371/journal.ppat.1006998
- 72. Wang Y, Andole Pannuri A, Ni D, Zhou H, Cao X, Lu X, et al. Structural Basis for Translocation of a Biofilm-supporting Exopolysaccharide across the Bacterial Outer Membrane. J Biol Chem. 2016. May 6;291(19):10046–57. 10.1074/jbc.M115.711762
- 73. Little DJ, Poloczek J, Whitney JC, Robinson H, Nitz M, Howell PL. The structure- and metal-dependent activity of Escherichia coli PgaB provides insight into the partial de-N-acetylation of poly-β-1,6-N-acetyl-D-glucosamine. J Biol Chem. 2012. September 7;287(37):31126–37. 10.1074/jbc.M112.390005
- 74. Little DJ, Bamford NC, Pokrovskaya V, Robinson H, Nitz M, Howell PL. Structural basis for the De-N-acetylation of Poly-β-1,6-N-acetyl-D-glucosamine in Gram-positive bacteria. J Biol Chem. 2014. December 26;289(52):35907–17. 10.1074/jbc.M114.611400
- 75. Tocheva EI, Ortega DR, Jensen GJ. Sporulation, bacterial cell envelopes and the origin of life. Nat Rev Microbiol. 2016. August 27;14(8):535–42. 10.1038/nrmicro.2016.85
- 76. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. Pfam: the protein families database. Nucleic Acids Res. 2014. January;42(Database issue):D222–30. 10.1093/nar/gkt1223
- 77. Li H, Coghlan A, Ruan J, Coin LJ, Hériché J-K, Osmotherly L, et al. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 2006. January 1;34(Database issue):D572–80. 10.1093/nar/gkj118
- 78. Gori K, Suchan T, Alvarez N, Goldman N, Dessimoz C. No Title. 2016. June 1;33(6).
- 79. Eisen JA, Sweder KS, Hanawalt PC. Evolution of the SNF2 family of proteins: subfamilies with distinct sequences and functions. Nucleic Acids Res. 1995. July 25;23(14):2715–23. 10.1093/nar/23.14.2715
- 80. Wasmuth JD, Pszenny V, Haile S, Jansen EM, Gast AT, Sher A, et al. Integrated bioinformatic and targeted deletion analyses of the SRS gene superfamily identify SRS29C as a negative regulator of Toxoplasma virulence. MBio. 2012. November 13;3(6):e00321-12–e00321-12. 10.1128/mBio.00321-12
- 81. Huynen M, Snel B, Lathe W, Bork P. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res. 2000. August;10(8):1204–10. 10.1101/gr.10.8.1204
- 82. Fang G, Rocha EPC, Danchin A. Persistence drives gene clustering in bacterial genomes. BMC Genomics. 2008. January 7;9(1):4.
- 83. Junier I, Rivoire O. Conserved Units of Co-Expression in Bacterial Genomes: An Evolutionary Insight into Transcriptional Regulation. Moreno-Hagelsieb G, editor. PLoS One. 2016. May 19;11(5):e0155740 10.1371/journal.pone.0155740
- 84. Ausmees N, Jonsson H, Hoglund S, Ljunggren H, Lindberg M. Structural and putative regulatory genes involved in cellulose synthesis in Rhizobium leguminosarum bv. trifolii. Microbiology. 1999. May 1;145(5):1253–62.
- 85. Castiblanco LF, Sundin GW. Cellulose production, activated by cyclic di-GMP through BcsA and BcsZ, is a virulence factor and an essential determinant of the three-dimensional architectures of biofilms formed by Erwinia amylovora Ea1189. Mol Plant Pathol. 2018. January 1;19(1):90–103. 10.1111/mpp.12501
- 86. Ahmad I, Rouf SF, Sun L, Cimdins A, Shafeeq S, Le Guyon S, et al. BcsZ inhibits biofilm phenotypes and promotes virulence by blocking cellulose production in Salmonella enterica serovar Typhimurium. Microb Cell Fact. 2016. October 19;15(1):177 10.1186/s12934-016-0576-6
- 87. Sajadi E, Babaipour V, Deldar AA, Yakhchali B, Fatemi SS-A. Enhancement of crystallinity of cellulose produced by Escherichia coli through heterologous expression of bcsD gene from Gluconacetobacter xylinus. Biotechnol Lett. 2017. September 1;39(9):1395–401. 10.1007/s10529-017-2366-6
- 88. Sandegren L, Andersson DI. Bacterial gene amplification: implications for the evolution of antibiotic resistance. Nat Rev Microbiol. 2009. August;7(8):578–88. 10.1038/nrmicro2174
- 89. Jahn CE, Selimi DA, Barak JD, Charkowski AO. The Dickeya dadantii biofilm matrix consists of cellulose nanofibres, and is an emergent property dependent upon the type III secretion system and the cellulose synthesis operon. Microbiology. 2011. October 1;157(Pt 10):2733–44. 10.1099/mic.0.051003-0
- 90. MacKenzie KD, Palmer MB, Köster WL, White AP. Examining the Link between Biofilm Formation and the Ability of Pathogenic Salmonella Strains to Colonize Multiple Host Species. Front Vet Sci. 2017. August 25;4:138 10.3389/fvets.2017.00138
- 91. Tatusova T, Ciufo S, Fedorov B, O’Neill K, Tolstoy I. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res. 2015. April 20;43(7):3872–3872. 10.1093/nar/gkv278
- 92. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004. March 8;32(5):1792–7. 10.1093/nar/gkh340
- 93. Eddy SR. Accelerated Profile HMM Searches. Pearson WR, editor. PLoS Comput Biol. 2011. October 20;7(10):e1002195 10.1371/journal.pcbi.1002195
- 94. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012. December 1;28(23):3150–2. 10.1093/bioinformatics/bts565
- 95. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009. August 1;25(15):1972–3. 10.1093/bioinformatics/btp348
- 96. Guindon S, Delsuc F, Dufayard J-F, Gascuel O. Estimating Maximum Likelihood Phylogenies with PhyML. In: Methods in molecular biology (Clifton, NJ). 2009. p. 113–37.
- 97. Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987. November 1;20:53–65.
- 98. Dunn JC. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. J Cybern. 1973. January;3(3):32–57.
- 99. Ragonnet-Cronin M, Hodcroft E, Hué S, Fearnhill E, Delpech V, Brown AJL, et al. Automated analysis of phylogenetic clusters. BMC Bioinformatics. 2013. November 6;14.
- 100. Prosperi MCF, De Luca A, Di Giambenedetto S, Bracciale L, Fabbiani M, Cauda R, et al. The Threshold Bootstrap Clustering: A New Approach to Find Families or Transmission Clusters within Molecular Quasispecies. Poon AFY, editor. PLoS One. 2010. October 25;5(10):e13619 10.1371/journal.pone.0013619
- 101. Efron B, Halloran E, Holmes S. Bootstrap confidence levels for phylogenetic trees. Proc Natl Acad Sci. 1996. November 12;93(23):13429–13429. 10.1073/pnas.93.23.13429
- 102. Killcoyne S, Carter GW, Smith J, Boyle J. Cytoscape: a community-based framework for network modeling. Methods Mol Biol. 2009;563:219–39. 10.1007/978-1-60761-175-2_12
- 103. Price MN, Dehal PS, Arkin AP. FastTree 2 – Approximately maximum-likelihood trees for large alignments. PLoS One. 2010. March 10;5(3).
- 104. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 2019. July 2;47(W1):W256–9. 10.1093/nar/gkz239
- 105. Conway JR, Lex A, Gehlenborg N. UpSetR: an R package for the visualization of intersecting sets and their properties. Hancock J, editor. Bioinformatics. 2017. September 15;33(18):2938–40. 10.1093/bioinformatics/btx364
- 106. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, et al. UCSF Chimera – a visualization system for exploratory research and analysis. J Comput Chem. 2004. October;25(13):1605–12. 10.1002/jcc.20084