Abstract
Gene redundancy, which increases gene dosage and functional diversity, remains understudied regarding its roles in evolution and clinical infections. Exploring 22,310 prokaryotic genomes with our custom pipeline, we found that redundant genes, though less frequent than in eukaryotes, are widespread and mainly linked to niche specialization. Evolutionary analyses delineated a propensity for gene redundancy expansion with increasing phylogenetic distance, with pathogens accumulating more redundancy than non-pathogens. Redundant genes are always co-duplicated with translation initiation signals and potentially preserve functionality. In a time-series analysis of 69 Acinetobacter baumannii isolates from multiple sites in severely infected patients, we identified redundant alcohol dehydrogenase genes and translation initiation signals, introduced by gene islands, as advantageous for invasive urinary tract infection throughout its within-patient development and cross-patient transmission. Mouse peritoneal infection models and plasmid transformation experiments confirmed that the redundancy of frmA, an alcohol dehydrogenase gene, is linked to both enhanced virulence and increased biofilm mass in Acinetobacter baumannii. Additional analysis of 898 Enterobacter cloacae complex genomes revealed that redundant metal ion resistance genes carried by mobile genetic elements may provide selective advantages. This study unveiled a moderate gene redundancy within prokaryotes, providing genetic insights into the adaptive evolution and clinical infection of pathogens.
Subject terms: Evolutionary genetics, Pathogens, Clinical genetics, Evolutionary biology, Bacterial infection
Gene redundancy in prokaryotes remains understudied. This analysis shows redundant genes are widespread, with pathogens accumulating more than non-pathogens, and are linked to virulence of Acinetobacter baumannii.
Introduction
Gene redundancy refers to the occurrence of more than one copy of the same gene within a genome. It is of significant interest for eukaryotes, with observed genome-wide duplication rates ranging from 10⁻⁷ to as high as 10⁻⁵ duplications per gene per generation at specific loci in multicellular eukaryotes1. These complex genome organizations in eukaryotes not only contribute to phenotypic diversity but also facilitate the development of innovative therapeutic approaches2. While gene redundancy in eukaryotes is relatively well-documented, its characteristics in prokaryotes are still understudied in comparison. The advancement of high-throughput sequencing technology, coupled with an enhanced understanding of the importance of microorganisms, has catalyzed recent investigations into gene redundancy in prokaryotes3,4. These investigations have highlighted the crucial roles of gene redundancy in prokaryotes for fostering innovation, increasing mutation tolerance, and facilitating environmental adaptation5–8. Additionally, recent research has observed that redundant genes are abundant in bacterial clinical isolates9, especially plasmids carrying duplicated genes, which contribute to bacterial evolution10. However, most of these studies are observational descriptions of specific prokaryotes, genes, or genomic regions. There is still a need to explore redundancy in prokaryotes on a large scale, encompassing its evolutionary characteristics and roles in clinical infections.
Gene redundancy has long been recognized as a primary source for the emergence of evolutionary novelties, contributing to at least 50% of prokaryotic and over 90% of eukaryotic genes11. Subsequent evolutionary divergence may arise in redundant genes, particularly under changing environmental conditions12,13. Recent investigations into Nitrososphaerales have highlighted that the fate of redundant genes is lineage-specific; some are susceptible to loss in several lineages, whereas others are well preserved and experience wide duplication, indicating niche-specific roles14. The fates of preserved redundant genes are partly influenced by the nature of mutations, leading to outcomes including evolving into pseudogenes (pseudogenization), acquiring novel functions (neofunctionalization), or partitioning ancestral functions (subfunctionalization)13. Translation initiation signals might co-duplicate with genes and subsequently undergo comparable evolutionary trajectories alongside them, yet remain inadequately investigated in prokaryotes. Studies in eukaryotes, such as Saccharomyces cerevisiae, have demonstrated a tendency for regulatory elements, including translation initiation signals, to be co-duplicated with the corresponding genes11. Nevertheless, the variability among different duplication mechanisms is significant; for example, copies arising from transposon-mediated duplication, tandem duplication, and retroduplication may lack regulatory elements, such as translation initiation signals15. Therefore, although models suggest the pseudogenization of one copy as one of the most common fates of redundant genes13, it is unclear how frequently redundant genes in prokaryotes follow an evolutionary trajectory from formation to loss of function. It is equally unclear how translation initiation signals are co-duplicated with their corresponding genes, and how multiple copies subsequently diverge in prokaryotes.
Redundant genes in prokaryotic pathogenic genomes, particularly those resulting from horizontal gene transfer (HGT), are frequently associated with microbial pathogenicity and resistance16,17. Research on Escherichia coli, for instance, has identified different patterns of redundant genes between pathogenic and non-pathogenic strains, with some redundant genes being pathotype-specific18. In pathogenic Leptospira, HGT-derived redundant genes have not only facilitated the acquisition of virulence factors but also significantly expanded some virulence-related protein families, such as metalloproteases-associated paralogs19. Moreover, the selective pressure exerted by antibiotic exposure has heightened the mobilization and transfer of antibiotic resistance genes (ARGs) to pathogenic bacteria, leading to the well-known issue of ineffective infection treatments20. Recent research has also revealed that redundant ARGs are enriched in bacterial isolates from humans and livestock, environments heavily associated with antibiotic use9. These studies have mainly focused on virulence and resistance genes, but lack comprehensive insights into other redundant genes that play crucial roles in the evolution of virulence and resistance, especially during clinical infections. Therefore, an exhaustive exploration of the impact of redundant genes by directly correlating genomic data with clinical phenotypes is imperative. Such exploration is particularly crucial for ESKAPE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacter spp.), known for their ability to ‘escape’ antibiotics and for being the leading causes of hospital-acquired infections21,22.
In this study, we comprehensively investigated gene redundancy across a dataset of 22,310 prokaryotic genomes using our newly developed pipeline, SoDpipe (available at https://github.com/Peihw/prokaryotes-g-e). We delved into the characteristics, duplication mechanisms, and evolutionary fate of redundant genes and their translation initiation signals in prokaryotes, drawing comparisons to model eukaryotic organisms. We found that redundant genes are highly prevalent in prokaryotic genomes, with functions primarily associated with niche specialization. These redundant genes often co-duplicated with their translation initiation signals to ensure functionality. Evolutionary analyses indicated a tendency for gene redundancy to expand with increasing phylogenetic distance, with pathogens accumulating more evolutionary divergence and redundancy than non-pathogens. Additionally, we explored the roles of gene redundancy in pathogen evolution and clinical infections by a time series analysis of redundant genes across 69 A. baumannii strains collected from multiple sites of severely infected patients and environmental surfaces, along with a thorough investigation of 898 publicly available Enterobacter cloacae complex (ECC) strains. Our findings, further supported by experimental validation, demonstrated that the gene redundancy of interest is linked to the enhanced virulence and increased biofilm mass. This study may enrich the global understanding of gene redundancy in prokaryotes, providing evolutionary insights and emphasizing its importance in clinical infections of pathogens.
Results
Redundant genes are prevalent in prokaryotic genomes, with pathogens exhibiting higher redundancy ratios compared to non-pathogens
To effectively identify redundant genes in prokaryotic genomes on a large scale, we developed SoDpipe, a pipeline tailored for the automated analysis of redundant genes and their translation initiation signals in prokaryotes (see Methods, Supplementary Information, and Supplementary Fig. 1). We employed SoDpipe on 22,310 complete prokaryotic genomes from RefSeq, belonging to 6,450 species (Supplementary Data 1). We applied two criteria, widely used and consistent with previous studies23,24, to identify redundant genes, resulting in two datasets: Set-80 (80% identity and 80% coverage) and Set-100 (100% identity and 95% coverage). Set-100 comprised identical redundant genes most likely derived from recent duplications, while Set-80 may also include xenogeneic genes derived from HGT23. In total, we identified 2,432,287 redundant genes in 21,842 genomes, forming 800,486 redundant clusters, in Set-80, and 1,042,374 redundant genes in 20,138 genomes, forming 307,360 clusters, in Set-100. Additionally, we found a significant positive correlation between the number of redundant genes and prokaryotic genome size (Supplementary Fig. 2; Set-80: p-value < 0.001, r = 0.534; Set-100: p-value < 0.001, r = 0.246), consistent with previous research7.
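The two threshold sets can be illustrated with a minimal sketch. This is not SoDpipe itself: the hit tuple layout, the function name `cluster_redundant_genes`, the toy gene IDs, and the use of single-linkage grouping are all assumptions made for illustration only.

```python
from collections import defaultdict

# Thresholds quoted in the text: (minimum identity %, minimum coverage %).
CRITERIA = {"Set-80": (80.0, 80.0), "Set-100": (100.0, 95.0)}


def cluster_redundant_genes(hits, min_identity, min_coverage):
    """Group genes linked by hits that pass both thresholds (single linkage).

    hits: iterable of (gene_a, gene_b, identity_pct, coverage_pct) tuples
    describing within-genome pairwise alignments (hypothetical layout).
    Returns a sorted list of clusters, each a sorted list of gene IDs.
    """
    parent = {}

    def find(gene):
        # Union-find lookup with path halving.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b, identity, coverage in hits:
        if a != b and identity >= min_identity and coverage >= min_coverage:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for gene in list(parent):
        groups[find(gene)].append(gene)
    return sorted(sorted(members) for members in groups.values())


# Toy hits (hypothetical values, not taken from the study).
hits = [
    ("g1", "g2", 100.0, 99.0),
    ("g2", "g3", 85.0, 90.0),
    ("g4", "g5", 79.0, 95.0),
]
set80 = cluster_redundant_genes(hits, *CRITERIA["Set-80"])
set100 = cluster_redundant_genes(hits, *CRITERIA["Set-100"])
```

In this toy input, the stricter Set-100 thresholds keep only the identical pair, whereas Set-80 also links the more divergent copy into the same cluster.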
Given the compact and efficient nature of prokaryotic genomes, the prevalence of redundant genes within a bacterial genome may initially appear counterintuitive8. Yet, our analysis indicated that redundant genes were prevalent across most phyla of prokaryotic genomes, with their presence observed in 97.9% of genomes in Set-80 and 90.3% in Set-100 (Fig. 1a and Supplementary Fig. 3a). Considering the inherent variability in prokaryotic genome size and our analysis showing that prokaryotic genome size increases with the number of redundant genes, we have defined the redundancy ratio as the number of redundant genes divided by the number of protein-coding genes. As a result, the redundancy ratio in each prokaryotic genome showed large variability, ranging from 0% to 55.5%, with an average of 2.9% for Set-80 (the average redundancy ratio in Set-100 was 1.4%; Supplementary Data 1). Notably, after excluding phyla represented by only one genome, Tenericutes exhibited the highest average redundancy ratio, reaching 5.0% in Set-80 and 3.1% in Set-100, followed by Planctomycetes, Cyanobacteria, and Fusobacteria.
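The redundancy ratio defined above is a simple proportion, sketched below. The function name and the example counts are hypothetical and serve only to make the definition concrete.

```python
def redundancy_ratio(n_redundant_genes, n_protein_coding_genes):
    """Redundant genes as a percentage of all protein-coding genes."""
    if n_protein_coding_genes <= 0:
        raise ValueError("genome must contain at least one protein-coding gene")
    return 100.0 * n_redundant_genes / n_protein_coding_genes


# Hypothetical genome: 120 redundant genes among 4,000 protein-coding genes.
ratio = redundancy_ratio(120, 4000)
```

Normalizing by the number of protein-coding genes makes genomes of different sizes comparable, which matters here because the raw count of redundant genes correlates with genome size.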
Fig. 1. The distribution of redundancy ratios in 22,310 complete prokaryotic genomes in Set-80.
a The redundancy ratios for each phylum (n ≥ 5). The lower triangle indicates the average redundancy ratio. b The redundancy ratios for pathogens (n = 233) and non-pathogens (n = 64) in E. coli strains (p-value < 2.2e-16, Wilcoxon rank-sum test, two-sided). c The redundancy ratios for pathogens and non-pathogens in each phylum. The exact number and percentage of genomes with redundant genes are detailed in parentheses. Two-sided Wilcoxon rank-sum tests were performed, and p-values were adjusted for multiple testing using the false discovery rate method. d The distribution of redundancy ratios in prokaryotes and the ratio of three model eukaryotic organisms are represented as dashed lines. The percentage of prokaryotic genomes with a smaller redundancy ratio than each eukaryotic organism is indicated, separated by commas. Box plots indicate median (middle line), 25th, 75th percentile (box) and 5th and 95th percentile (whiskers) as well as outliers (single points). Source data are provided as a Source Data file. * p-value < 0.05; ** p-value < 0.01; *** p-value < 0.001; **** p-value < 0.0001.
To assess the difference in gene redundancy between pathogens and non-pathogens, we annotated the pathogenicity of each microbe (see Methods) and discovered that pathogens had a significantly higher redundancy ratio than non-pathogens (p-value < 0.001, Wilcoxon rank-sum test). Specifically, in Set-80, the redundancy ratio for pathogens was 3.3 ± 3.4%, compared to 2.3 ± 2.3% in non-pathogens. Similarly, in Set-100, the ratio was 1.7 ± 2.3% for pathogens and 1.0 ± 1.5% for non-pathogens. This conclusion is supported by observations at the strain level showing a significantly higher redundancy ratio in pathogenic E. coli strains (n = 233; 8.7 ± 3.1% in Set-80, 3.4 ± 1.8% in Set-100) compared to non-pathogenic strains (n = 63; 3.1 ± 1.6% in Set-80, 1.2 ± 1.3% in Set-100; p-value < 0.001, Wilcoxon rank-sum test; Fig. 1b and Supplementary Fig. 3b). This suggested that some redundant genes were pathotype-specific and strain-specific, consistent with previous studies on 38 E. coli strains18. We found that, with the exception of the Chlamydiae phylum, pathogens consistently exhibited a higher redundancy ratio compared to non-pathogens across all phyla (Fig. 1c and Supplementary Fig. 3c). We further annotated the functions of the redundant genes using the COG (Clusters of Orthologous Groups) database25 and found that pathogenic bacteria contained more redundant genes involved in ‘nucleotide transport and metabolism’ (class F, pathogens vs. non-pathogens: 0.8 ± 1.8% vs. 0.7 ± 2.7%, adjusted p-value < 0.05, Wilcoxon rank-sum test, Supplementary Data 2), ‘secondary metabolite transport and metabolism’ (class Q, 1.4 ± 3.4% vs. 1.3 ± 2.8%, adjusted p-value < 0.05), ‘energy production and conversion’ (class C, 4.5 ± 7.3% vs. 3.9 ± 6.5%, adjusted p-value < 0.05), and ‘extracellular structures’ (class W, 0.45 ± 2.5% vs. 0.38 ± 1.8%, adjusted p-value < 0.05) in Set-80.
This observation is similar to findings from a previous study26, which demonstrated that fungal plant pathogens possess additional genes involved in secondary metabolism and in managing interactions with hosts compared to non-pathogens.
To better understand the degree of gene redundancy in prokaryotes, we calculated the redundancy ratio for several eukaryotic model organisms, including S. cerevisiae, Caenorhabditis elegans, and Arabidopsis thaliana. The redundancy ratios of these three eukaryotes were 8.0%, 33.4%, and 41.5% in Set-80, respectively, and 2.3%, 6.2%, and 14.5% in Set-100. Thus, most prokaryotes exhibit a lower redundancy ratio compared to eukaryotes. Specifically, the redundancy ratio of S. cerevisiae exceeds that of 95.1% of prokaryotic genomes under Set-80 parameters and 82.7% under Set-100 parameters, while the redundancy ratios of C. elegans and A. thaliana exceed those of 99.9% (95.4% in Set-100) and nearly 100.0% (nearly 100.0% in Set-100) of prokaryotic genomes under Set-80 parameters, respectively (Fig. 1d and Supplementary Fig. 3d).
Redundant genes contribute to niche specialization
To survey the functional distribution of redundant genes in prokaryotes, we annotated all redundant genes and discovered their involvement across all COG functional categories, with a notable predominance in ‘mobilome: prophages, transposons’ (class X) (Fig. 2a and Supplementary Fig. 4a). Class X genes serve as vehicles for transmitting many vital genes associated with adaptive traits, thereby playing a crucial role in bacterial adaptation27. By leveraging comprehensive plasmid annotations in RefSeq, we further observed via mixed-effects linear regression that the redundant gene ratio increased with the number of plasmids (Set-80, p-value = 3.11 × 10⁻⁹, slope coefficient = 1.633 × 10⁻³; Set-100, p-value = 1.43 × 10⁻⁸, slope coefficient = 9.057 × 10⁻⁴). This suggests that mobile genetic elements may actively contribute to the formation of redundant genes. To further assess whether redundant genes tend to be indispensable for survival, we annotated them using the DEG (Database of Essential Genes) database28. The results revealed that, across most phyla, essential genes represented only a minor proportion of redundant genes (Fig. 2b, Supplementary Fig. 4b, Supplementary Data 3). A previous study on E. coli supported the above observations and highlighted that genes involved in fundamental cellular processes are highly essential and less frequently duplicated than those facilitating interactions and adaptation to diverse environments29.
Fig. 2. Functional distribution of redundant genes in Set-80.
a The heatmap presents the proportion of functions of redundant genes in each COG category. The black asterisk indicates a significantly higher proportion of the corresponding category (p-value < 0.05, Fisher’s exact test, two-sided). b The bar plot to the right presents the proportion of essential functions and non-essential functions, annotated by the DEG database. The black asterisk indicates a significantly higher proportion of non-essential than essential functions (adjusted p-value < 0.05, Wilcoxon rank-sum test, two-sided, Supplementary Data 3). The hyphen represents all the other redundant genes that were not assigned a COG function. c, d Changes in the functional abundance of redundant genes between the Terrestrial (n = 265), Marine (n = 383), and Freshwater (aquatic environments other than marine, including freshwater, lake, river, sediment, and sludge; n = 295) groups, presented as COG categories and COG IDs, respectively. The heatmap shows the differentially abundant redundant functions, with the color scale calculated using q-values and beta coefficients generated from the general linear model in MaAsLin2. The Terrestrial group was set as a reference. The scale indicates the degree of enrichment (red) or depletion (blue) compared to the Terrestrial group, with deeper colors signifying greater significance. e, f Changes in the functional abundance of redundant genes between Human (n = 297), Plant (n = 123), and Bird (n = 113) groups, presented as COG categories and COG IDs, respectively. The Human group was set as a reference. * p-value < 0.05; ** p-value < 0.01.
To advance the understanding of the functional distribution of redundant genes within prokaryotic genomes across different environments, we classified prokaryotes into six groups according to their isolation source (see Methods). By comparing microbes isolated from the Marine, Freshwater, and Terrestrial groups, representing three natural environments, we found that redundant genes in class N ‘Cell Motility’ were significantly more abundant in the Marine group, with genes such as COG1344 being associated with flagellum formation and swimming30 (adjusted p-value < 0.05, Fig. 2c, d, Supplementary Data 4). In the Terrestrial group, we observed a significant enrichment of class G ‘Carbohydrate Transport and Metabolism’ (including COG1023, COG1134, and COG0366), which may be related to the nutrient scarcity typically found in aquatic environments31. Additionally, the abundance of redundant genes associated with signal transduction and stress response was significantly higher in the Terrestrial group (including COG2229 and COG2018). The greater environmental variability in terrestrial ecosystems likely necessitates more flexible signal transduction mechanisms that enable rapid adaptation to environmental changes32. The comparison of bacteria among Human, Plant, and Bird groups showed that class K ‘Transcription’ (including COG1278, a cold shock protein, adjusted p-value < 0.05, Fig. 2e, f, Supplementary Data 4) was significantly enriched in isolates from plants. Because plants experience large diurnal temperature fluctuations and pronounced seasonal climate changes, in contrast to the relatively stable body temperatures of humans and birds, these redundant genes may enhance the ability of plant-associated microbes to tolerate cold stress33. Moreover, the typically higher body temperature of birds may impose demands on fatty acid synthesis and lipid metabolism, leading to the redundancy of COG0304 (3-oxoacyl-[acyl-carrier-protein] synthase, class Q) in isolates from birds34.
Our findings suggested that redundant genes might contribute to promoting niche specialization.
Redundant genes exhibit evolutionary expansion, potential functionality, and co-duplication with nearby translation initiation signals
To uncover the evolutionary fate and consequences of redundant genes in prokaryotes, we calculated the phylogenetic distance of each species, based on the relatively complete phylogenetic tree provided by the All-Species Living Tree project35. We then grouped all genomes into 15 bins according to their phylogenetic distances with an interval of 0.05 and calculated the average redundancy ratio for each group. We found a positive linear relationship between the average phylogenetic distance and the average redundancy ratio (Fig. 3a, linear regression: R² = 0.62, y = -0.286 + 5.62x, p-value < 0.05 in Set-80; Supplementary Fig. 5a, R² = 0.74, y = 0.201 + 1.77x, p-value < 0.05 in Set-100). Across the majority of phyla, pathogens demonstrate a greater phylogenetic distance from the common ancestor compared to non-pathogens (Supplementary Fig. 5b). The results indicated that bacteria with greater phylogenetic distances were more likely to exhibit a higher redundancy ratio in their genomes, suggesting a trend of expanding redundant genes over the course of prokaryotic evolution.
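The binning and regression described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names, the toy distances and ratios, and the choice to regress per-bin means with ordinary least squares are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def bin_means(distances, ratios, width=0.05):
    """Mean distance and mean redundancy ratio per fixed-width distance bin."""
    bins = defaultdict(list)
    for d, r in zip(distances, ratios):
        bins[int(d // width)].append((d, r))
    xs, ys = [], []
    for key in sorted(bins):
        ds, rs = zip(*bins[key])
        xs.append(mean(ds))
        ys.append(mean(rs))
    return xs, ys


def least_squares(xs, ys):
    """Return (intercept, slope, r_squared) for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    return my - slope * mx, slope, (sxy * sxy) / (sxx * syy)


# Toy genomes (hypothetical): phylogenetic distance and redundancy ratio (%).
distances = [0.01, 0.03, 0.06, 0.08, 0.12, 0.14]
ratios = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0]
bin_x, bin_y = bin_means(distances, ratios)
intercept, slope, r_squared = least_squares(bin_x, bin_y)
```

Regressing bin averages rather than individual genomes smooths within-bin variability, so the resulting R² describes the trend across bins and not the spread among individual genomes.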
Fig. 3. Evolutionary patterns of redundant genes and their translation regulation.
a Relationship of redundancy ratio and phylogenetic distance in Set-80. The column shows the average redundancy ratio per bin, with the horizontal axis indicating phylogenetic distance. Plus signs mark mean distances, and linear regression with 95% confidence intervals depicts the relationship between redundancy ratio and phylogenetic distance. b Assessment of function in redundant clusters in all COG categories in Set-80. Because most clusters contain two genes, clusters with a ratio > 0.67 were classified as non-functional, < 0.33 as functional, and those in between as semi-functional. c Relationship of translational divergence with synonymous substitution rate. Redundant clusters with ds ≤ 1 (78.5%) were binned at 0.2 intervals (groups A–F), while those with ds > 1 formed group G. The average synonymous substitution rates of bins A to G on the horizontal axis are 0, 0.07, 0.29, 0.49, 0.70, 0.90, and 8.17; the vertical axis shows the proportion of clusters with changes in translation initiation in each bin. d Relationship of translational divergence with nonsynonymous substitution rate. All redundant clusters with dn ≤ 1 were binned at 0.02 intervals. The average nonsynonymous substitution rates of bins A to G on the horizontal axis are 0, 0.01, 0.03, 0.05, 0.07, 0.09, and 0.13; the vertical axis shows the proportion of clusters with changes in translation initiation in each bin. Source data are provided as a Source Data file.
Pseudogenization, recognized as a primary mechanism leading to gene loss36, is attributed to the accumulation of disabling mutations that cause the insertion of a premature stop codon or disrupt the reading frame37,38. After conducting a thorough search for premature stop codons in the recently formed redundant genes (see Methods), it was intriguing to find that only 5.9% of all redundant genes in Set-80 and 6.0% in Set-100 possessed these codons (Supplementary Fig. 5c, d), suggesting that the majority of these recently formed redundant genes did not evolve into pseudogenes due to premature stop codons. Additionally, an average of only 4.4% of the redundant genes in each genome possessed premature stop codons. An exceptionally high proportion of premature stop codons has been observed in the genomes of some Mycoplasma species (87.8% ± 14.4%), potentially due to these small parasitic bacteria evolving towards genome reduction39.
To infer whether redundant genes have potential functionality, we examined the 5’ upstream regions of all redundant genes and successfully detected translation initiation signal motifs in the majority of these genes, suggesting that most redundant genes had structurally complete upstream translation initiation signals for expression (see Methods). Using the presence of premature stop codons and translation initiation signals as indicators of gene functionality, we subsequently categorized the clusters of redundant genes into functional, semi-functional, and non-functional groups. The results revealed that 91.0% (89.7% in Set-100) of the clusters in Set-80 were functional, 3.4% (0.7% in Set-100) were semi-functional, and 5.7% (9.7% in Set-100) were non-functional, indicating that most daughter genes potentially preserve functionality after duplication and short-term evolution. The higher proportion of semi-functional clusters in Set-80 compared to Set-100 hinted that over longer periods of evolution, one copy might sometimes degrade into a pseudogene while the other remained functional. Additionally, we found that high proportions of redundant gene clusters were functional across all COG categories, although variations were observed between different categories (Fig. 3b and Supplementary Fig. 5e). Specifically, the proportions of functional clusters were much higher in categories pertaining to vital functions, such as transcription, translation, and basic metabolism, while the categories related to cellular processes and signaling, as well as the mobilome, were more likely to lose functionality.
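A minimal sketch of this three-way classification is given below. The 0.33 and 0.67 cut-offs are those stated in the Fig. 3 legend; everything else is an assumption for illustration, in particular that the "ratio" is the fraction of gene copies in a cluster judged non-functional, and that a copy counts as functional when it has no premature stop codon and has a detectable translation initiation signal. The function names are hypothetical.

```python
def gene_is_functional(has_premature_stop, has_initiation_signal):
    """Assumed rule: intact reading frame plus a detected initiation signal."""
    return (not has_premature_stop) and has_initiation_signal


def classify_cluster(genes, low=0.33, high=0.67):
    """Classify a redundant cluster from its per-copy annotations.

    genes: list of (has_premature_stop, has_initiation_signal) tuples.
    The ratio is the assumed fraction of non-functional copies.
    """
    if not genes:
        raise ValueError("cluster must contain at least one gene")
    n_bad = sum(
        1 for stop, signal in genes if not gene_is_functional(stop, signal)
    )
    ratio = n_bad / len(genes)
    if ratio > high:
        return "non-functional"
    if ratio < low:
        return "functional"
    return "semi-functional"


# Two-copy clusters (hypothetical annotations).
both_intact = classify_cluster([(False, True), (False, True)])
one_degraded = classify_cluster([(True, True), (False, True)])
both_degraded = classify_cluster([(True, True), (False, False)])
```

With two-copy clusters the ratio can only be 0, 0.5, or 1, which is why cut-offs at one third and two thirds separate the three outcomes cleanly.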
To better understand the duplication process in prokaryotes, we compared the translation initiation signal motifs in redundant gene pairs and discovered that 93.2% of these clusters in Set-100 possessed exactly identical signal motifs. Among the remaining 6.8%, further analysis using Tomtom40 showed that 40.3% of these clusters exhibited highly similar signal motifs, with at least a five-nucleotide overlap (Tomtom, p-value < 0.05). Given that the redundant genes in Set-100 can be regarded as having been recently duplicated, these findings suggested a co-duplication of translation initiation signals and coding sequences, with minor variations in translation initiation signals of some clusters possibly attributable to subsequent mutations.
However, identical translation initiation signal motifs were found in only 53.0% of the redundant gene clusters in Set-80, and 33.2% of the remaining clusters showed very similar signal motifs (Tomtom, p-value < 0.05). These relatively large variations compared to Set-100 prompted us to delve deeper into the evolution of translation initiation signals over time. We calculated the synonymous substitution rate (ds) and categorized them into groups A to F at intervals of 0.2, with all remaining ds values greater than one classified into group G. The results showed that the proportion of redundant gene clusters with divergent translation initiation signals increased with ds (Fig. 3c, r = 0.89, p-value < 0.05, groups A to F), peaking in group G. Likewise, redundant gene clusters with greater nonsynonymous substitution rates (dn) were more prone to experiencing mutations in translation initiation signals (Fig. 3d, r = 0.92, p-value < 0.01, groups A to F). These findings indicate that the translation initiation signals of redundant genes have predominantly evolved in conjunction with their coding sequences.
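The grouping by ds and the correlation over groups A to F can be sketched as follows. The group boundaries are an assumption inferred from the bin averages in the Fig. 3 legend (group A as ds = 0, groups B to F as successive 0.2-wide bins up to 1, group G as ds > 1); the function names and the toy clusters are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def ds_group(ds):
    """Assign a cluster to group A-G from its synonymous substitution rate."""
    if ds == 0:
        return "A"
    if ds > 1:
        return "G"
    return "BCDEF"[min(4, math.ceil(ds / 0.2) - 1)]


def divergence_by_group(clusters):
    """Map each group to (mean ds, proportion with divergent signals).

    clusters: iterable of (ds, signals_diverged) tuples.
    """
    grouped = defaultdict(list)
    for ds, diverged in clusters:
        grouped[ds_group(ds)].append((ds, diverged))
    return {
        group: (
            mean(d for d, _ in rows),
            sum(1 for _, v in rows if v) / len(rows),
        )
        for group, rows in sorted(grouped.items())
    }


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )
    return num / den


# Toy clusters (hypothetical): (ds, translation initiation signals diverged?).
clusters = [
    (0.0, False), (0.0, False), (0.1, False), (0.15, True), (0.3, True),
    (0.35, False), (0.5, True), (0.55, True), (1.7, True),
]
summary = divergence_by_group(clusters)
a_to_f = [summary[g] for g in "ABCDEF" if g in summary]
r = pearson_r([m for m, _ in a_to_f], [p for _, p in a_to_f])
```

Group G is left out of the correlation, as in the text, because its ds values are saturated and span a much wider range than the other groups.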
The roles of redundant genes in pathogens during clinical infections and evolution: case studies in A. baumannii and ECC
A. baumannii and ECC are notable nosocomial pathogens, with the capability to cause a wide range of infections, including septicemia, urinary tract infections, and pneumonia41,42. To elucidate the impacts of redundant genes and their translation initiation signals in clinical infections and evolution, we designed a time series observation in intensive care units (ICUs) of a tertiary hospital in Eastern China for six months. We collected a total of 69 A. baumannii strains, encompassing 53 strains from multi-sites of 44 patients in ICUs (sputum, n = 20; stool, n = 11; urine, n = 10; blood, n = 12) and 16 strains from environmental surfaces (Fig. 4a, Supplementary Data 5, the strains are designated S1 to S69). As a result, we identified 2,471 redundant clusters, containing 5,136 redundant genes, with each strain averaging 35 redundant clusters and 74 redundant genes. One ARG, eptA, was prevalently redundant (56/69, 81.2%) in the isolates. As for the virulence-related redundancy, one environmental isolate acquired redundant flmH, a gene specifically found in Aeromonas hydrophila for polar flagella formation, and another sputum isolate obtained redundant KPN_02274, an ompA gene from K. pneumoniae for the translocation of effector molecules of the type VI secretion system (Supplementary Fig. 6).
Fig. 4. Sampling and experiment design for associations between ADH gene (frmA) redundancy and enhanced pathogenicity.
a Sampling design for the clinical implications of gene redundancy and experimental design to support the associations between ADH gene redundancy and enhanced pathogenicity. b Timeline of sample collection from two patients. Node colors indicate different isolation results and sequence types stratified by ADH gene copy number. The nodes linked by target curve arrows had the transmission relationships inferred by the cgSNV-based MST. The number of differential SNVs between the linked isolates is labeled on the arrows. S1 to S5 represent different A. baumannii strains isolated from patients chronologically during the hospitalization. Periods of antibiotic administration are colored in the timeline. c Survival curves for mice infected by single- or double-copy frmA A. baumannii strains of the same clinically relevant lineage. The horizontal axis represents time after peritoneal bacterial injection, and the vertical axis represents the mouse survival rate. One strain with double copies of frmA (red) and three strains with a single copy (blue) were tested, with five mice per strain (total n = 20). d The amounts of established biofilm on the plate lid of the 20 mice were measured. OD, optical density. p-value = 0.003, Wilcoxon rank-sum test, two-sided. Box plots indicate median (middle line), 25th, 75th percentile (box) and 5th and 95th percentile (whiskers) as well as outliers (single points). e Relative expression levels of endogenous frmA with S41-WT (A. baumannii S41) as the control strain and relative expression levels of exogenous plasmid-borne frmA with S41/frmA (A. baumannii S41/pYMAb2-frmA(ApmR)) as the control. f The amounts of established biofilm measured by a crystal violet staining assay at 595 nm. Error bars represent standard deviation, n = 3 independent replicates. Data are presented as mean values ± SD. Source data are provided as a Source Data file. This figure was created with BioRender.com.
Remarkably, two genes within the ADH gene family, frmA (ADH class-III) and putative zinc-binding ADH (ZADH), were found as double copies (Set-80) in A. baumannii isolated from urinary tracts, in contrast to their single-copy presence in most strains from other sources (Fisher’s exact test, p-value < 0.05, Table 1, Supplementary Data 6, Supplementary Fig. 6). ADH serves as a pivotal element in modulating quorum sensing systems, enhancing bacterial motility, and facilitating biofilm formation and growth, with the latter responsible for nosocomial infections, especially in urinary tracts43.
Table 1.
Comparison of redundant genes in A. baumannii strains isolated from different sources
| Gene | Blood | Environment | Sputum | Stool | Urine | p-value (Urine vs. Others) |
|---|---|---|---|---|---|---|
| alcohol dehydrogenase class III (frmA) | 2 (16.7%) | 3 (18.8%) | 2 (10%) | 1 (9.1%) | 7 (70%) | 0.001 |
| putative zinc-binding alcohol dehydrogenase (putative ZADH) | 2 (16.7%) | 3 (18.8%) | 0 (0%) | 1 (9.1%) | 6 (60%) | 0.001 |
| acetoin dehydrogenase | 1 (8.3%) | 5 (31.2%) | 7 (35%) | 6 (54.5%) | 8 (80%) | 0.011 |
| aldehyde dehydrogenase | 12 (100%) | 12 (75%) | 20 (100%) | 10 (90.9%) | 9 (90%) | 1.000 |
| yxaF | 5 (41.7%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 1.000 |
| ephD | 3 (25%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 1.000 |
| astC | 2 (16.7%) | 9 (56.2%) | 10 (50%) | 5 (45.5%) | 8 (80%) | 0.045 |
| tnsB | 5 (41.7%) | 6 (37.5%) | 4 (20%) | 1 (9.1%) | 0 (0%) | 0.102 |
Note: p-value <0.05; Fisher’s exact test.
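The urine-versus-others comparison in Table 1 can be illustrated with a small, dependency-free sketch of a two-sided Fisher's exact test. This is not the authors' analysis code: the function name `fisher_exact_two_sided` is hypothetical, and the 2x2 table for frmA is derived here by pooling the blood, environment, sputum, and stool columns (8 of 59 isolates with two copies) against urine (7 of 10).

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x double-copy isolates in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# frmA row of Table 1 (pooling of the non-urine sources is an assumption):
# urine: 7 double-copy, 3 single-copy; others: 8 double-copy, 51 single-copy.
p_frmA = fisher_exact_two_sided(7, 3, 8, 51)
print(f"frmA, urine vs. others: p = {p_frmA:.4f}")
```

The small tolerance factor guards against floating-point ties when two tables have the same probability.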
Upon examining the gene environment surrounding the frmA and ZADH copies, we observed that one frmA copy, flanked by two hypothetical proteins, was conserved across the majority of strains (67/69, 97.1%). In contrast, the additional frmA copy, surrounded by frmR and frmB, was detected in only 21.7% (15/69) of the strains (Fig. 5a). These three genes constituted the frmRA(B) operon, which was situated within plasmid-related gene islands identified by IslandViewer and frequently positioned downstream of insertion sequences (ISs) (11/15, 73.3%). The frmRA(B) operon was previously detected in E. coli, and a variant was recently detected in a plasmid of Acinetobacter sp. Tol 5 (pTol5, AP024709.1) (Fig. 5b). Although both frmA copies had intact translation initiation signals, they showed limited similarity in both their coding sequences and their translation initiation signals (Tomtom, p-value > 0.05). The additional frmA copy, along with its upstream translation initiation signal, nested within the frmRA(B) operon, was homogeneous between A. baumannii and pTol5 (Fig. 5b). Therefore, the frmRA(B) operon might have been acquired through HGT from close relatives of A. baumannii.
Fig. 5. The redundant genes in A. baumannii and ECC.
a The gene environment of frmA copies and putative ZADH copies. b Comparison between frmRA(B) operons from A. baumannii (this study), E. coli strain SQ2203, and Acinetobacter sp. Tol 5 plasmid, as well as the conserved frmA copy in A. baumannii (this study). c HGT of redundant genes between ECC and human gut microorganisms.
Regarding the putative ZADH gene copies, while one copy surrounded by two hypothetical proteins was consistently retained across most strains (68/69, 98.6%), the additional copy, positioned downstream of a phage fragment, tRNA-Asn(gtt) and ISs within plasmid-related gene islands identified by IslandViewer, was observed in 17.4% (12/69) of the strains (Fig. 5a). The translation initiation signals of both copies were intact and significantly similar (Tomtom, p-value < 0.05) in most cases (11/12, 91.7%). It has been reported that tRNA-Asn(gtt) serves as a recognition site for recombination or counterbalances the compositional differences between the phage and bacterial genomes to adjust the translation capacity during bacterial infection44. This suggests that the observed redundancy likely arose from recombination events and offers further evidence for the efficient expression of the putative ZADH gene.
Interestingly, the two ADH genes were co-located and closely situated within the A. baumannii genome in all the redundancy events of putative ZADH. Specifically, the redundant ZADH was located upstream of the IS-frmRA(B) structure with a hypothetical protein in between (Fig. 5a). We defined isolates with this genetic arrangement as the multi-copy ADH genotype. To elucidate the acquisition and transmission dynamics of multi-copy ADH genotype A. baumannii isolates, we investigated longitudinal isolates from two patients (P1 and P2) who developed nosocomial urinary infections after ICU admission (Fig. 4b). By constructing a minimum spanning tree (MST) based on core genome single nucleotide variations (cgSNVs), we revealed both within-host persistence and between-host transmission of the pathogen. Notably, the multi-copy ADH genotype A. baumannii isolates invaded urinary tracts after the initial colonization in the respiratory tracts in both P1 and P2. In P2, this invasive urinary infection persisted for a month, with the multi-copy ADH genes and their translation initiation signals remaining stable over time. Our findings suggest that the redundancy in the ADH gene family potentially enables colonizing A. baumannii strains to invade urinary tracts and establish persistent infections.
To validate the association between ADH redundancy and virulence, we conducted mouse peritoneal infection assays using four clinical isolates, one carrying double-copy frmA and three carrying single-copy frmA, with five mice per strain. All mice infected with the double-copy frmA strain succumbed within 24 hours, whereas those infected with single-copy frmA strains exhibited a higher survival rate throughout the 72-hour observation (Fig. 4c). The results suggest that double-copy frmA isolates exhibit increased virulence. As ADH has been reported to play a role in biofilm formation43, we further investigated whether its redundancy enhances the biofilm mass in A. baumannii. Our quantitative biofilm assays revealed that isolates with double-copy frmA (n = 7) produced significantly more biofilm mass than single-copy frmA isolates (n = 9, Fig. 4d, p-value < 0.05, Wilcoxon rank-sum test).
Furthermore, we validated the effect of frmA redundancy by bacterial transformation experiments. We constructed a plasmid carrying an extra copy of frmA and introduced it into two single-copy frmA isolates, A. baumannii S41 and A. baumannii S65, randomly selected from the clinical isolates (Supplementary Fig. 7a and b, see Methods, Table 2). Quantitative reverse transcription PCR (qRT-PCR) confirmed the successful introduction and expression of the additional frmA copy, which increased biofilm formation in both strains, significantly in A. baumannii S41 (Fig. 4e, f; p-value = 0.02, Student’s t-test) and as a non-significant trend in A. baumannii S65 (Supplementary Fig. 7c; p-value = 0.078, Student’s t-test). When we further enhanced frmA expression in A. baumannii S65 by replacing its native promoter with that of the highly conserved gene ompA, biofilm formation increased accordingly (Supplementary Fig. 7c; p-value < 0.05, Student’s t-test). These results support a positive association between frmA expression levels and biofilm mass in A. baumannii (Supplementary Fig. 7c).
Table 2.
Strains and plasmids used in this study
| Strain/Plasmid | Source |
|---|---|
| A. baumannii S12 | This study |
| A. baumannii S41 | This study |
| A. baumannii S65 | This study |
| A. baumannii S41/pYMAb2-frmA(ApmR) | This study |
| A. baumannii S65/pYMAb2-PompA::frmA(ApmR) | This study |
| A. baumannii S65/pYMAb2-frmA(ApmR) | This study |
| pCAP03-acc(3)IV | Our lab |
| pYMAb2-Hyg | Our lab |
| pYMAb2-ApmR | This study |
| pYMAb2-frmA(ApmR) | This study |
| pYMAb2-PompA::frmA(ApmR) | This study |
To gain further insights into the influence of redundant genes on resistance and virulence in pathogens, we downloaded 977 ECC genome assemblies from the NCBI and successfully reclassified 898 into six species17. Of these, 714 belonged to E. hormaechei, 65 to E. asburiae, 50 to E. cloacae, 26 to E. kobei, 25 to E. ludwigii and 18 to E. roggenkampii (see Methods). Using SoDpipe, we thus identified a total of 30,120 redundant clusters within 898 ECC genomes, containing 64,821 redundant genes. The redundancy ratio ranged from 0.2% to 7.9% in each ECC genome, averaging 1.5% (Supplementary Data 7).
The high proportion of E. hormaechei in our dataset, alongside its epidemiological predominance among ECC species in clinical and environmental settings42,45,46, motivated us to investigate whether redundant genes contribute to its dominance. By comparing E. hormaechei with all other ECC species, we discovered that 15 of 21 resistance- and virulence-related redundant genes, including genes involved in resistance to bactericidal compounds and mercury, anaerobic growth, and chemotaxis, were significantly more prevalent in E. hormaechei (Fisher’s exact test, p-value < 0.05, Supplementary Data 8). Further analysis of these gene clusters suggested that most of them are potentially functional (Supplementary Data 8). Notably, 72.3% (516/714) of the E. hormaechei genomes possessed multiple copies of alkyl/aryl-sulfatase, a gene implicated in resistance to bile salts and sodium dodecyl sulfate47, whereas such redundancy was absent in other ECC species. Moreover, correlation analysis using a presence-absence matrix showed strong co-occurrence among certain redundant genes, particularly merR (P13111), merT (Q51769), and merP (P0A216), which are co-localized within the mercury resistance operon (mer; Supplementary Fig. 8a, b; Supplementary Data 8). Subsequent genome analysis of an isolate (GCF_000724505) demonstrated that two copies of the mer operon reside on separate plasmids, each carrying different antibiotic resistance and virulence genes. Among the 42 E. hormaechei isolates harboring redundant mer operons, the redundancy occurred in plasmid-associated regions in 61.9% of isolates (26/42), suggesting that plasmid-mediated HGT commonly contributes to the spread and formation of mercury-resistance redundancy in E. hormaechei. As previously reported, the mer operon is often co-localized with other beneficial genes, such as those conferring antibiotic resistance, which may further enhance fitness48,49. In E. hormaechei, the retention of multiple plasmids carrying the mer operon may not only strengthen mercury resistance but also enhance the overall fitness and competitiveness of the host strain across diverse environments. Overall, the high prevalence of these resistance- and virulence-related redundant genes in E. hormaechei likely contributes to its competitive advantage over other ECC species in different environments.
The redundant genes are primarily involved in membrane functions, transposition, metal resistance, and DNA recombination (Supplementary Fig. 8c), which also indicates the pivotal role of HGT in acquiring multiple copies of resistance-related genes in the ECC genomes. To further detect HGT between ECC and co-occurring strains, we downloaded 456 human gut bacterial strains from the Human Microbiome Project (HMP) (Supplementary Data 9) and found that 1,218 of 48,565 genes in the ECC pangenome may have been transferred among gut bacteria via HGT. Among them, 59 were redundant genes (Fig. 5c and Supplementary Data 10). Notably, these transfers occurred almost exclusively among Gram-negative bacteria, particularly between ECC and species such as E. coli, K. pneumoniae, and other Klebsiella species. The functions of the transferred redundant genes are largely associated with resistance and virulence, including resistance to silver, copper, and arsenic, as well as toxin-antitoxin systems. These findings highlight the role of E. coli and Klebsiella as reservoirs for the horizontal transfer of multiple copies of resistance- and virulence-related genes50,51, corroborating the aforementioned observations in A. baumannii.
We further discovered a gene island (Supplementary Data 11) that introduced 13 redundant genes along with 27 non-redundant genes, classified as a Copper and Silver Resistance Island (CSRI). This island included an array of metal resistance genes: five cation efflux system proteins, four copper resistance proteins, two silver-binding proteins (SilE), one putative copper-binding protein (PcoE), one sensor kinase (CusS), one transcriptional regulatory protein (CusR), one transcriptional activator protein (CopR), and one silver-exporting P-type ATPase. The identification of three insertion sequences (ISs) within the CSRI suggested that these metal resistance genes were likely gathered via upstream ISs. Given the extensive use of copper and silver ions as biocides in medical and healthcare fields, ECC’s enhanced ability to cope with copper and silver stress likely facilitates its survival52. Furthermore, since copper is essential for host defense, bacteria that develop copper resistance may better evade immune responses, representing a patho-adaptive strategy52. Similar to our observations on the mer operon, ample evidence indicates a positive correlation between heavy metal resistance and antibiotic resistance, with metal resistance facilitating the maintenance and spread of antibiotic resistance through co-selection53. Collectively, these findings shed light on the mechanisms underlying the transfer and persistence of redundant genes in hospital environments, highlighting the selective advantages they confer in colonization and pathogenicity.
Discussion
This study systematically elucidated the prevalence and characteristics of gene redundancy in prokaryotes by applying our integrated pipeline to distinct datasets of prokaryotic genomes and validating the role of gene redundancy in enhancing pathogenicity with microbial and in vivo experiments. The redundant genes were found to co-duplicate with translation initiation signals, likely to preserve functionality primarily related to adaptation, and exhibited an increasing trend over macroevolutionary scales. Notably, pathogens, with greater evolutionary divergence from common ancestors than non-pathogens, demonstrated a higher redundancy ratio. Additionally, our in-depth genomic time series analysis of A. baumannii isolates from multiple body sites of ICU patients, along with the investigation of hundreds of ECC strains, exemplified the important roles of redundancy in pathogens. Interestingly, HGT-introduced redundancy potentially augmented the advantageous gene dosage for virulence and resistance in clinical infections and niche specialization, as observed in our cases. These findings offer insights into the molecular mechanisms underlying the adaptation and infection of pathogens.
The prevalent gene redundancy confers increased gene dosage and activity of bacterial functions, especially those involved in adaptation to growth conditions or stresses such as antibiotic exposure and host immunity, as observed in the current study and in previously published ones54. Despite their high prevalence, redundant genes in prokaryotes are far less common than in eukaryotes. Owing to their simpler cellular organization, prokaryotes typically possess more compact and efficient genomes, with limited space to accommodate redundant genes55. To quantify this limitation, we performed linear regression analysis using the redundant gene numbers and genome sizes of 18,832 genomes (18,832/21,842, 86.2%) in Set-80 and 17,412 genomes in Set-100 (17,412/20,138, 86.5%). The number of redundant genes was positively correlated with genome size (Supplementary Fig. 2; Set-80: p-value < 0.001, r = 0.534; Set-100: p-value < 0.001, r = 0.246). Our additional findings on the increasing redundancy of genes during macroevolution indicated a higher rate of redundancy formation than loss, within the long-term fitness cost acceptable to bacteria. We further found that redundancy was not followed by frequent degeneration, as previously seen for the higher proportion of redundancy in eukaryotes15,37, but was maintained and potentially functional, as manifested by the coupling of redundant genes with translation initiation signals and their escape from nonsense mutations. In eukaryotes, to reduce fitness cost, less than half of the recently generated duplications could be successfully expressed in C. elegans, with most duplicated genes being deleted or degenerating into nonfunctional forms in plants over evolutionary time15. Alternative approaches by which prokaryotes mitigate the fitness costs associated with gene redundancy should be investigated in future research.
It is also worth mentioning that the co-duplication of coding sequences and translation initiation signals potentially represents an evolutionary strategy that allows prokaryotes to respond and adapt rapidly to changing environments, through functional links between gene products and their translation initiation signals during duplication and through the regulatory response to neofunctionalization and subfunctionalization after duplication.
In the context of a moderate genetic redundancy in prokaryotes, pathogens exhibit larger redundancy ratios than their non-pathogenic counterparts. This observation holds true when considering the entire bacterial kingdom and across most phyla in the current study. Orientia tsutsugamushi, commonly known for its high virulence56, stands out as the bacterium with the highest redundancy ratio. Based on our observed positive relationship between redundancy ratio and phylogenetic distance in macroevolution, we inferred that the greater redundancy generated in pathogens than non-pathogens is associated with their larger phylogenetic divergence from the common ancestor (0.59 vs. 0.54, p-value < 2.2 × 10⁻¹⁶). Thus, the higher redundancy ratio in pathogenic bacteria is not accidental or wasteful, but provides crucial genetic materials for adaptation. Redundancy, as a genetic alteration, occurs more rapidly than typical mutations and is regarded as the initial response to selective conditions in the growth environment of the bacterial population54. In this regard, mobile genetic elements, such as plasmids and gene islands, are important carriers of redundant genes54. Pathogens exist in microbial communities rather than in isolation, and they engage in frequent physical contact and exchange of genetic material with other microbes57. In the investigation of longitudinal A. baumannii isolates from multiple body sites, we identified redundancy in the ADH gene, frmA, introduced by plasmid-related gene islands specifically in urinary tract isolates and frequently co-located with a putative ADH gene, putative ZADH. The enhanced virulence associated with frmA redundancy was validated by biofilm assays and mouse peritoneal infection models. We observed that this advantageous increase in gene dosage was acquired during host infection progressing from the respiratory tract to the urinary tract, which potentially facilitated the invasive infection of A. baumannii strains57.
Our mouse infection models confirmed the impact of frmA redundancy on virulence, and bacterial transformation experiments with recombinant plasmids proved that frmA redundancy is the cause of enhanced biofilm mass.
Our analyses of ECC genomes additionally disclosed that ECC has acquired multiple resistance and virulence genes via HGT. Notably, we observed extensive gene exchanges with other Gram-negative bacteria, such as K. pneumoniae and E. coli in ESKAPE pathogens, which are common intestinal bacteria considered reservoirs of resistance and virulence genes. These exchanges include genes conferring resistance to heavy metals such as mercury, copper, and silver. Such heavy metal resistance genes can facilitate the maintenance and spread of antibiotic resistance through co-selection mechanisms, providing selective advantages for colonization and persistence within hospital environments and at the host/pathogen interface52. Our findings also emphasize that HGT-borne redundancy disseminates resistance and virulence across pathogens, serving as a key source of genetic adaptation for the co-existing microbes in microbial communities50,51.
Our study systematically examined recently formed and evolved redundant genes in prokaryotes and delineated their impacts on adaptation and pathogenicity. The sampling design, genomic analyses and experimental efforts on A. baumannii in ICU patients represent an initial validation of the advantage conferred by gene redundancy in clinical infections. Nevertheless, several limitations of our work should be acknowledged. First, the redundant genes included in this study are constrained by the absence of a standardized definition. Current studies commonly identify redundant genes using diverse standards, aligned with specific research objectives that focus on particular genes or species. Second, we did not distinguish between orthologs, paralogs, and xenologs58.
To enhance our comprehension of prokaryotic gene redundancy, future prospective investigations on its effects and evolutionary trajectories are imperative. Employing advanced molecular techniques, such as gene knockout, replacement, and complementation, is essential for establishing the causal impacts of redundant genes on bacterial phenotypes. Redundant genes, conferring enhancement on virulence or resistance, emerge as promising targets of genetic modification, offering innovative avenues for diagnosing and treating infectious diseases. Concurrently, modeling the dynamics involved in gaining, retaining, and altering redundant genes will deepen our insight into the adaptive evolution of pathogens, particularly those with highly plastic genomes and frequent recombination, as exemplified by A. baumannii and ECC. Through subsequent replication, verification, and refinement, our discoveries on gene redundancy will expand the framework for exploring the genetic mechanisms of evolution and clinical infection for pathogenic prokaryotes.
Methods
Collection of complete genome data of prokaryotes and eukaryotic model organisms
We downloaded all the complete genomes of prokaryotes (not including extra-chromosomal genetic elements) released as of July 2021 in the NCBI RefSeq database, encompassing 21,928 bacterial genomes and 382 archaeal genomes. We documented the RefSeq accession ID, taxonomy, GC content (%), genome size, and the number of coding sequences for each genome in Supplementary Data 1. We labeled the isolation source of each genome using the BacDive59 database and the annotation provided by Sheinman et al.60. We then manually screened and classified the microbes into six groups: Terrestrial (n = 265), Marine (n = 383), Freshwater (other aquatic environments, excluding marine, including freshwater, lake, river, sediment, and sludge; n = 295), Human (n = 297), Plant (n = 123), and Bird (n = 113). Groups containing fewer than 100 genomes were removed. Subsequently, we retrieved the pathogenicity information for the microbes from the KEGG database using the R package KEGGREST61 and the Mypathogen database62, as detailed in Supplementary Data 1 and available in the GitHub repository (https://github.com/Peihw/prokaryotes-g-e/tree/new/Source_data). The pathogenic and non-pathogenic annotations of E. coli were based on KEGG at the genome level. Isolates without clear documentation were categorized as “not determined,” as detailed in Supplementary Data 1.
We additionally downloaded the complete genomes of three eukaryotic model organisms: C. elegans (Accession ID in RefSeq: GCF_000002985.6), S. cerevisiae (from Saccharomyces Genome Database63) and A. thaliana (from Phytozome database64).
Collection, whole-genome sequencing, and genomic analysis of A. baumannii isolates
This study was part of a project entitled “The Role of Host Microbiome in the Pathogenesis and Prognosis of Severe Infection in Critically Ill Patients”. This project was reviewed and approved by the Research Ethics Committee of the First Affiliated Hospital, College of Medicine, Zhejiang University, China (Approval No. 2016-458-1). Written informed consent was obtained from all participants before enrolment, ensuring that they were fully informed about the collection of sputum, stool, urine, and blood samples for research purposes. We isolated A. baumannii within the ICUs of a tertiary hospital in Eastern China from July 2018 to January 2019. The strains were collected from sputum (n = 20), stool (n = 11), urine (n = 10) and blood (n = 12) of 44 patients within 24 hours of admission and during the hospitalization, as well as from environmental surfaces (n = 16). A. baumannii isolates were identified by biochemical methods and Matrix-Assisted Laser Desorption/Ionization Time-Of-Flight Mass Spectrometry (MALDI-TOF MS) (Bruker, Bremen, Germany) and were stored in cryopreservation tubes containing 20% glycerol broth at −80 °C for subsequent analysis. Subsequently, we extracted DNA using a Qiagen DNA purification kit (Qiagen, Hilden, Germany). After quality control (concentration and purity) and amplification, the PCR products were subjected to paired-end sequencing on the Illumina HiSeq 4000 PE150 platform (Illumina, San Diego, USA). SPAdes65 was used to assemble scaffolds from raw data in the FASTQ format. We then used Prokka (v1.13)66 to annotate the genes in the scaffolds. Gene islands in A. baumannii genomes were detected by IslandViewer467.
To elucidate the spatiotemporal relationships among A. baumannii isolates, we identified non-recombination cgSNVs of isolates using Gubbins68 and Snippy69, and then built an MST using CySpanningTree in Cytoscape70.
Collection and analysis of E. cloacae complex genome assembly
We retrieved assemblies for a total of 977 genomes of ECC from the NCBI RefSeq database released before 2018. ECC comprises multiple species that are indistinguishable using MALDI-TOF MS, the standard method for microbial identification in clinical laboratories71. In contrast, the average nucleotide identity (ANI) analysis based on whole genomes has been proven accurate in identifying microorganisms at the species level72,73. We calculated the ANI value between each genome and the type strain of each ECC species/subspecies (including E. kobei strain DSM 13645, E. asburiae strain ATCC 35953, E. ludwigii DSM16688T, E. cloacae subsp. cloacae ATCC 13047, E. cloacae subsp. dissolvens SDM, E. hormaechei subsp. hormaechei, E. hormaechei subsp. oharae strain DSM 16687, E. hormaechei subsp. steigerwaltii strain DSM 16691, E. xiangfangensis LMG27195, E. hormaechei, E. roggenkampii strain DSM 16690)17 using PYANI74. Finally, 898 ECC genomes with more than 95% ANI with the type strains were selected. According to the isolation source recorded by NCBI, these genomes were categorized into three groups: environment, clinical setting, and missing. The genomes from blood, urine, bodily fluid, or disease-related isolates were included in the clinical setting group (Supplementary Data 7). All the genomes were re-annotated using Prokka66.
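The ANI-based selection step can be sketched as follows. This is an illustrative assumption rather than the authors' procedure: it assigns each genome to the type strain with the highest ANI and keeps the assignment only when that ANI exceeds the 95% cutoff stated above. The function name `assign_species` and the example ANI values are hypothetical.

```python
from typing import Dict, Optional

ANI_THRESHOLD = 0.95  # species-level cutoff stated in the text


def assign_species(ani_to_type_strains: Dict[str, float],
                   threshold: float = ANI_THRESHOLD) -> Optional[str]:
    """Return the best-matching species, or None if no ANI exceeds the cutoff.

    Best-hit assignment is an assumption made for this sketch.
    """
    if not ani_to_type_strains:
        return None
    best = max(ani_to_type_strains, key=ani_to_type_strains.get)
    return best if ani_to_type_strains[best] > threshold else None


# Made-up ANI values for one genome against three type strains.
example = {"E. hormaechei": 0.987, "E. cloacae": 0.912, "E. kobei": 0.905}
print(assign_species(example))
```

Genomes for which the function returns `None` would correspond to the assemblies excluded from the 898 retained genomes.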
Furthermore, we identified the redundant genes of ECC acquired through HGT events between distant bacterial taxa by detecting long exact sequence matches shared by pairs of genomes belonging to different genera. The methodology for HGT identification was proposed by Sheinman et al.60. We obtained the taxonomic information and whole genome data of 456 human gut bacterial strains from HMP (Supplementary Data 9). Subsequently, we aligned ECC genomes with 453 strains from other genera and screened for long exact matches, using a 300-bp length threshold, to represent recent HGT events60. We further identified the potential gene islands in ECC genomes using IslandViewer467. Plasmid annotation was performed by aligning contigs with PlasmidFinder75 and two plasmids containing the mer operon in E. hormaechei (CP008824.1 and CP008825.1).
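The idea of screening genome pairs for long exact matches can be illustrated with a toy sketch. It is not the implementation of Sheinman et al. and is far too memory-hungry for real genomes; it only shows that two sequences share an exact match of at least `min_len` bases exactly when they share a substring of length `min_len`. Treating the threshold as "at least `min_len`" and checking both strands are assumptions of this sketch, and the function names are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def shares_long_exact_match(genome_a: str, genome_b: str,
                            min_len: int = 300) -> bool:
    """True if the sequences share an exact match of >= min_len bases.

    Both strands of genome_b are checked (an assumption of this sketch).
    """
    if len(genome_a) < min_len or len(genome_b) < min_len:
        return False
    # Index every substring of length min_len in genome_a.
    kmers_a = {genome_a[i:i + min_len]
               for i in range(len(genome_a) - min_len + 1)}
    for strand in (genome_b, reverse_complement(genome_b)):
        for i in range(len(strand) - min_len + 1):
            if strand[i:i + min_len] in kmers_a:
                return True
    return False


# Toy sequences with a 7-base shared segment ("ATTACAG").
print(shares_long_exact_match("GATTACAGATTACA", "CCCCATTACAGCCC", min_len=7))
```

Production tools typically use suffix arrays or seed-and-extend aligners for this step rather than a substring set.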
The SoDpipe pipeline for identifying redundant genes and their translational signals
We developed a reproducible pipeline, SoDpipe, integrating multiple bioinformatics tools to aid in identifying redundant genes and their translation initiation signals. In the pipeline, we adopted the reliable “all against all” BLAST search across all RefSeq annotated protein-coding genes within each genome to identify redundant genes with BLASTClust (E-value threshold of 10⁻⁵). SoDpipe also includes MMseqs276 and CD-HIT77 options, providing a “faster” mode for users.
Using different settings of identity and coverage, we obtained two datasets: Set-100 and Set-80. In Set-100, genes were clustered within a genome with 100% amino acid sequence identity and 95% sequence coverage. In contrast, the 80% identity and 80% coverage thresholds were used for Set-80. Genes clustered in Set-100, with high similarity, are likely to represent recent duplicate genes, whereas genes in Set-80 may indicate redundant genes formed and evolved within a short period or horizontally transferred homologs. The redundancy ratio was defined as the number of redundant genes divided by the total number of protein-coding genes in a given prokaryotic genome.
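As a minimal sketch of the redundancy ratio defined above, the following function takes a per-genome mapping from gene to cluster and counts a gene as redundant when its cluster has at least two members. The input format and the function name `redundancy_ratio` are assumptions for illustration; SoDpipe's actual output format is not described here.

```python
from collections import Counter
from typing import Dict


def redundancy_ratio(gene_to_cluster: Dict[str, str]) -> float:
    """Fraction of protein-coding genes in clusters with >= 2 members."""
    if not gene_to_cluster:
        return 0.0
    cluster_sizes = Counter(gene_to_cluster.values())
    redundant = sum(
        1 for cluster in gene_to_cluster.values() if cluster_sizes[cluster] >= 2
    )
    return redundant / len(gene_to_cluster)


# Toy genome: two genes share cluster c1, the other two are singletons.
toy = {"g1": "c1", "g2": "c1", "g3": "c2", "g4": "c3"}
print(redundancy_ratio(toy))  # 2 redundant genes out of 4
```

The same function applies to Set-100 and Set-80; only the clustering thresholds that produce `gene_to_cluster` differ.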
To understand the redundant genes in terms of their expression, we integrated our previously published tools or algorithms78,79 to identify the translation initiation signals of each gene. TriTISA, our de novo tool, was designed to determine the translation initiation site (TIS) for each CDS80. It first classifies all candidate TISs, referring to in-frame start-codon-like triplets (namely, ATG, CTG, GTG, or TTG) within the open reading frame, into three categories (true TISs, false TISs upstream of the true TIS and false TISs downstream of the true TIS) based on the category-specific evolutionary properties. It then characterizes the statistical properties of candidate TISs in each category using non-homogeneous n-th-order Markov models and calculates the Bayesian probability of being a true TIS for each candidate TIS. TISs selected in this study were filtered using a probability threshold of 0.7 and a shift within ±36 nt.
We then identified the signals that are located upstream of TISs and play a role in recruiting mRNA to initiate translation. Specifically, we identified candidate signals within regions of 20 or 30 bp upstream of TISs of all bacterial or archaeal genes using the expectation-maximization algorithm for parameter estimation. We subsequently classified the signals by comparing their position weight matrix (PWM) models with the standard models of six types of signals summarized in our ProTISA database, which is specifically designed for collecting TIS data. Consequently, a gene can be annotated with its translation initiation signal or marked as having no signal (see Supplementary Information for details).
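The first step described above, enumerating candidate TISs as in-frame start-codon-like triplets, can be sketched in a few lines. This is not TriTISA itself (the classification and Markov-model scoring are omitted), and it assumes the input string is an open reading frame whose first base is in frame; the function name `candidate_tis_positions` is hypothetical.

```python
from typing import List

START_LIKE = {"ATG", "CTG", "GTG", "TTG"}  # triplets listed in the text


def candidate_tis_positions(orf: str) -> List[int]:
    """0-based offsets of in-frame start-codon-like triplets in an ORF."""
    orf = orf.upper()
    return [i for i in range(0, len(orf) - 2, 3) if orf[i:i + 3] in START_LIKE]


# Codons: ATG AAA CTG TTG TAA
print(candidate_tis_positions("ATGAAACTGTTGTAA"))  # [0, 6, 9]
```

Each returned offset would then be scored as a true or false TIS by the probabilistic model described in the paragraph above.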
Phylogenetic distance calculation and molecular evolutionary analysis
To investigate the putative functions of redundant genes, we aligned their gene or protein sequences against COG25, DEG28, Swiss-Prot81, VFDB82, and CARD83 databases using BLASTP with an E-value cutoff of 10⁻⁵ and identity cutoff of 30%.
We further explored the fate of redundant genes. On the one hand, we analyzed the trends in the number of redundant genes during the macroevolution of prokaryotes. We downloaded the representative phylogenetic tree constructed using 16S rRNA gene sequences of all sequenced type strains of archaeal and bacterial species from SILVA (https://imedea.uib-csic.es/mmg/ltp/ltp-2020/)35. The tree encompassed 3,377 species with 16,112 genomes included in this study, while other species were not covered due to the absence of type strains. We then calculated the phylogenetic distance from each species (leaf node) to the root (the intermediate node between bacteria and archaea). We subsequently fitted linear models, using the lm function in R, to analyze the relationship between the phylogenetic distance and the redundancy ratio. On the other hand, we identified the pseudogenized or non-functional redundant genes as those whose coding sequences were defective due to premature stop codons, or which lacked the signal sequences needed for translation initiation. We then classified each redundant cluster as functional, semi-functional, or non-functional by calculating the proportion of non-functional genes within it.
In the molecular evolutionary analyses, we first aligned protein sequences within each redundant gene cluster using Clustal Omega84 and converted them into the corresponding codon alignment using PAL2NAL85. We then calculated the synonymous (ds) and nonsynonymous (dn) substitution rates using the codeml program in PAML86. Simultaneously, we quantified the dissimilarities between TIS signals for any two redundant genes using the Tomtom algorithm. Subsequently, we calculated the Pearson correlations between the TIS signal dissimilarity and the substitution rates of the coding region using the cor.test function in R, to analyze the regulatory response during the mutation of redundant genes.
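The root-to-leaf distance calculation and the linear fit can be illustrated with a small, dependency-free sketch. The study used the lm function in R; the Python below is only an equivalent illustration under simplifying assumptions: the tree is given as a hypothetical child-to-parent map with branch lengths, and the toy values are invented for demonstration.

```python
from typing import Dict, List, Optional, Tuple

# child -> (parent, branch length); the root maps to (None, 0.0)
Tree = Dict[str, Tuple[Optional[str], float]]


def root_to_tip_distance(tree: Tree, leaf: str) -> float:
    """Sum branch lengths along the path from a leaf up to the root."""
    total = 0.0
    node: Optional[str] = leaf
    while node is not None:
        parent, length = tree[node]
        total += length
        node = parent
    return total


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy tree and toy redundancy ratios (illustrative values only).
toy_tree: Tree = {
    "root": (None, 0.0),
    "A": ("root", 0.1),
    "B": ("A", 0.2),
    "C": ("root", 0.5),
}
distances = [root_to_tip_distance(toy_tree, leaf) for leaf in ("B", "C")]
slope, intercept = fit_line(distances, [0.01, 0.02])
print(distances, slope, intercept)
```

With real data, `distances` would hold one value per species in the SILVA tree and the second argument of `fit_line` the corresponding redundancy ratios.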
Mouse peritoneal infection models
We injected bacterial suspensions into the peritoneal cavity of eight-week-old female BALB/c mice. Each mouse received an injection of 100 μL containing 1 × 10⁶ CFU of the respective strain, with five mice being injected for each strain. The physical condition of each mouse was monitored every 12 hours. All surviving mice were euthanized at 7 days post-injection. All animal experiments were conducted in accordance with the guidelines of the Institutional Animal Care and Ethics Committee at The First Affiliated Hospital of Zhejiang University, School of Medicine.
Biofilm quantitative assay
The Biofilm Formation Assay Kit (Dojindo, B601) was used to evaluate biofilm formation in bacterial strains. A single colony of each strain was inoculated into Luria-Bertani (LB) broth and cultured overnight at 37 °C. The bacterial suspension was then diluted to a turbidity of 0.5 McFarland. Then, 180 μL of the bacterial suspension from each strain was transferred to a sterile 96-well polystyrene plate, and a 96-peg lid was placed on the plate. The plate was incubated at 37 °C for 48 hours, allowing biofilm to form on the surface of the needle-like projections. Three replicates were prepared for each strain, with sterile LB broth used as the negative control. After incubation, the 96-peg lid was dipped and washed twice in a new 96-well polystyrene plate filled with sterilized physiological saline. The lid was then immersed in a staining solution containing crystal violet and left to stand for 30 minutes at room temperature. After staining, it was dipped and washed twice with sterilized physiological saline. To quantify biofilm formation, the 96-peg lid was immersed in 200 μL of absolute ethanol and left to stand for 15 minutes at room temperature to dissolve the crystal violet. The optical density (OD) was measured at 595 nm using a microplate reader. All determinations were made in at least three independent experiments for each group.
Plasmid construction and transformation
The pYMAb2-Hyg plasmid was selected for recombination experiments (Supplementary Fig. 7a)87. Given the strain’s resistance to hygromycin, the apramycin resistance gene (apmR) was amplified by PCR from the pCAP03-acc(3)IV template, and the pYMAb2-Hyg plasmid was linearized by PCR. The resulting fragments were purified using the SteadyPure PCR Purification Kit (Accurate Biology, AG21003) and subsequently recombined using the ClonExpress Ultra One Step Cloning Kit (Vazyme, C115-01) to replace the hygromycin resistance gene (hygR) with apmR, generating the plasmid pYMAb2-ApmR.
The extra copy of frmA gene, including its promoter region, was amplified from the wild-type double-copy strain A. baumannii S12 and cloned into pYMAb2-ApmR to construct pYMAb2-frmA(ApmR). This plasmid was then introduced into A. baumannii S41 and A. baumannii S65 via electroporation. Positive transformants were confirmed by PCR and Sanger sequencing, and the validated strains were designated A. baumannii S41/pYMAb2-frmA(ApmR) and A. baumannii S65/pYMAb2-frmA(ApmR).
Additionally, we constructed a plasmid in which the promoter of the highly conserved gene ompA replaced the original promoter88, resulting in pYMAb2-PompA(ApmR), which produced a higher expression phenotype (Supplementary Fig. 7b). This plasmid was introduced into A. baumannii S65, resulting in the strain A. baumannii S65/pYMAb2-PompA::frmA(ApmR). All strains and plasmids used in this study are listed in Table 2, and the primers are provided in Supplementary Data 2. Primers were synthesized by Sangon Biotech Co., Ltd. (Shanghai, China).
Quantitative reverse transcription PCR
We refer to the conserved single-copy frmA present in all strains as endogenous frmA, while the acquired redundant copy is designated as exogenous plasmid-borne frmA. Quantitative reverse transcription PCR was performed to measure the expression levels of the endogenous frmA copy and exogenous plasmid-borne frmA copy in the strains A. baumannii S41, A. baumannii S65, A. baumannii S41/pYMAb2-frmA(ApmR), A. baumannii S65/pYMAb2-frmA(ApmR) and A. baumannii S65/pYMAb2-PompA::frmA(ApmR). Total RNA was extracted using the AFTSpin Bacterial Fast RNA Extraction Kit (ABclonal Technology, RK30123), and reverse transcription was carried out using the Evo M-MLV Reverse Transcription Kit (Accurate Biology, AG11706). Quantitative PCR was performed using the TB Green® Premix Ex Taq™ (Takara, RR420Q) on a Bio-Rad CFX96 instrument. Gene expression was calculated using the 2^(−ΔΔCt) method89, with the rplB gene as the reference gene. Three independent RT-qPCR experiments were conducted, each using newly extracted RNA. The primers used are listed in Supplementary Data 2.
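The comparative Ct calculation can be written out as a short Python sketch. It assumes the standard form of the method, in which the target Ct is first normalized to the reference gene (rplB here) and then compared with a calibrator condition; which strain serves as the calibrator is an assumption of this example, and the Ct values are invented.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_calibrator, ct_ref_calibrator):
    """Fold change by the comparative Ct method: 2 ** -(ddCt).

    dCt  = Ct(target) - Ct(reference gene), computed per condition
    ddCt = dCt(sample) - dCt(calibrator)
    """
    d_ct_sample = ct_target_sample - ct_ref_sample
    d_ct_calibrator = ct_target_calibrator - ct_ref_calibrator
    dd_ct = d_ct_sample - d_ct_calibrator
    return 2.0 ** (-dd_ct)


# Invented Ct values: frmA and rplB in a transformed strain (sample)
# and in its parental strain (calibrator, an assumption of this sketch).
fold_change = relative_expression(
    ct_target_sample=20.0, ct_ref_sample=15.0,
    ct_target_calibrator=22.0, ct_ref_calibrator=15.0,
)
print(fold_change)
```

With these invented values, the sample dCt is 5 and the calibrator dCt is 7, so ddCt is −2 and the fold change is 4.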
Statistical analysis
Discrete variables were tested by the two-tailed Fisher’s exact test. Paired and unpaired continuous variables, represented by mean ± standard deviation (SD), were tested by the two-tailed Wilcoxon signed-rank and rank-sum tests, respectively. We performed the above statistical tests using MetaComp90. Results were considered statistically significant when the p-value was < 0.05 or the false discovery rate (FDR) adjusted p-value was < 0.05.
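The text does not state which FDR procedure was applied. As one common choice, the Benjamini-Hochberg adjustment is sketched below in Python; this is an assumption made for illustration, not a description of what MetaComp does, and the p-values are invented.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented raw p-values from four hypothetical tests.
raw = [0.01, 0.04, 0.03, 0.20]
adj = bh_adjust(raw)
significant = [p < 0.05 for p in adj]
print(adj, significant)
```

In this invented example, three raw p-values fall below 0.05 but only the smallest remains below 0.05 after adjustment.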
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Acknowledgements
This work was supported by the National Key Research and Development Program of China (2021YFC2300300 to Y.X. and H.Z., 2017YFC1200205 to H.Z. and Y.X.) and the National Natural Science Foundation of China (32070667 to H.Z., 82202588 to T.X., 32300078 to X.J.). Part of the analysis was performed on the High-Performance Computing Platform of Peking University.
Author contributions
HQZ supervised the study. PHW, QG, XQJ and HQZ designed the study. PHW performed the major analyses on all prokaryotes and ECC. QG and MRC performed the analyses of A. baumannii. PL conducted the plasmid construction and transformation experiments. LSY improved translation initiation signal annotation. TTX and YHX collected A. baumannii samples. XQJ, QG, PHW, PL, MRC and HQZ wrote and revised the manuscript. ML, CHW, and all the authors proofread and improved it.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Data availability
Accessions for publicly available genomic data in the RefSeq database are given in Supplementary Data 1. For the A. baumannii isolates sampled and sequenced in this study, the genomic data for strains S1 to S62 are available in NCBI under accession code PRJNA1124179, and the data for strains S63 to S69 can be accessed in NCBI under accession code PRJNA1120174 (detailed BioSample IDs are given in Supplementary Data 5). Source data are provided as a Source Data file and in our GitHub repository [https://github.com/Peihw/prokaryotes-g-e/tree/new/Source_data]. Other accession codes listed in this study are as follows: CP008824.1 [https://www.ncbi.nlm.nih.gov/nuccore/CP008824.1], CP008825.1 [https://www.ncbi.nlm.nih.gov/nuccore/CP008825.1], AP024709.1 [https://www.ncbi.nlm.nih.gov/nuccore/AP024709.1], GCF_000002985.6 [https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000002985.6/], GCF_000724505 [https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/724/505/].
Code availability
The code for SoDpipe91 and bioinformatics analyses in this paper can be found at https://github.com/Peihw/prokaryotes-g-e.
Competing interests
The authors have declared that no competing interests exist.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Peihong Wang, Qian Guo, Xiaoqing Jiang, Ping Lu.
Contributor Information
Yonghong Xiao, Email: xiao-yonghong@163.com.
Huaiqiu Zhu, Email: hqzhu@pku.edu.cn.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-025-65840-7.
References
- 1. Láruson, ÁJ., Yeaman, S. & Lotterhos, K. E. The Importance of Genetic Redundancy in Evolution. Trends Ecol. Evol. 35, 809–822 (2020).
- 2. Coe, B. P. et al. Neurodevelopmental disease genes implicated by de novo mutation and copy number variation morbidity. Nat. Genet 51, 106–116 (2019).
- 3. Tian, L. et al. Deciphering functional redundancy in the human microbiome. Nat. Commun. 11, 6217 (2020).
- 4. Louca, S. et al. Function and functional redundancy in microbial systems. Nat. Ecol. Evol. 2, 936–943 (2018).
- 5. Fajardo, D., Saint Jean, R. & Lyons, P. J. Acquisition of new function through gene duplication in the metallocarboxypeptidase family. Sci. Rep. 13, 2512 (2023).
- 6. Qian, W. & Zhang, J. Genomic evidence for adaptation by gene duplication. Genome Res 24, 1356–1362 (2014).
- 7. Bratlie, M. S. et al. Gene duplications in prokaryotes can be associated with environmental adaptation. BMC Genomics 11, 588 (2010).
- 8. diCenzo, G. C. & Finan, T. M. Genetic redundancy is prevalent within the 6.7 Mb Sinorhizobium meliloti genome. Mol. Genet Genomics 290, 1345–1356 (2015).
- 9. Maddamsetti, R. et al. Duplicated antibiotic resistance genes reveal ongoing selection and horizontal gene transfer in bacteria. Nat. Commun. 15, 1449 (2024).
- 10. Rodríguez-Beltrán, J., DelaFuente, J., Leon-Sampedro, R., MacLean, R. C. & San Millan, A. Beyond horizontal gene transfer: the role of plasmids in bacterial evolution. Nat. Rev. Microbiol. 19, 347–359 (2021).
- 11. Brakhage, A. A. et al. Aspects on evolution of fungal beta-lactam biosynthesis gene clusters and recruitment of trans-acting factors. Phytochemistry 70, 1801–1811 (2009).
- 12. Zhang, J. Z. Evolution by gene duplication: an update. Trends Ecol. Evolution 18, 292–298 (2003).
- 13. Birchler, J. A. & Yang, H. The multiple fates of gene duplications: deletion, hypofunctionalization, subfunctionalization, neofunctionalization, dosage balance constraints, and neutral variation. Plant Cell 34, 2466–2474 (2022).
- 14. Sheridan, P. O. et al. Gene duplication drives genome expansion in a major lineage of Thaumarchaeota. Nat. Commun. 11, 5494 (2020).
- 15. Panchy, N., Lehti-Shiu, M. & Shiu, S. H. Evolution of gene duplication in plants. Plant Physiol. 171, 2294–2316 (2016).
- 16. Soucy, S. M., Huang, J. & Gogarten, J. P. Horizontal gene transfer: building the web of life. Nat. Rev. Genet. 16, 472–482 (2015).
- 17. Chavda, K. D. et al. Comprehensive genome analysis of carbapenemase-producing Enterobacter spp.: new insights into phylogeny, population structure, and resistance mechanisms. mBio 7 (2016).
- 18. Bernabeu, M. et al. Gene duplications in the E. coli genome: common themes among pathotypes. BMC Genomics 20, 313 (2019).
- 19. Xu, Y. et al. Whole genome sequencing revealed host adaptation-focused genomic plasticity of pathogenic Leptospira. Sci. Rep. 6, 20020 (2016).
- 20. Shi, X., Xia, Y., Wei, W. & Ni, B. J. Accelerated spread of antibiotic resistance genes (ARGs) induced by non-antibiotic conditions: Roles and mechanisms. Water Res. 224, 119060 (2022).
- 21. Mulani, M. S., Kamble, E. E., Kumkar, S. N., Tawre, M. S. & Pardesi, K. R. Emerging strategies to combat ESKAPE pathogens in the era of antimicrobial resistance: a review. Front Microbiol 10, 539 (2019).
- 22. Thompson, T. The staggering death toll of drug-resistant bacteria. Nature, 10.1038/d41586-022-00228-x (2022).
- 23. Oliveira, P. H., Touchon, M. & Rocha, E. P. The interplay of restriction-modification systems with mobile genetic elements and their prokaryotic hosts. Nucleic Acids Res. 42, 10618–10631 (2014).
- 24. Liao, H. et al. Metagenomic and viromic analysis reveal the anthropogenic impacts on the plasmid and phage borne transferable resistome in soil. Environ. Int 170, 107595 (2022).
- 25. Galperin, M. Y., Makarova, K. S., Wolf, Y. I. & Koonin, E. V. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic acids Res. 43, D261–D269 (2015).
- 26. Zhang, Y. et al. Specific adaptation of Ustilaginoidea virens in occupying host florets revealed by comparative and functional genomics. Nat. Commun. 5, 3849 (2014).
- 27. Mikalsen, T. et al. Investigating the mobilome in clinically important lineages of Enterococcus faecium and Enterococcus faecalis. BMC Genomics 16, 282 (2015).
- 28. Luo, H. et al. DEG 15, an update of the database of essential genes that includes built-in analysis tools. Nucleic Acids Res 49, D677–D686 (2021).
- 29. Arun, P. V. et al. Identification and functional analysis of essential, conserved, housekeeping and duplicated genes. FEBS Lett. 590, 1428–1437 (2016).
- 30. Niu, Y., Zhang, R. & Yuan, J. Flagellar motors of swimming bacteria contain an incomplete set of stator units to ensure robust motility. Sci. Adv. 9, eadi6724 (2023).
- 31. Moore, J. K., Doney, S. C., Glover, D. M. & Fung, I. Y. Iron cycling and nutrient-limitation patterns in surface waters of the world ocean. Deep Sea Res. Part II: Topical Stud. Oceanogr. 49, 463–507 (2001).
- 32. Bulyha, I., Hot, E., Huntley, S. & Søgaard-Andersen, L. GTPases in bacterial cell polarity and signalling. Curr. Opin. Microbiol. 14, 726–733 (2011).
- 33. Weber, M. H. & Marahiel, M. A. Bacterial cold shock responses. Sci. Prog. 86, 9–75 (2003).
- 34. Parsons, J. B. & Rock, C. O. Bacterial lipids: metabolism and membrane homeostasis. Prog. Lipid Res 52, 249–276 (2013).
- 35. Ludwig, W. et al. Release LTP_12_2020, featuring a new ARB alignment and improved 16S rRNA tree for prokaryotic type strains. Syst. Appl Microbiol 44, 126218 (2021).
- 36. Chu, X., Li, S., Wang, S., Luo, D. & Luo, H. Gene loss through pseudogenization contributes to the ecological diversification of a generalist Roseobacter lineage. ISME J. 15, 489–502 (2021).
- 37. Lynch, M. & Conery, J. S. The evolutionary fate and consequences of duplicate genes. Science 290, 1151–1155 (2000).
- 38. Liu, Y., Harrison, P. M., Kunin, V. & Gerstein, M. Comprehensive analysis of pseudogenes in prokaryotes: widespread gene decay and failure of putative horizontally transferred genes. Genome Biol. 5, R64 (2004).
- 39. Wang, Y. et al. Phylogenomics of expanding uncultured environmental Tenericutes provides insights into their pathogenicity and evolutionary relationship with Bacilli. BMC Genomics 21, 408 (2020).
- 40. Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L. & Noble, W. S. Quantifying similarity between motifs. Genome biology 8 (2007).
- 41. Ibrahim, M. E. Prevalence of Acinetobacter baumannii in Saudi Arabia: risk factors, antimicrobial resistance patterns and mechanisms of carbapenem resistance. Ann. Clin. Microbiol Antimicrob. 18, 1 (2019).
- 42. Xu, T. et al. Frequent convergence of mcr-9 and carbapenemase genes in Enterobacter cloacae complex driven by epidemic plasmids and host incompatibility. Emerg. Microbes Infect. 11, 1959–1972 (2022).
- 43. Zhang, K. et al. Alcohol dehydrogenase modulates quorum sensing in biofilm formations of Acinetobacter baumannii. Micro. Pathog. 148, 104451 (2020).
- 44. Huang, H. et al. Complete sequence of pABTJ2, a plasmid from Acinetobacter baumannii MDR-TJ, carrying many phage-like elements. Genomics Proteom. Bioinforma. 12, 172–177 (2014).
- 45. Dong, X. et al. Whole-genome sequencing-based species classification, multilocus sequence typing, and antimicrobial resistance mechanism analysis of the Enterobacter cloacae complex in southern china. Microbiol Spectr. 10, e0216022 (2022).
- 46. Aoki, K. et al. Molecular Characterization of IMP-1-producing Enterobacter cloacae complex isolates in tokyo. Antimicrob Agents Chemother 62, 10.1128/aac.02091-17 (2018).
- 47. Liang, Y. J., Gao, Z. Q., Dong, Y. H. & Liu, Q. S. Structural and functional analysis show that the Escherichia coli uncharacterized protein YjcS is likely an alkylsulfatase. Protein Sci. 23, 1442–1450 (2014).
- 48. Li, X. et al. Plasmid genomes reveal the distribution, abundance, and organization of mercury-related genes and their co-distribution with antibiotic resistant genes in Gammaproteobacteria. Genes (Basel) 13 (2022).
- 49. Kuzmin, E., Taylor, J. S. & Boone, C. Retention of duplicated genes in evolution. Trends Genet 38, 59–72 (2022).
- 50. Salyers, A. A., Gupta, A. & Wang, Y. Human intestinal bacteria as reservoirs for antibiotic resistance genes. Trends Microbiol 12, 412–416 (2004).
- 51. Holt, K. E. et al. Genomic analysis of diversity, population structure, virulence, and antimicrobial resistance in Klebsiella pneumoniae, an urgent threat to public health. Proc. Natl Acad. Sci. USA 112, E3574–E3581 (2015).
- 52. Virieux-Petit, M., Hammer-Dedet, F., Aujoulat, F., Jumas-Bilak, E. & Romano-Bertrand, S. From copper tolerance to resistance in Pseudomonas aeruginosa towards patho-adaptation and hospital success. Genes (Basel) 13 (2022).
- 53. Pal, C. et al. Metal resistance and its association with antibiotic resistance. Adv. Micro. Physiol. 70, 261–313 (2017).
- 54. Sandegren, L. & Andersson, D. I. Bacterial gene amplification: implications for the evolution of antibiotic resistance. Nat. Rev. Microbiol 7, 578–588 (2009).
- 55. Kirchberger, P. C., Schmidt, M. L. & Ochman, H. The ingenuity of bacterial genomes. Annu Rev. Microbiol 74, 815–834 (2020).
- 56. Taylor, A. J., Paris, D. H. & Newton, P. N. A Systematic review of mortality from untreated scrub typhus (Orientia tsutsugamushi). PLoS Negl. Trop. Dis. 9, e0003971 (2015).
- 57. Weiland-Bräuer, N. Friends or foes-microbial interactions in nature. Biology (Basel) 10 (2021).
- 58. Koonin, E. V. Orthologs, paralogs, and evolutionary genomics. Annu Rev. Genet 39, 309–338 (2005).
- 59. Reimer, L. C. et al. BacDive in 2022: the knowledge base for standardized bacterial and archaeal data. Nucleic Acids Res 50, D741–D746 (2022).
- 60. Sheinman, M. et al. Identical sequences found in distant genomes reveal frequent horizontal transfer across the bacterial domain. Elife 10 (2021).
- 61. Tenenbaum, D., Maintainer, B. KEGGREST: Client-side REST access to the kyoto encyclopedia of genes and genomes (KEGG). https://bioconductor.org/packages/KEGGREST (2025).
- 62. Zhang, T., Miao, J., Han, N., Qiang, Y. & Zhang, W. MPD: a pathogen genome and metagenome database. Database (Oxford) 2018, 10.1093/database/bay055 (2018).
- 63. Cherry, J. M. et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic acids Res. 40, D700–D705 (2012).
- 64. Goodstein, D. M. et al. Phytozome: a comparative platform for green plant genomics. Nucleic acids Res. 40, D1178–D1186 (2012).
- 65. Deng, M., Jiang, R., Sun, F. & Zhang, X. Research in computational molecular biology: 17th annual international conference, RECOMB 2013, Beijing, China, April 7-10, 2013. Proceedings. (Springer Berlin Heidelberg, 2013).
- 66. Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).
- 67. Bertelli, C. et al. IslandViewer 4: expanded prediction of genomic islands for larger-scale datasets. Nucleic Acids Res 45, W30–W35 (2017).
- 68. Croucher, N. J. et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic acids research 43 (2015).
- 69. Torsten, S. Snippy. https://github.com/tseemann/snippy (2020).
- 70. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13, 2498–2504 (2003).
- 71. Ji, Y. et al. Development of a one-step multiplex PCR assay for differential detection of four species (Enterobacter cloacae, Enterobacter hormaechei, Enterobacter roggenkampii, and Enterobacter kobei) belonging to Enterobacter cloacae complex with clinical significance. Front Cell Infect. Microbiol 11, 677089 (2021).
- 72. Ciufo, S. et al. Using average nucleotide identity to improve taxonomic assignments in prokaryotic genomes at the NCBI. Int J. Syst. Evol. Microbiol 68, 2386–2392 (2018).
- 73. Xiao, T. et al. Comparative respiratory tract microbiome between carbapenem-resistant Acinetobacter baumannii colonization and ventilator associated pneumonia. Front Microbiol 13, 782210 (2022).
- 74. Pritchard, L., Glover, R. H., Humphris, S., Elphinstone, J. G. & Toth, I. K. Genomics and taxonomy in diagnostics for food security: soft-rotting enterobacterial plant pathogens. Anal. Methods-Uk 8, 12–24 (2016).
- 75. Carattoli, A. et al. In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing. Antimicrob. Agents Chemother. 58, 3895–3903 (2014).
- 76. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
- 77. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
- 78. Hu, G. Q. et al. ProTISA: a comprehensive resource for translation initiation site annotation in prokaryotic genomes. Nucleic acids Res. 36, D114–D119 (2007).
- 79. Zheng, X. B., Hu, G. Q., She, Z. S. & Zhu, H. Q. Leaderless genes in bacteria: clue to the evolution of translation initiation mechanisms in prokaryotes. BMC Genomics 12, 361 (2011).
- 80. Hu, G. Q., Zheng, X. B., Zhu, H. Q. & She, Z. S. Prediction of translation initiation site for microbial genomes with TriTISA. Bioinformatics 25, 123–125 (2009).
- 81. Famiglietti, M. L. et al. An enhanced workflow for variant interpretation in UniProtKB/Swiss-Prot improves consistency and reuse in ClinVar. Database 2019, baz040 (2019).
- 82. Chen, L., Zheng, D., Liu, B., Yang, J. & Jin, Q. VFDB 2016: hierarchical and refined dataset for big data analysis-10 years on. Nucleic acids Res. 44, D694–D697 (2016).
- 83. Jia, B. et al. CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic acids Res. 45, D566–D573 (2017).
- 84. Sievers, F. & Higgins, D. G. Clustal Omega for making accurate alignments of many protein sequences. Protein Sci. 27, 135–145 (2018).
- 85. Suyama, M., Torrents, D. & Bork, P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic acids Res. 34, W609–W612 (2006).
- 86. Yang, Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 13, 555–556 (1997).
- 87. Zhang, L. et al. Phenotypic variation and carbapenem resistance potential in OXA-499-producing Acinetobacter pittii. Front Microbiol 11, 1134 (2020).
- 88. Iyer, R., Moussa, S. H., Durand-Réville, T. F., Tommasi, R. & Miller, A. Acinetobacter baumannii OmpA is a selective antibiotic permeant porin. ACS Infect. Dis. 4, 373–381 (2018).
- 89. Schmittgen, T. D. & Livak, K. J. Analyzing real-time PCR data by the comparative CT method. Nat. Protoc. 3, 1101–1108 (2008).
- 90. Zhai, P. et al. MetaComp: comprehensive analysis software for comparative meta-omics including comparative metagenomics. BMC Bioinforma. 18, 434 (2017).
- 91. Wang, P. H. et al. Deciphering gene redundancy in prokaryotic genomes provides evolutionary insights for pathogenicity and its roles in clinical infections. Peihw/prokaryotes-g-e, 10.5281/zenodo.17206586 (2025).