Microbial Genomics. 2023 May 25;9(5):mgen001021. doi: 10.1099/mgen.0.001021

Challenges in prokaryote pangenomics

Gerry Tonkin-Hill, Jukka Corander, Julian Parkhill
PMCID: PMC10272878  PMID: 37227251

Abstract

Horizontal gene transfer (HGT) and the resulting patterns of gene gain and loss are a fundamental part of bacterial evolution. Investigating these patterns can help us to understand the role of selection in the evolution of bacterial pangenomes and how bacteria adapt to a new niche. Predicting the presence or absence of genes can be a highly error-prone process that can confound efforts to understand the dynamics of horizontal gene transfer. This review discusses both the challenges in accurately constructing a pangenome and the potential consequences errors can have on downstream analyses. We hope that by summarizing these issues researchers will be able to avoid potential pitfalls, leading to improved bacterial pangenome analyses.

Keywords: pangenome, bacteria, gene annotation, horizontal gene transfer

Introduction

Prokaryotic species exhibit remarkable variation in the gene content of individual genomes at both the species and lineage levels. Following early observations that a set of Escherichia coli genomes shared only a fraction of their genes, larger population studies led to the concept of the prokaryotic ‘pangenome’, which refers to the entire collection of genes found within a species [1, 2]. The diversity in gene content and the evolution of pangenomes are driven vertically by gene duplication and gene fusion/fission, and horizontally by the transfer of DNA through a variety of mechanisms, including direct contact between bacterial cells and the uptake of DNA from the environment. Horizontal gene transfer (HGT) is facilitated by mobile genetic elements (MGEs) such as insertion sequences (ISs), transposons, integrative conjugative elements (ICEs), integrative mobilizable elements (IMEs), plasmids and phages [3]. The dynamics of pangenome diversity play a central role in the evolution of prokaryotes, including in niche adaptation, competition within and between species, and in the case of pathogens, the development and maintenance of antimicrobial resistance, virulence and vaccine evasion [4–6].

Major barriers to understanding these dynamics are introduced by errors in the automated annotation, clustering and classification of orthologous and paralogous genes (Fig. 1) [7–10]. Similar to genome-wide association studies (GWASs) and phylodynamic analyses, population structure can also significantly bias efforts to understand the evolutionary dynamics of pangenomes [11–13]. This review discusses some of the major challenges these artefacts present to the analysis of bacterial pangenomes. These challenges can be broadly divided into the bioinformatics challenges of annotating, clustering and categorizing genes and the related problem of modelling the dynamics, selection and function of genes within pangenomes. The distinct but related problem of identifying fine-scale variation such as single-nucleotide polymorphisms (SNPs) is left to other publications [14, 15]. We hope that by describing these challenges and emerging strategies for dealing with them, researchers will be able to avoid some of the major pitfalls in the analysis of prokaryote pangenomes.

Fig. 1.

A schematic indicating the main steps in pangenome inference and the sources of error that contribute at each stage. The sections covered in this review are highlighted in grey.

Pangenome inference

The bioinformatics challenge of inferring a bacterial pangenome can be broadly split into the problems of assembly, annotation, classification and clustering of genes and intergenic sequences. Errors introduced at each stage will propagate to later steps, which can compound their impact. Small errors in individual genomes will also compound as the data set increases in size. With the increasing interest in bacterial pangenomes, methods are being developed that account for and correct these artefacts.
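The way small per-genome errors compound with data set size can be illustrated with a short calculation. The sketch below is not taken from any of the tools discussed here. It assumes, purely for illustration, that each genome independently fails to report a truly present gene with a fixed probability `miss_rate`, and asks how likely a gene that is present in every genome is to be observed in at least a given fraction of them. The function name and parameters are hypothetical.

```python
from math import ceil, comb


def prob_called_core(n_genomes: int, miss_rate: float, threshold: float = 1.0) -> float:
    """Probability that a gene truly present in every genome is still observed
    in at least `threshold` * n_genomes genomes.

    Assumes each genome independently misses the gene with probability
    `miss_rate` (an illustrative assumption, not an estimate from real data).
    """
    # Smallest number of genomes that must report the gene to pass the threshold.
    min_present = ceil(threshold * n_genomes - 1e-9)
    p = 1.0 - miss_rate
    # Upper tail of a binomial distribution with success probability p.
    return sum(
        comb(n_genomes, k) * p**k * miss_rate ** (n_genomes - k)
        for k in range(min_present, n_genomes + 1)
    )


if __name__ == "__main__":
    for n in (10, 100, 1000):
        strict = prob_called_core(n, miss_rate=0.01, threshold=1.0)
        relaxed = prob_called_core(n, miss_rate=0.01, threshold=0.95)
        print(f"{n} genomes: strict={strict:.4f} relaxed={relaxed:.4f}")
```

Under these assumptions, the probability under a strict 100 % threshold is simply (1 - `miss_rate`) raised to the number of genomes, so it shrinks quickly as more genomes are added, whereas a relaxed threshold is far more tolerant of the same error rate.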

Automatic gene annotation

Genome annotation is one of the central challenges in the analysis of prokaryotic pangenomes. While the scale of genome sequencing and assembly has increased dramatically over the past decade, many of the computational tools and techniques for identifying genes remain the same [8, 9, 16, 17], and are often based on the tiny, and rapidly decreasing, proportion of the data for which there is experimental evidence. Contamination, misassemblies and the difficulty of automatically annotating draft genomes lead to annotation errors. Such errors accumulate and can come to dominate gene databases, particularly if they are subsequently used to inform the annotation of new genomes [7, 8].

Prokaryote gene annotation pipelines rely on only a handful of algorithms to predict coding sequences (CDSs). These algorithms often struggle to account for fragmented assemblies, leading to misannotations and inconsistent annotations (for example, a CDS called in a different reading frame), even given identical gene sequences [7]. As most pangenome clustering algorithms only consider protein sequence, such out-of-frame errors result in erroneous ortholog clusters. Popular algorithms such as Prodigal, Glimmer and GeneMarkS incorporate a training step that adapts the algorithm to the features of the genome being annotated [18–21]. While this improves the accuracy of prediction, it can lead to inconsistencies in the annotation of identical sequence elements when the background genetic diversity and fragmentation of each genome differs [7, 9]. Similarly, gene annotation pipelines such as Prokka, DFAST and PGAP each make use of different and in some cases user-specified reference databases, which can lead to discordance in the annotation of the resulting CDSs [22–24]. The parameter choices and post-processing steps used in each of these pipelines can also lead to annotation discrepancies [9].
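To see why an annotation in the wrong reading frame is so damaging for protein-based clustering, consider the toy example below. It translates the same made-up nucleotide sequence in two frames using the standard genetic code. The sequence and the `translate` helper are illustrative assumptions and do not reproduce the behaviour of any particular annotation tool.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate `dna` starting at offset `frame`, ignoring trailing bases."""
    codons = (dna[i : i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


if __name__ == "__main__":
    toy_gene = "ATGGCTAAAGGTTAA"  # hypothetical sequence, not a real gene
    print(translate(toy_gene, frame=0))
    print(translate(toy_gene, frame=1))
```

The two translations share no residues even though the underlying DNA is identical, so a clustering method that compares only protein sequences would place the two annotations in separate clusters.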

Algorithms that attempt to address the issue of inconsistent annotations include Balrog and Bakta [25, 26]. Balrog is a CDS prediction algorithm, which builds a universal model of prokaryotic genes using a temporal convolutional network trained on a large and diverse set of microbial genomes. By fixing the training step, the algorithm ensures that CDSs will be called consistently in identical regions of the prokaryotic genomes [25]. The Bakta pipeline improves the consistency of annotations between runs by using a large, fixed, taxon-independent database of reference gene sequences [26]. The algorithm also includes steps to remove known spurious CDSs and small open reading frames (sORFs). However, the pipeline relies on Prodigal to call the initial CDS regions and thus previously unobserved inconsistencies introduced at this stage will persist.

Recent advances, such as the Balrog and Bakta methods explained above, are improving our ability to automatically and consistently annotate genes in draft prokaryote assemblies. The improved consistency of these approaches usually comes at the cost of fixing reference or gene prediction models trained on historical data sets: some real genes will have properties divergent from the models used in these programs, such as short genes or those with alternative codon usage biases, and will be misannotated by automated systems. Substantial challenges remain in developing methods that are both able to maintain the consistency of gene annotations and adapt to improved databases, larger numbers of genomes and the identification of previously unobserved genes, or those with anomalous properties.

Clustering orthologs and paralogs

Following annotation, gene sequences need to be clustered into orthologous and paralogous groups. The most recent common ancestor of orthologs can be traced back to a speciation event, whereas paralogs trace their most recent common ancestry to a gene duplication event. Generating accurate clusters is critical to understanding which genomes share common genes and the evolution of pangenomes. Errors in gene annotation, contamination and the wide variation in the diversity of different gene families present considerable challenges in generating accurate clusters.

Pangenome clustering algorithms often make use of clustering or homology detection algorithms such as BLAST, CD-HIT and MMseqs2 [27–29]. These are used to generate initial clusters or, in the case of BLAST, a pairwise distance matrix. An important, but sometimes underappreciated distinction between these tools and pangenome clustering algorithms is that they do not account for paralogous genes or the varying sequence identity of different gene families. Varying the parameters for defining orthologs versus paralogs can have significant effects on the calculation of pangenome sizes [30].
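The sensitivity of cluster counts to a single identity cut-off can be sketched with a toy example. The code below groups genes by single-linkage clustering over a hypothetical table of pairwise identities. This is a deliberately simple stand-in and is not the algorithm used by CD-HIT, MMseqs2 or any pangenome pipeline; the identity values are invented for illustration.

```python
def cluster_count(identities: dict, n_genes: int, threshold: float) -> int:
    """Number of single-linkage clusters when gene pairs with identity at or
    above `threshold` are joined. Uses a small union-find structure."""
    parent = list(range(n_genes))

    def find(x: int) -> int:
        # Follow parent pointers to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), identity in identities.items():
        if identity >= threshold:
            parent[find(a)] = find(b)
    return len({find(i) for i in range(n_genes)})


# Hypothetical pairwise identities between five gene sequences.
toy_identities = {(0, 1): 0.98, (1, 2): 0.91, (0, 2): 0.90, (2, 3): 0.72, (3, 4): 0.96}

if __name__ == "__main__":
    for cutoff in (0.95, 0.90, 0.70):
        print(cutoff, cluster_count(toy_identities, n_genes=5, threshold=cutoff))
```

In this toy data set the same five sequences form three, two or one cluster depending on the cut-off, which mirrors how a fixed threshold can split a diverse gene family or merge distinct ones.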

Initial efforts to address this problem primarily focused on accounting for the variance in sequence identity between different gene families [31, 32]. BLAST or similar fast alignment algorithms are used to generate a distance matrix between all gene pairs. Clustering is then performed using either the Markov clustering algorithm (MCL) or by looking at triangles of best hits [33, 34]. To account for the increasing size of databases, more recent algorithms precluster genes to reduce redundancy prior to generating a distance matrix [30, 35, 36]. Paralogs are usually determined using gene synteny [7, 30, 35] or by considering gene family phylogenies [36, 37]. A number of algorithms do not attempt to resolve paralogous clusters [31, 38].

The importance of accounting for annotation errors as part of the clustering process has led to the development of new algorithms. The Panaroo algorithm uses gene synteny to identify fragmented genes, missing annotations, out-of-frame errors and contamination [7]. The Peppan algorithm performs an initial clustering step before reannotating all genomes to ensure that annotations are consistent [37]. PPanGGoLiN uses gene synteny to correct for fragmented genes but does not account for other sources of error, including paralogs and variance in the sequence diversity of different gene families [38].

While these algorithms can reduce the impact of annotation errors, it is almost certain that some erroneous clusters will remain. Thus, it is essential that downstream analyses account for such errors to avoid biasing our understanding of pangenome dynamics. The largest predicted pangenome is not necessarily the most accurate.

Intergenic regions

Prokaryotic pangenome analysis tools focus almost exclusively on protein-coding sequences. This protein-centric approach is problematic, as it neglects non-coding RNAs and many important features found in intergenic regions (IGRs), such as promoters, terminators and regulatory binding sites. These features have been shown to be under selection and can have important phenotypic implications [39, 40].

Unlike most protein identification tools, algorithms designed to annotate non-coding regions typically rely on prebuilt feature models and do not suffer from the same genome-specific training problems [41–43]. However, erroneous coding annotations can overlap with predicted non-coding RNAs, leading to similar issues [22, 26]. Fragmented assemblies and contamination affect the annotation of non-coding sequences in a similar way and are likely to be considerable sources of error.

The clustering of non-coding regions has received relatively little attention. In the majority of cases, intergenic features are clustered using standard clustering algorithms such as CD-HIT and MMseqs2. As stated previously, these tools do not account for errors, gene synteny or differences in the diversity of different intergenic regions and often require modifications to account for in-frame stop codons found in pseudogenes. In contrast, the Piggy algorithm uses gene synteny to classify intergenic regions and implements a similar clustering strategy to Roary. Importantly, Piggy identifies ‘switched’ intergenic regions upstream of conserved genes [44]. While the Piggy algorithm presents a major advance over classic sequence clustering algorithms, its reliance on Roary for the initial gene clustering implies that it will still be impacted by annotation errors. Improvements in the annotation, clustering and analysis of intergenic regions are essential to developing an accurate picture of the dynamics of prokaryotic pangenome evolution.

Pangenome dynamics

Modelling the dynamics governing the evolution of bacterial pangenomes is essential to understanding how bacterial species evolve and adapt. Improved annotation algorithms and error-aware gene clustering pipelines can substantially reduce the rate at which errors introduce artificial orthologous and paralogous gene clusters or gene deletions in bacterial pangenome analyses. Nevertheless, regardless of the initial bioinformatics pipeline, it is likely that a number of erroneous gene clusters remain. This presents a considerable problem as most downstream methods for analysing bacterial pangenome dynamics do not account for errors.

The impacts of population structure and sampling bias are also often neglected in pangenome models. Similar to GWAS and phylodynamic analysis, failing to consider population structure can significantly bias results, leading to incorrect interpretations [11–13].

Defining the core genome

Orthologous and paralogous gene clusters are classified by how common they are in a particular species. The two most common classifications are: ‘core’, which refers to genes present in all or almost all genomes of a taxonomic unit (usually a species), and ‘accessory’, which refers to genes present in only a subset of genomes. Various sub-categories are also frequently used, including ‘rare’ genes that are observed in a single genome as well as the ‘soft core’, ‘extended core’ or ‘stabilome’, which refer to genes observed in the majority of genomes [35, 38]. These categories allow for a small amount of variation caused by assembly, annotation or clustering errors.

Typically, gene clusters are classified into categories based on predefined thresholds on the fraction of genomes they appear in. The default setting in Roary, one of the most popular pangenome analysis tools, uses a threshold of 99 % to classify core genes [35]. A reliance on strict thresholds does not allow for variance in the error rate between analyses, or the fact that errors multiply with increasing size of the data set, and fails to consider information on the underlying temporal and genetic diversity of the genomes being considered. As an example, a strict threshold is likely to classify most genes in a hospital outbreak of a bacterium as core but would classify far fewer core genes when analysing a diverse set of samples from the same species representing thousands of years of evolution.
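Threshold-based classification is simple to express in code, which also makes its arbitrariness easy to see. The following minimal sketch labels gene clusters from a toy presence–absence table using a user-chosen cut-off. The function name, the cluster names and the data are hypothetical and are not drawn from any particular pipeline.

```python
def classify_clusters(presence: dict, core_threshold: float) -> dict:
    """Label each gene cluster 'core' or 'accessory' from the fraction of
    genomes in which it is observed."""
    labels = {}
    for cluster, row in presence.items():
        frequency = sum(row) / len(row)
        labels[cluster] = "core" if frequency >= core_threshold else "accessory"
    return labels


# Toy presence (1) / absence (0) calls for three clusters across ten genomes.
presence = {
    "geneA": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "geneB": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # one absence, possibly an annotation error
    "geneC": [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
}

strict = classify_clusters(presence, core_threshold=0.99)
relaxed = classify_clusters(presence, core_threshold=0.90)

if __name__ == "__main__":
    print(strict)
    print(relaxed)
```

Here `geneB` moves between categories depending solely on the chosen cut-off, even though the presence–absence data are unchanged.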

To avoid the use of arbitrary thresholds, the PPanGGoLiN pangenome pipeline uses an expectation–maximization algorithm to partition gene clusters into a number of groups based on their presence–absence patterns and gene synteny [38]. The approach can use either a predefined number of gene categories or the number can be estimated as part of the analysis. This is a major improvement over the use of arbitrary thresholds as the algorithm is able to adapt to the error rate and dynamics of different data sets. However, other than by considering gene presence and absence, the algorithm does not incorporate information on the genetic diversity of the underlying samples, which makes it challenging to compare results between data sets.

Accounting for population structure in determining the core genome is critical to understanding whether analysis results from one data set can be generalized to a wider population [45]. Similar to determining the date of the most recent common ancestor, it is not possible to generalize results beyond the set of lineages that are represented in a sample. A potential alternative strategy is to consider gene essentiality instead of gene conservation. Essential genes are necessary for the survival of a bacterium. This definition is appealing as it is less sensitive to sampling biases. However, it introduces new challenges, as determining whether a gene is essential requires expensive and time-consuming experiments and essentiality can also be growth condition- and lineage-dependent [46]. New methods and techniques are needed to define the ‘core’ genome more concretely, in a way that better accounts for both population structure and erroneous gene clusters.

Gene exchange rates and open vs closed pangenomes

The rate at which genes are gained and lost forms a critical component of the dynamics of pangenome evolution and relates directly to the diversity of genes that can be found in a genome of a lineage or species [47]. HGT also has important implications for a bacterium’s ability to adapt to new niches and, in the case of pathogens, interventions such as vaccines and drug treatments.

Rarefaction curves are commonly used to investigate differences in gene diversity and the rates of HGT in bacterial pangenomes. This approach is taken from ecology, where the number of unique species identified is plotted against the number of samples taken. In such studies, false positives or the misidentification of new species is rare, as such classifications are usually heavily scrutinized (as in the case of the platypus [48]). In contrast, gene annotation errors are frequent and can significantly bias these plots (Fig. 2d, e). In ecological studies, the number of samples is usually representative of the sampling effort. The same is generally not true for bacterial genomic studies, where samples are often collected from hospitals or other convenient locations leading to strong population structure (Fig. 2a–c).
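A gene accumulation (rarefaction) curve can be computed directly from per-genome sets of gene clusters. The sketch below averages the cumulative number of distinct clusters over random genome orderings, and then repeats the calculation after adding one spurious, genome-specific cluster to each genome to mimic annotation errors. The data, the error model and the function name are illustrative assumptions only.

```python
import random


def accumulation_curve(genomes, n_permutations=200, seed=0):
    """Mean number of distinct gene clusters seen after sampling 1..n genomes,
    averaged over random orderings of the genomes."""
    rng = random.Random(seed)
    totals = [0] * len(genomes)
    for _ in range(n_permutations):
        order = list(genomes)
        rng.shuffle(order)
        seen = set()
        for i, gene_set in enumerate(order):
            seen |= gene_set
            totals[i] += len(seen)
    return [total / n_permutations for total in totals]


# Six toy genomes: five shared clusters plus one of three accessory clusters each.
core = {f"core{i}" for i in range(5)}
clean = [core | {f"acc{i % 3}"} for i in range(6)]
# Same genomes with one spurious, genome-specific cluster added to each.
noisy = [genes | {f"err{i}"} for i, genes in enumerate(clean)]

clean_curve = accumulation_curve(clean)
noisy_curve = accumulation_curve(noisy)

if __name__ == "__main__":
    print(clean_curve)
    print(noisy_curve)
```

In this example the clean curve plateaus at 8 clusters, while the noisy curve keeps rising to 14, so the erroneous clusters alone make the pangenome appear to grow with every genome added.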

Fig. 2.

The impact of annotation errors (d–e) and population structure (a–c) on rarefaction curves and the definition of the core genome. The simulated dynamics of the red and blue data sets in (d, e) are the same. The introduction of errors, including incorrect and missing annotations (red), biases these plots. This leads to the incorrect conclusion that the parameters driving the dynamics differ between the two data sets. Similarly, the green curve and bar indicate the results if only the clade in (a) is considered versus the entire phylogeny, shown in purple. The underlying population structure leads to the inference of a larger core genome and lower gene diversity, which may cause a misinterpretation of the pangenome dynamics, despite the parameters of the two data sets being identical.

In addition to quantifying the diversity of bacterial pangenomes, researchers are often interested in how well this diversity has been sampled. This has led to the classification of pangenomes as either ‘open’ or ‘closed’. Open pangenomes have a diverse accessory genome, where novel gene clusters are identified with each additional genome sequenced. Conversely, a pangenome is described as closed if it has a limited accessory genome and most genes have already been observed. Typically, the binary classification of pangenomes into these two categories relies on a limited number of samples and does not consider the population structure of the sampled genomes. Instead, methods borrowed from information theory are used, such as Heaps’ law [1, 49]. Similar to estimating the date of the most recent common ancestor (MRCA) of a species, where it is only possible to estimate the date of the MRCA of the samples, it is only possible to estimate pangenome parameters for the set of sampled genomes. Ignoring both population structure and errors in the presence and absence of genes can significantly bias our understanding of the dynamics of pangenome gene diversity, gain and loss.
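One common way Heaps’ law is applied is to fit a power law, P(N) ≈ κN^γ, to the pangenome size P observed after N genomes. The sketch below fits κ and γ by ordinary least squares on the log–log scale. It assumes this particular parameterization (others are in use, and the cut-offs used to call a pangenome ‘open’ or ‘closed’ vary between studies) and takes an already averaged accumulation curve as input. It makes no correction for population structure or annotation errors, which is precisely the limitation discussed above.

```python
from math import exp, log


def fit_power_law(sizes):
    """Fit sizes[i] ~ kappa * (i + 1) ** gamma by least squares on log-log axes.

    `sizes` is the pangenome size after 1, 2, ..., n genomes.
    Returns (kappa, gamma).
    """
    xs = [log(i + 1) for i in range(len(sizes))]
    ys = [log(size) for size in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    kappa = exp(mean_y - gamma * mean_x)
    return kappa, gamma


if __name__ == "__main__":
    # Synthetic curve generated from known parameters, used only to check the fit.
    synthetic = [1000 * n**0.3 for n in range(1, 51)]
    print(fit_power_law(synthetic))
```

Because the fitted exponent depends on which genomes were sampled and on how many spurious clusters they contain, two data sets with identical underlying dynamics can yield different exponents.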

An alternative strategy to the use of rarefaction curves and the binary classification of pangenomes into open and closed is to consider gene gain and loss rates. This allows for simpler comparisons between data sets and can account for both population structure and erroneous gene clusters. Two common models of gene gain and loss are the finitely and infinitely many gene models (FMG and IMG, respectively). The FMG model assumes that the same gene can be gained and lost more than once and is drawn from a finite set of genes [50]. The IMG model assumes an infinite pool of available genes and that each gene can be gained at most once [51–53]. The size of collections of gene clusters (gene families) has also been modelled using a birth–death process [54]. These models usually rely on a phylogeny built from the core genome diversity to control for the underlying population structure.
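The difference between the two gene pool assumptions can be illustrated with a toy simulation of a single lineage. The sketch below is a heavily simplified discrete-time caricature, not an implementation of the published FMG or IMG models, which are defined in continuous time along a phylogeny. All names and parameter values are hypothetical.

```python
import random


def simulate_lineage(steps, gain_p, loss_p, pool_size=None, seed=0):
    """Toy gain/loss process along one lineage.

    At each step every gene currently present is lost with probability
    `loss_p`, and a gain is attempted with probability `gain_p`.
    pool_size=None : IMG-like, every gain introduces a brand-new gene id.
    pool_size=M    : FMG-like, the gained gene is drawn from M possible ids,
                     so the same gene can be gained again after being lost.
    Returns the final gene content and the list of gain events.
    """
    rng = random.Random(seed)
    present = set()
    gain_events = []
    next_id = 0
    for _ in range(steps):
        present = {gene for gene in present if rng.random() >= loss_p}
        if rng.random() < gain_p:
            if pool_size is None:
                gene = next_id
                next_id += 1
            else:
                gene = rng.randrange(pool_size)
            if gene not in present:
                gain_events.append(gene)
            present.add(gene)
    return present, gain_events


img_present, img_gains = simulate_lineage(500, gain_p=0.5, loss_p=0.2)
fmg_present, fmg_gains = simulate_lineage(500, gain_p=0.5, loss_p=0.2, pool_size=3)

if __name__ == "__main__":
    print(len(img_gains), len(set(img_gains)))
    print(len(fmg_gains), len(set(fmg_gains)))
```

In the IMG-like run every gain event involves a distinct gene, whereas in the FMG-like run the same few genes are gained repeatedly. This distinction matters for inference, because a repeated gain in the same gene looks identical to a single ancestral gain followed by inheritance unless the phylogeny is taken into account.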

While models are available (but infrequently used) to control for population structure, the impact of errors in the analysis of gene gain and loss has received less attention. Exceptions include extensions to the birth–death model of gene families that account for errors [55, 56], in addition to techniques that model gene gain and loss rates using generalized linear regression [57].

Pangenome selection

Our understanding of the role selection plays in shaping bacterial pangenomes is still developing. A major difference between the analysis of selection in pangenomes versus allelic variation is that pangenome selection can act both at the genome level and on individual genetic elements at the ‘genic’ level. An example of this is selfish mobile genetic elements that can have both neutral and deleterious effects on the host cell’s fitness [58].

These dynamics and the fact that multiple genes are often gained and lost at once make it challenging to extend classic neutral population genetics theory to the analysis of pangenomes. Niche adaptation occurring within bacterial species with large effective population sizes has been suggested as a plausible explanation for a large amount of within-species gene variation [59]. An alternative explanation for the correlation between within-species gene diversity and effective population size is that neutral evolution leads to populations with larger effective population sizes having a greater diversity of gene content [60]. Genic selection is also thought to contribute to the diversity of pangenomes [58]. The contribution of each of these mechanisms to bacterial pangenome evolution is debated [61–63]. Similar to our understanding of gene gain and loss rates, both population structure and erroneous gene clusters must be considered to accurately characterize the role of selection in pangenome evolution.

Several strategies are being used to disentangle the selective forces driving the evolution of pangenomes. One approach uses the co-occurrence of genes in different lineages to look for evidence of selection. Co-occurring genes can be associated with functional categories that enhance a bacterium’s ability to survive in a particular niche [64]. Accounting for population structure is an essential step in identifying co-evolving genes [64]. Alternative approaches have used approximate Bayesian statistical methods to fit simulation models to pangenome gene presence–absence patterns [4]. While population structure is often considered in such analyses, the impacts of erroneous gene clusters are unclear. In order to obtain an accurate understanding of bacterial pangenome dynamics, future methods will need to be robust to both population structure and errors in the annotation and clustering of pangenomes.

Discussion

Understanding the evolutionary dynamics of prokaryotic pangenomes is key to answering several fundamental questions in microbial evolution and ecology. Pangenome dynamics also have important implications for the design of interventions targeting drug resistance, virulence and vaccine evasion of major pathogens. While there has been a dramatic increase in the number of sequenced prokaryotic genomes, there have been comparatively few advances in the development of bioinformatics and modelling methods used to analyse pangenomes.

We have outlined some major challenges in the analysis of prokaryotic pangenomes. Recent advances in the prediction of protein structure and automated annotation could lead to major improvements in our ability to accurately characterize pangenomes [65]. However, while improvements in the design of bioinformatics algorithms to annotate, cluster and classify genes will certainly improve analyses, it is essential that downstream modelling strategies are robust to population structure and errors introduced in the initial bioinformatics stage. Simulation can be used as an effective tool to test the sensitivity of mathematical models to the impacts of errors in the input data. This will be critical as we develop a better understanding of the evolution of prokaryotic pangenomes and the role of selection at both the genome and gene levels.

Funding information

This work was supported by the Research Council of Norway (grant 2 999 131 to G.T.H. and J.C.) and the European Research Council (grant 742 158 to J.C.).

Conflicts of interest

The authors declare that there are no conflicts of interest.

Footnotes

Abbreviations: CDS, coding sequence; FMG, finitely many genes; GWAS, genome wide association study; HGT, horizontal gene transfer; ICE, integrative conjugative elements; IME, integrative mobilizable elements; IMG, infinitely many genes; IS, insertion sequences; ORF, open reading frame; SNP, single nucleotide polymorphism.

References

  • 1.Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome.”. Proc Natl Acad Sci. 2005;102:13950–13955. doi: 10.1073/pnas.0506758102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Welch RA, Burland V, Plunkett G, Redford P, Roesch P, et al. Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli . Proc Natl Acad Sci. 2002;99:17020–17024. doi: 10.1073/pnas.252529799. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Thomas CM, Nielsen KM. Mechanisms of, and barriers to, horizontal gene transfer between bacteria. Nat Rev Microbiol. 2005;3:711–721. doi: 10.1038/nrmicro1234. [DOI] [PubMed] [Google Scholar]
  • 4.Corander J, Fraser C, Gutmann MU, Arnold B, Hanage WP, et al. Frequency-dependent selection in vaccine-associated pneumococcal population dynamics. Nat Ecol Evol. 2017;1:1950–1960. doi: 10.1038/s41559-017-0337-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Donati C, Hiller NL, Tettelin H, Muzzi A, Croucher NJ, et al. Structure and dynamics of the pan-genome of Streptococcus pneumoniae and closely related species. Genome Biol. 2010;11:R107. doi: 10.1186/gb-2010-11-10-r107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Luo C, Walk ST, Gordon DM, Feldgarden M, Tiedje JM, et al. Genome sequencing of environmental Escherichia coli expands understanding of the ecology and speciation of the model bacterial species. Proc Natl Acad Sci. 2011;108:7200–7205. doi: 10.1073/pnas.1015622108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, et al. Producing polished prokaryotic pangenomes with the panaroo pipeline. Genome Biol. 2020;21:180. doi: 10.1186/s13059-020-02090-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Salzberg SL. Next-generation genome annotation: we still struggle to get it right. Genome Biol. 2019;20:92. doi: 10.1186/s13059-019-1715-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Dimonaco NJ, Aubrey W, Kenobi K, Clare A, Creevey CJ. No one tool to rule them all: prokaryotic gene prediction tool annotations are highly dependent on the organism of study. Bioinformatics. 2022;38:1198–1207. doi: 10.1093/bioinformatics/btab827. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Warren AS, Archuleta J, Feng W-C, Setubal JC. Missing genes in the annotation of prokaryotic genomes. BMC Bioinformatics. 2010;11:131. doi: 10.1186/1471-2105-11-131. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Dearlove BL, Xiang F, Frost SDW. Biased phylodynamic inferences from analysing clusters of viral sequences. Virus Evol. 2017;3:vex020. doi: 10.1093/ve/vex020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Lees JA, Vehkala M, Välimäki N, Harris SR, Chewapreecha C, et al. Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes. Nat Commun. 2016;7:12797. doi: 10.1038/ncomms12797. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Earle SG, Wu C-H, Charlesworth J, Stoesser N, Gordon NC, et al. Identifying lineage effects when controlling for population structure improves power in bacterial association studies. Nat Microbiol. 2016;1:16041. doi: 10.1038/nmicrobiol.2016.41. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Garrison E, Sirén J, Novak AM, Hickey G, Eizenga JM, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol. 2018;36:875–879. doi: 10.1038/nbt.4227. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Colquhoun RM, Hall MB, Lima L, Roberts LW, Malone KM, et al. Pandora: nucleotide-resolution bacterial pan-genomics with reference graphs. Genome Biol. 2021;22:267. doi: 10.1186/s13059-021-02473-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Denton JF, Lugo-Martinez J, Tucker AE, Schrider DR, Warren WC, et al. Extensive error in the number of genes inferred from draft genome assemblies. PLoS Comput Biol. 2014;10:e1003998. doi: 10.1371/journal.pcbi.1003998. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Dong Y, Li C, Kim K, Cui L, Liu X. Genome annotation of disease-causing microorganisms. Brief Bioinform. 2021;22:845–854. doi: 10.1093/bib/bbab004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with glimmer. Bioinformatics. 2007;23:673–679. doi: 10.1093/bioinformatics/btm009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Hyatt D, Chen G-L, Locascio PF, Land ML, Larimer FW, et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11:119. doi: 10.1186/1471-2105-11-119.
  • 20.Salzberg SL, Delcher AL, Kasif S, White O. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 1998;26:544–548. doi: 10.1093/nar/26.2.544.
  • 21.Besemer J, Lomsadze A, Borodovsky M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 2001;29:2607–2618. doi: 10.1093/nar/29.12.2607.
  • 22.Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014;30:2068–2069. doi: 10.1093/bioinformatics/btu153.
  • 23.Tanizawa Y, Fujisawa T, Nakamura Y. DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication. Bioinformatics. 2018;34:1037–1039. doi: 10.1093/bioinformatics/btx713.
  • 24.Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, et al. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res. 2016;44:6614–6624. doi: 10.1093/nar/gkw569.
  • 25.Sommer MJ, Salzberg SL. Balrog: A universal protein model for prokaryotic gene prediction. PLoS Comput Biol. 2021;17:e1008727. doi: 10.1371/journal.pcbi.1008727.
  • 26.Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, et al. Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microb Genom. 2021;7:000685. doi: 10.1099/mgen.0.000685.
  • 27.Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
  • 28.Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35:1026–1028. doi: 10.1038/nbt.3988.
  • 29.Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–1659. doi: 10.1093/bioinformatics/btl158.
  • 30.Bayliss SC, Thorpe HA, Coyle NM, Sheppard SK, Feil EJ. PIRATE: A fast and scalable pangenomics toolbox for clustering diverged orthologues in bacteria. Gigascience. 2019;8:giz119. doi: 10.1093/gigascience/giz119.
  • 31.Li L, Stoeckert CJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–2189. doi: 10.1101/gr.1224503.
  • 32.O’Brien KP, Remm M, Sonnhammer ELL. Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res. 2005;33:D476–80. doi: 10.1093/nar/gki107.
  • 33.Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002;30:1575–1584. doi: 10.1093/nar/30.7.1575.
  • 34.Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science. 1997;278:631–637. doi: 10.1126/science.278.5338.631.
  • 35.Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015;31:3691–3693. doi: 10.1093/bioinformatics/btv421.
  • 36.Ding W, Baumdicker F, Neher RA. panX: pan-genome analysis and exploration. Nucleic Acids Res. 2018;46:e5. doi: 10.1093/nar/gkx977.
  • 37.Zhou Z, Charlesworth J, Achtman M. Accurate reconstruction of bacterial pan- and core genomes with PEPPAN. Genome Res. 2020;30:1667–1679. doi: 10.1101/gr.260828.120.
  • 38.Gautreau G, Bazin A, Gachet M, Planel R, Burlot L, et al. PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLoS Comput Biol. 2020;16:e1007732. doi: 10.1371/journal.pcbi.1007732.
  • 39.Molina N, van Nimwegen E. Universal patterns of purifying selection at noncoding positions in bacteria. Genome Res. 2008;18:148–160. doi: 10.1101/gr.6759507.
  • 40.Khademi SMH, Sazinas P, Jelsbak L. Within-host adaptation mediated by intergenic evolution in Pseudomonas aeruginosa. Genome Biol Evol. 2019;11:1385–1397. doi: 10.1093/gbe/evz083.
  • 41.Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004;32:11–16. doi: 10.1093/nar/gkh152.
  • 42.Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013;29:2933–2935. doi: 10.1093/bioinformatics/btt509.
  • 43.Edgar RC. PILER-CR: fast and accurate identification of CRISPR repeats. BMC Bioinformatics. 2007;8:18. doi: 10.1186/1471-2105-8-18.
  • 44.Thorpe HA, Bayliss SC, Sheppard SK, Feil EJ. Piggy: a rapid, large-scale pan-genome analysis tool for intergenic regions in bacteria. Gigascience. 2018;7:1–11. doi: 10.1093/gigascience/giy015.
  • 45.Horesh G, Taylor-Brown A, McGimpsey S, Lassalle F, Corander J, et al. Different evolutionary trends form the twilight zone of the bacterial pan-genome. Microb Genom. 2021;7:000670. doi: 10.1099/mgen.0.000670.
  • 46.Rosconi F, Rudmann E, Li J, Surujon D, Anthony J, et al. A bacterial pan-genome makes gene essentiality strain-dependent and evolvable. Nat Microbiol. 2022;7:1580–1592. doi: 10.1038/s41564-022-01208-7.
  • 47.Arnold BJ, Huang I-T, Hanage WP. Horizontal gene transfer and adaptive evolution in bacteria. Nat Rev Microbiol. 2022;20:206–218. doi: 10.1038/s41579-021-00650-4.
  • 48.Shaw G, Nodder FP. The Duck-Billed Platypus, Platypus anatinus. The Naturalist’s Miscellany. 1799;10:385–386. doi: 10.5962/p.304567.
  • 49.Tettelin H, Riley D, Cattuto C, Medini D. Comparative genomics: the bacterial pan-genome. Curr Opin Microbiol. 2008;11:472–477. doi: 10.1016/j.mib.2008.09.006.
  • 50.Zamani-Dahaj SA, Okasha M, Kosakowski J, Higgs PG. Estimating the frequency of horizontal gene transfer using phylogenetic models of gene gain and loss. Mol Biol Evol. 2016;33:1843–1857. doi: 10.1093/molbev/msw062.
  • 51.Collins RE, Higgs PG. Testing the infinitely many genes model for the evolution of the bacterial core genome and pangenome. Mol Biol Evol. 2012;29:3413–3425. doi: 10.1093/molbev/mss163.
  • 52.Baumdicker F, Hess WR, Pfaffelhuber P. The infinitely many genes model for the distributed genome of bacteria. Genome Biol Evol. 2012;4:443–456. doi: 10.1093/gbe/evs016.
  • 53.Baumdicker F, Pfaffelhuber P. The infinitely many genes model with horizontal gene transfer. Electron J Probab. 2014;19 doi: 10.1214/EJP.v19-2642.
  • 54.Hahn MW, De Bie T, Stajich JE, Nguyen C, Cristianini N. Estimating the tempo and mode of gene family evolution from comparative genomic data. Genome Res. 2005;15:1153–1160. doi: 10.1101/gr.3567505.
  • 55.Han MV, Thomas GWC, Lugo-Martinez J, Hahn MW. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3. Mol Biol Evol. 2013;30:1987–1997. doi: 10.1093/molbev/mst100.
  • 56.Mendes FK, Vanderpool D, Fulton B, Hahn MW. CAFE 5 models variation in evolutionary rates among gene families. Bioinformatics. 2020:btaa1022. doi: 10.1093/bioinformatics/btaa1022.
  • 57.Tonkin-Hill G, Gladstone RA, Pöntinen AK, Arredondo-Alonso S, Bentley SD, et al. Robust analysis of prokaryotic pangenome gene gain and loss rates with panstripe. Genome Res. 2023;33:129–140. doi: 10.1101/gr.277340.122.
  • 58.Douglas GM, Shapiro BJ. Genic selection within prokaryotic pangenomes. Genome Biol Evol. 2021;13:evab234. doi: 10.1093/gbe/evab234.
  • 59.McInerney JO, McNally A, O’Connell MJ. Why prokaryotes have pangenomes. Nat Microbiol. 2017;2:17040. doi: 10.1038/nmicrobiol.2017.40.
  • 60.Andreani NA, Hesse E, Vos M. Prokaryote genome fluidity is dependent on effective population size. ISME J. 2017;11:1719–1721. doi: 10.1038/ismej.2017.36.
  • 61.Shapiro BJ. The population genetics of pangenomes. Nat Microbiol. 2017;2:1574. doi: 10.1038/s41564-017-0066-6.
  • 62.Vos M, Eyre-Walker A. Are pangenomes adaptive or not? Nat Microbiol. 2017;2:1576. doi: 10.1038/s41564-017-0067-5.
  • 63.McInerney JO, McNally A, O’Connell MJ. Reply to “The population genetics of pangenomes.”. Nat Microbiol. 2017;2:1575. doi: 10.1038/s41564-017-0068-4.
  • 64.Whelan FJ, Rusilowicz M, McInerney JO. Coinfinder: detecting significant associations and dissociations in pangenomes. Microb Genom. 2020;6:e000338. doi: 10.1099/mgen.0.000338.
  • 65.Jumper J, Evans R, Pritzel A, Green T, Figurnov M, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2.

Articles from Microbial Genomics are provided here courtesy of Microbiology Society
