mSystems. 2025 Dec 12;11(1):e01007-25. doi: 10.1128/msystems.01007-25

Gempipe: a tool for drafting, curating, and analyzing pan and multi-strain genome-scale metabolic models

Gioele Lazzari 1, Giovanna E Felis 1,2, Elisa Salvetti 1,2, Matteo Calgaro 1, Francesca Di Cesare 1,2, Bas Teusink 3, Nicola Vitulo 1
Editor: Saheed Imam 4
PMCID: PMC12817921  PMID: 41384741

ABSTRACT

Genome-scale metabolic models (GSMMs) can mechanistically explain phenotypic differences among closely related bacterial strains. However, high-throughput multi-strain reconstructions of GSMMs are still challenging: reference-based methods inherit curated information while missing new contents; alternatively, reference-free (universe-based) methods could cover strain-specific reactions, but they disregard curated information. Ideally, references should be curated pan-GSMMs for species (or genus), but their reconstruction is extremely demanding, making them still rare in the literature. Here, Gempipe is presented, a computational tool streamlining the multi-strain reconstruction and analysis of GSMMs, going through the production of a pan-GSMM. Its reconstruction method is hybrid: an optional reference GSMM is automatically expanded with extra reactions taken from a reference-free reconstruction. Gempipe also downloads, filters, and annotates genomes; performs in-depth gene recovery; annotates models’ contents; and predicts strain-specific capabilities. The companion programming interface includes functions ranging from the (pan-)GSMMs’ curation to the multi-strain analysis. Gempipe was validated using multi-strain data sets, showing improved accuracy when compared with state-of-the-art tools. Moreover, metabolic diversities within Limosilactobacillus reuteri were explored, grouping strains into metabolically coherent clusters and systematically predicting health-related metabolites’ biosynthesis.

IMPORTANCE

Available genome-scale metabolic model (GSMM) reconstruction tools present major limitations in the context of multi-strain modeling. Gempipe surpasses these limitations by implementing a novel, hybrid reconstruction strategy. Not only does it produce more accurate strain-specific GSMMs, but it also produces pan-GSMMs when the only available reference is a manually curated model for a single strain, which is currently the most common case. With the vast availability of genome sequences, the high-throughput, multi-strain GSMM reconstruction and analysis approach provided by Gempipe will facilitate large-scale studies of exploration and bioprospecting of strain-level bacterial metabolic diversity, moving a step forward in strains’ screening and rational selection.

KEYWORDS: genome-scale metabolic models, strain-level metabolic biodiversity, bioprospecting

INTRODUCTION

Different strains of the same bacterial species can exhibit marked differences at the phenotypic level, such as the ability to catabolize different substrates, the presence of specific auxotrophies, or the acquisition of biosynthetic pathways through lateral gene transfer (1, 2).

Genome-scale metabolic models (GSMMs) are systems-biology tools that describe the metabolic potential encoded by a genome. Assuming a steady state and specifying a biomass composition, their constraint-based simulations enable predictions of cellular growth under specific nutritive inputs (3). Therefore, the availability of a GSMM for each strain of interest enables in silico screenings of phenotypic characteristics, offering a faster and more cost-effective alternative to traditional experimental methods, generating hypotheses to be subsequently validated experimentally.

Unfortunately, the creation of high-quality GSMMs is time-consuming because manual curation is required (4). This bottleneck remains a key driver for the development of automated tools (5–9), some of which quickly gained traction due to their ability to produce simulation-ready GSMMs, using a reference-free, universe-based top-down approach (10).

Since the first pioneering work by Monk and colleagues on Escherichia coli (11), the general steps for the multi-strain reconstruction of GSMMs have remained mostly the same (12–18). Briefly, genomes are collected and filtered for quality; then, genes are predicted and clustered, creating orthologous gene families sometimes referred to as the pangenome; one of the strains has a high-quality, manually curated GSMM that is used as reference; for each strain, a copy of the reference is made and all the genes that do not have an ortholog with the reference are subtracted, consequently removing associated metabolic reactions; finally, gap-filling is usually limited to minimal media as the starting genomes are quality-filtered and excessive gap-fillings could hide true strain-specificities.

The above reference-based method was formalized in 2019 by Norsigian et al. (19), where metabolic functions are inherited from the reference GSMM after orthologous genes are detected via a blastp (20) best reciprocal hits (BRH) alignment. The method was then implemented in Bactabolize (21), a recent tool published in 2023. Overall, its efficacy is dependent on the availability of a curated and phylogenetically close GSMM taken as reference.

However, the method has a key limitation: output GSMMs contain subsets of the reactions in the reference, excluding unmodeled strain-specific reactions. Indeed, the protocol (19) requires a manual curation of the output GSMMs, adding new reactions that were not originally in the reference, a step that was not automated in Bactabolize (21). For this reason, to fully capture strain-specific metabolic features, a curated pan-GSMM that encompasses the metabolic diversity of the entire species (or genus) should be provided as a reference instead of a strain-specific GSMM, as suggested in Bactabolize.

However, while curated strain-specific GSMMs are time-consuming to produce and therefore often lacking for non-model organisms, comprehensive pan-GSMMs are even more challenging to obtain. When manually curated, pan-GSMMs can require years of development (16, 22, 23), and indeed they are still rare in the literature (21): the few covered species include Klebsiella pneumoniae (23), Escherichia coli (24), Salmonella enterica (12), and Bacillus subtilis (17, 18).

Given the vast number of strain-specific genome sequences now available and the scarcity of comprehensive pan-GSMMs, there is a growing need for tools performing multi-strain reconstructions efficiently, even in the absence of a pan-GSMM. These tools should not only be capable of using a reference strain as a starting point, but also of autonomously integrating new, strain-specific reactions to fully capture the metabolic diversity across strains.

In this work, Gempipe is introduced, a novel package that, to the best of our knowledge, is the first to offer pan and multi-strain reconstruction of GSMMs by implementing a hybrid reconstruction method where an optional reference GSMM is automatically expanded with new contents transferred from an independent reference-free reconstruction. Gempipe also provides additional features: retrieval and quality-filtering of genomes; gene annotation; in-depth gene recovery; re-annotation of modeled contents; a companion application programming interface (API) helping the manual curation of (pan-)GSMMs; an “autopilot” mode skipping the (recommended) manual curation; various flux-balance analysis (FBA)-based predictions of strain-specific metabolic features; dedicated API functions for the multi-strain analysis. In this sense, Gempipe is not only a reconstruction tool, but also an analysis tool in the context of biodiversity exploration/bioprospecting.

MATERIALS AND METHODS

From genomes to gene clusters

Gempipe is composed of three command-line programs, “gempipe recon,” “gempipe derive,” and “gempipe autopilot,” along with a Python API (Fig. 1). “gempipe recon” reconstructs draft pan-GSMMs and supports four types of inputs: proteomes in Genbank format, proteomes in FASTA format, genome assemblies in FASTA format, and a list of NCBI Species Taxonomy IDs (taxids). When taxids are given in input, all the available assemblies for the indicated species are automatically downloaded from Genbank. Each genome/proteome is treated as a separate strain.

Fig 1.

Workflow diagram of Gempipe for metabolic model creation. Process shows reconstruction of pan-GSMM and presence-absence matrix. Strain- and species-specific GSMMs are produced for multi-strain analyses after manual curation or automatic gap-filling.

Overview of Gempipe. “gempipe recon” creates a draft pan-GSMM and the presence-absence matrix (PAM), containing the information relative to the protein clustering; “gempipe derive” derives strain-specific GSMMs, starting from the PAM and the pan-GSMM; the latter should be manually curated beforehand, for example, by using dedicated functions of the Gempipe API. When using “gempipe autopilot,” the manual curation is substituted by an automatic gap-filling of the draft pan-GSMM. Finally, the Gempipe API can be used once again to perform multi-strain analyses. When strains of different species are inputted in the same run, species-specific GSMMs are also produced, defined by the set of reactions always present in all strain-specific GSMMs belonging to the same species.

When genomes are inputted, they are first subjected to gene annotation running Prodigal v2.6.3+ (25) via Prokka v1.14.6+ (26), the latter used just for its gene naming feature. Next, BUSCO v5.4.0+ (27) is run on the proteomes, indicating the database of interest, which is automatically downloaded, to obtain the number of missing and fragmented expected single-copy orthologs. Subsequently, seqkit stats v2.2.0+ (28) is run to compute the number of contigs and N50 for each assembly. After these steps, strains that do not fulfill all the following user-specified thresholds are discarded from subsequent analysis: maximum number of missing (default 2%) or fragmented (default 100%) BUSCO orthologs, minimum N50 (default 50,000 [29]), and maximum number of contigs (default 200 [29]).
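The filtering rule described above can be illustrated with a minimal Python sketch. It is not Gempipe's own code: the `AssemblyStats` record, the `passes_quality` function, and the strain names are hypothetical, and treating the thresholds as inclusive is an assumption. Only the four default values come from the text.

```python
from dataclasses import dataclass


@dataclass
class AssemblyStats:
    """Per-strain quality metrics (hypothetical record, for illustration only)."""
    strain: str
    busco_missing_pct: float      # % of expected single-copy orthologs missing
    busco_fragmented_pct: float   # % of expected single-copy orthologs fragmented
    n50: int
    n_contigs: int


def passes_quality(stats, max_missing_pct=2.0, max_fragmented_pct=100.0,
                   min_n50=50_000, max_contigs=200):
    """Return True only if every threshold is fulfilled (defaults from the text).

    Assumption: thresholds are inclusive (<= for maxima, >= for minima).
    """
    return (stats.busco_missing_pct <= max_missing_pct
            and stats.busco_fragmented_pct <= max_fragmented_pct
            and stats.n50 >= min_n50
            and stats.n_contigs <= max_contigs)


# Made-up example values, one failing criterion per discarded strain.
assemblies = [
    AssemblyStats("strain_A", 0.8, 1.2, 120_000, 45),   # passes all thresholds
    AssemblyStats("strain_B", 3.5, 0.5, 200_000, 30),   # too many missing BUSCOs
    AssemblyStats("strain_C", 1.0, 2.0, 30_000, 150),   # N50 too low
    AssemblyStats("strain_D", 0.5, 0.5, 80_000, 350),   # too many contigs
]
kept = [a.strain for a in assemblies if passes_quality(a)]
print(kept)
```

With the default 100% limit on fragmented orthologs, that criterion never discards a strain unless the user lowers it.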

When a proteome is obtained for each strain, amino acid sequences are grouped into clusters based on high global sequence identity (90%). This is done by using CD-HIT v4.8.1+ (30) with parameters -M -g 1 -aL 0.70 -aS 0.70 -d 0 -c 0.90, obtaining a representative sequence for each cluster. Using the clustering information, an initial gene presence/absence matrix (PAM) is created, with the cluster IDs in rows, strains in columns, and IDs of strain-specific member genes in cells.
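As an illustration of the PAM layout described above (cluster IDs in rows, strains in columns, member gene IDs in cells), the following sketch builds such a matrix from cluster membership. The function name, the input dictionaries, and all identifiers are hypothetical; parsing of the actual CD-HIT output is deliberately left out.

```python
def build_pam(cluster_members, gene_to_strain):
    """Build a presence/absence matrix as nested dictionaries.

    cluster_members: dict mapping cluster ID -> list of member gene IDs
    gene_to_strain:  dict mapping gene ID -> strain ID
    Returns: dict cluster ID -> dict strain ID -> list of member gene IDs
             (an empty list means the strain has no gene in that cluster).
    """
    strains = sorted(set(gene_to_strain.values()))
    pam = {}
    for cluster_id, genes in cluster_members.items():
        row = {strain: [] for strain in strains}
        for gene in genes:
            row[gene_to_strain[gene]].append(gene)
        pam[cluster_id] = row
    return pam


# Hypothetical clustering result for two strains.
cluster_members = {
    "C0": ["A_001", "B_007"],            # one gene per strain
    "C1": ["A_002"],                     # present only in strain_A
    "C2": ["B_010", "B_011", "A_005"],   # two copies in strain_B
}
gene_to_strain = {
    "A_001": "strain_A", "A_002": "strain_A", "A_005": "strain_A",
    "B_007": "strain_B", "B_010": "strain_B", "B_011": "strain_B",
}
pam = build_pam(cluster_members, gene_to_strain)
for cluster_id, row in pam.items():
    print(cluster_id, row)
```

Keeping gene IDs in the cells, rather than a plain 0/1 flag, is what later allows cluster IDs in the pan-GSMM to be replaced by strain-specific genes.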

Next, a three-step gene recovery is applied to mitigate possible errors arising from genome assembly or gene calling. Three scenarios are addressed (see Supplementary Information 1.1.1 at https://doi.org/10.5281/zenodo.15544430): (i) a premature stop codon that breaks a protein sequence in two pieces; (ii) sequences located in genomic regions overlooked by the gene caller; (iii) sequences overlapping an annotated gene (31) and a previously overlooked region. The PAM is updated with the recovered sequences. When proteomes are inputted, they are directly used in subsequent analysis, and no gene recovery is performed.

Reference-free draft pan-GSMM generation

The representative sequences of clusters are processed to create a draft pan-GSMM using a reference-free reconstruction approach. This pan-GSMM is used to expand the optional reference GSMM with new strain-specific contents; alternatively, when no reference GSMM is provided, it is directly used in downstream analysis.

This reference-free reconstruction phase is based on the bacterial universes (Gram-positive or negative) provided by CarveMe v1.5.2 (5). Even if Gempipe and CarveMe share the same universes (and underlying gene database), the reference-free reconstruction algorithm differs: Gempipe is more conservative and accounts for different gene isoforms while preserving the original enzyme complexes as defined in BiGG (32), a database of manually curated GSMMs, leading to enhanced gene-to-reaction association (GPR) rules (see Supplementary Information 1.1.2 at https://doi.org/10.5281/zenodo.15544430).

Expanded reference-based draft pan-GSMM generation

When an optional reference GSMM is provided (together with its associated proteome), it is used as the cornerstone of the reconstruction process. First, reference gene IDs are translated into cluster IDs, making the reference GSMM compatible with the PAM and the previously made reference-free draft pan-GSMM. This translation is based on ortholog determination via blastp (20) BRH alignments between each strain and the reference proteome, similarly to the approach suggested in reference 19 (see Supplementary Information 1.1.3 at https://doi.org/10.5281/zenodo.15544430).
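The best reciprocal hits criterion can be sketched as follows. This is a generic illustration, not Gempipe's implementation: the function names and sequence IDs are hypothetical, and ranking hits by bit score is an assumption (identity or coverage filters, which a real pipeline would likely apply, are omitted).

```python
def best_hits(hits):
    """Map each query to its top-scoring subject.

    hits: iterable of (query_id, subject_id, bitscore) tuples, e.g. parsed
    from tabular blastp output. Assumption: the best hit is the one with the
    highest bit score; ties keep the first hit encountered.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(strain_vs_ref, ref_vs_strain):
    """Return strain gene -> reference gene pairs that are best hits both ways."""
    forward = best_hits(strain_vs_ref)
    backward = best_hits(ref_vs_strain)
    return {q: s for q, s in forward.items() if backward.get(s) == q}


# Hypothetical alignment results (query, subject, bit score).
strain_vs_ref = [("s1", "r1", 200), ("s1", "r2", 150),
                 ("s2", "r2", 90), ("s3", "r1", 120)]
ref_vs_strain = [("r1", "s1", 210), ("r1", "s3", 100),
                 ("r2", "s1", 80), ("r2", "s2", 95)]

orthologs = reciprocal_best_hits(strain_vs_ref, ref_vs_strain)
print(orthologs)
```

In this toy case `s3` has `r1` as its best hit, but `r1` prefers `s1`, so `s3` is not called an ortholog of `r1`.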

The translated reference GSMM is then expanded with new gene clusters, reactions, and metabolites taken from the reference-free draft pan-GSMM used as a repository of new contents, leading to the production of an expanded reference-based draft pan-GSMM. The latter inherits from the reference key features, such as the non-growth associated maintenance energy (4) and the biomass equation (33), which would be otherwise inherited from the CarveMe universe (5). Moreover, during the expansion phase, curated information contained in the reference GSMM in terms of metabolites’ mass/charge and reactions’ balancing is respected (see Supplementary Information 1.1.4 at https://doi.org/10.5281/zenodo.15544430).

The final draft pan-GSMM is subjected to reannotation of its metabolites, reactions, genes, and Systems Biology Ontology terms (34). This facilitates prospective uses of the output GSMMs and can lead to better scores in the community standard test suite MEMOTE (35). This reannotation is mainly (but not exclusively) based on MetaNetX v4.4 (36) (see Supplementary Information 1.1.5 at https://doi.org/10.5281/zenodo.15544430).

Derivation of strain-specific GSMMs

Once the pan-GSMM has been sufficiently curated (the Gempipe API can be used for this task, see Supplementary Information 1.1.6 at https://doi.org/10.5281/zenodo.15544430), it is inputted into “gempipe derive” together with the PAM, producing a strain-specific GSMM for each strain. Briefly, for each strain, a copy of the pan-GSMM is made. If the strain has no genes in a cluster, the cluster is removed, potentially leading to the loss of associated reactions. Similarly, if all genes in a cluster have premature stop codons, the cluster is removed. Next, reactions are iterated while updating their GPR: each remaining cluster ID is replaced by the corresponding strain-specific genes.
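The derivation step can be illustrated with a simplified sketch that rewrites one GPR rule for one strain. It is not the code used by “gempipe derive”: the function name and all IDs are hypothetical, cluster IDs are assumed to be valid Python identifiers so that the rule can be parsed with the standard `ast` module, and the premature stop codon case is not modeled. A reaction is treated as lost when its rule can no longer be satisfied, which is an assumption about how removed clusters propagate.

```python
import ast


def derive_gpr(gpr, strain_genes):
    """Translate a cluster-based GPR into a strain-specific one.

    gpr:          rule such as "C1 and (C2 or C3)" written with cluster IDs
    strain_genes: dict cluster ID -> list of this strain's member genes
                  (one column of the PAM)
    Returns the rewritten rule, "" if the reaction had no GPR,
    or None if the rule cannot be satisfied by this strain.
    """
    if not gpr.strip():
        return ""  # reaction without GPR is kept unchanged

    def visit(node):
        if isinstance(node, ast.Name):
            genes = strain_genes.get(node.id, [])
            if not genes:
                return None  # cluster absent in this strain
            return genes[0] if len(genes) == 1 else "(" + " or ".join(genes) + ")"
        if isinstance(node, ast.BoolOp):
            parts = [visit(value) for value in node.values]
            if isinstance(node.op, ast.And):
                if any(part is None for part in parts):
                    return None  # a required subunit is missing
                return "(" + " and ".join(parts) + ")"
            kept = [part for part in parts if part is not None]
            if not kept:
                return None  # no alternative isoform left
            return kept[0] if len(kept) == 1 else "(" + " or ".join(kept) + ")"
        raise ValueError("unsupported GPR syntax")

    return visit(ast.parse(gpr, mode="eval").body)


# Hypothetical pan-GSMM rules and one strain's PAM column.
pan_gprs = {"R1": "C1 and (C2 or C3)", "R2": "C4", "R3": ""}
strain_genes = {"C1": ["g10"], "C2": [], "C3": ["g30", "g31"]}

derived = {rid: derive_gpr(rule, strain_genes) for rid, rule in pan_gprs.items()}
kept = {rid: rule for rid, rule in derived.items() if rule is not None}
print(kept)
```

Here `R2` is dropped because its only cluster has no member gene in the strain, while `R1` survives through the remaining isoform cluster `C3`.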

Each strain-specific GSMM is then gap-filled using a user-provided recipe for a medium, preferably minimal, known or assumed to support the growth of all the input strains. More than one recipe can be provided, leading to multiple rounds of gap-filling. If no media file is provided, a generic minimal aerobic medium recipe is used, having glucose, ammonia, phosphate, and sulfate as sole C, N, P, and S sources, respectively. The COBRApy (37) gap-filling algorithm is applied, using the pan-GSMM as a source of reactions and a user-selectable minimum flux through the objective reaction. Moreover, the strain-specific gap-filling step can optionally be skipped, which is useful, for example, when auxotrophies have to be studied on minimal media (24). At this point, strain-specific GSMMs have the minimum requirements to be used in simulations.

Before proceeding with “gempipe derive,” the draft pan-model should be curated, for example, by using dedicated functions of the Gempipe API. As an alternative, strain-specific GSMMs can be seamlessly produced from genomes/proteomes by using “gempipe autopilot,” which skips the manual curation by applying an automated gap-filling to the draft pan-GSMM. This gap-filling is prioritized by using penalties derived from alignment metrics (see Supplementary Information 1.1.7 at https://doi.org/10.5281/zenodo.15544430).

Multi-strain predictions and analyses

Once the strain-specific GSMMs have been obtained, specific metabolic features can be predicted, including the capability to catabolize alternative C, N, P, or S substrates, the presence of auxotrophies for amino acids and vitamins, and the potential biosynthesis of specific metabolites (see Supplementary Information 1.1.8 at https://doi.org/10.5281/zenodo.15544430). These features, together with the presence of reactions in strains, are stored as binary feature tables (BFTs). These tables have strains in columns, binary features in rows, and 1 (feature presence) or 0 (absence) in cells.

The Gempipe API contains dedicated functions for multi-strain analysis, where any number of BFTs can be inputted. Briefly, BFTs are combined into a single table, and the pairwise similarity between strains is then calculated using the Jaccard index. This produces a distance matrix, which is further processed to create a dendrogram using Ward’s agglomerative clustering (11). The latter is referred to as a “phylometabolic tree,” where strains with similar metabolic potential are placed closely. Clusters of metabolically coherent strains can be extracted from a phylometabolic tree and their characteristic features identified (see Supplementary Information 1.1.9 at https://doi.org/10.5281/zenodo.15544430). Tutorials for multi-strain analyses using the Gempipe API are available in the Gempipe documentation.
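The first half of this procedure, combining BFTs and computing Jaccard-based distances between strains, can be sketched with the standard library alone. The function names and the toy tables are hypothetical and do not reproduce the Gempipe API; feature names are assumed to be unique across tables.

```python
def combine_bfts(*tables):
    """Merge binary feature tables into one set of present features per strain.

    Each table is a dict: feature -> dict strain -> 0/1.
    """
    present = {}
    for table in tables:
        for feature, row in table.items():
            for strain, value in row.items():
                present.setdefault(strain, set())
                if value == 1:
                    present[strain].add(feature)
    return present


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty feature sets are given distance 0."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(present):
    """Return the sorted strain list and the square pairwise distance matrix."""
    strains = sorted(present)
    matrix = [[jaccard_distance(present[s], present[t]) for t in strains]
              for s in strains]
    return strains, matrix


# Toy BFTs: reaction presence and growth on an alternative C source.
reactions = {"PGI": {"A": 1, "B": 1, "C": 0},
             "XYLI1": {"A": 1, "B": 0, "C": 0}}
growth = {"growth_on_xylose": {"A": 1, "B": 0, "C": 1}}

strains, matrix = distance_matrix(combine_bfts(reactions, growth))
for strain, row in zip(strains, matrix):
    print(strain, [round(d, 3) for d in row])
```

The resulting matrix could then be passed, in condensed form, to a hierarchical clustering routine such as `scipy.cluster.hierarchy.linkage` with `method="ward"` to obtain a dendrogram; whether Gempipe does exactly this internally is not stated here.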

RESULTS

Models’ contents and similarity between tools

To evaluate contents and validate substrate usage predictions of reconstructed strain-specific GSMMs, three data sets were used: “01_klebsiella,” “02_ralstonia,” and “03_pseudomonas,” composed of 37 strains belonging to the Klebsiella pneumoniae species complex (38), 11 strains belonging to the Ralstonia solanacearum species complex (39), and 36 strains of Pseudomonas chlororaphis, respectively (see Supplementary Information 1.2.1 at https://doi.org/10.5281/zenodo.15544430). Comparisons were made with current state-of-the-art reference-free reconstruction tools, namely CarveMe (5) and gapseq (7), as well as a recent reference-based tool focused on strain-specificity studies, Bactabolize (21) (see Supplementary Information 1.2.2 at https://doi.org/10.5281/zenodo.15544430). Comparisons were focused on automation; therefore, manual curation was skipped, and “gempipe autopilot” was used.

Contents of reconstructed strain-specific GSMMs were compared against their relative manually curated reference (Fig. 2A; Fig. S1 at https://doi.org/10.5281/zenodo.15544430). Moreover, every tool was compared against each other using the mean Jaccard index of the reaction content (Fig. 2B) (see Supplementary Information 1.2.3 at https://doi.org/10.5281/zenodo.15544430).

Fig 2.

Bar graph comparing genomic reconstruction tools displaying gene, reaction, metabolite, and exchange reaction counts with hatched areas showing reference overlap. Heat map presents Jaccard similarity indices between tools based on shared reactions.

Comparison of the general reconstruction metrics. “gempipe_rf” indicates Gempipe run without a reference. (A) Content comparison. “G”: number of genes; “R”: number of reactions excluding exchange reactions; “uM”: number of unique metabolites, i.e., not considering their compartment; “exr”: number of exchange reactions. Bar height corresponds to the mean across strains, while error bars represent standard deviations. Hatched area represents the contents in common with the reference. (B) Similarity between tools based on reaction content. Cells report the mean Jaccard index across strains, computed for the reaction IDs.

Leveraging its hybrid reconstruction method (expanded reference-based), Gempipe had a generally better reference coverage than reference-free tools, comparable to purely reference-based tools (Bactabolize). On the other hand, output models went beyond the reference with a higher number of modeled contents, aligning with the reference-free tools. The mean Jaccard index confirmed a remarkable reference coverage for Gempipe, where the apparent lower performances compared to Bactabolize are only due to the addition of new reactions during the reference expansion phase. When run without a reference, Gempipe’s models were more similar to the ones produced by CarveMe, likely due to the shared BiGG-based database. However, the introduction of reactions and metabolites in Gempipe is clearly more conservative when using default parameters. gapseq models, despite having a consistently higher number of metabolites, were the most divergent from the reference. However, the conversion between SEED and BiGG IDs provided by MetaNetX (36) is not perfect, so the coverage metrics reported for gapseq could have been underestimated.

Phenotype prediction accuracy

To evaluate the ability to recapitulate phenotypic traits, publicly available binarized Biolog PM data were used as a benchmark (Fig. 3; Fig. S2; Table S1; Supplementary Information 1.2.4 at https://doi.org/10.5281/zenodo.15544430), where the kinetic signal was converted into a binary response “can grow”/“cannot grow.”

Fig 3.

Bar charts comparing substrate utilization prediction metrics including true and false results between experimental and simulated Biolog PM growth assays for Klebsiella pneumoniae, Ralstonia solanacearum, and Pseudomonas chlororaphis strains.

Comparison between experimental and simulated Biolog PM growth assays. Bar height corresponds to the mean across strains, while error bars represent standard deviations. “gempipe_rf” indicates Gempipe run without a reference. The “_pan” suffix indicates the use of a manually curated pan-GSMM as the reference model, instead of a strain-specific one. Data set composition is the following: “01_klebsiella,” 37 strains belonging to the Klebsiella pneumoniae species complex; “02_ralstonia,” 11 strains belonging to the Ralstonia solanacearum species complex; “03_pseudomonas,” 36 strains of Pseudomonas chlororaphis. (A) Outcome of the comparison considering single substrates. TP, true positive; TN, true negative; FP, false positive; FN, false negative. (B) Overall metrics for the substrate utilization prediction.

In general, the accuracy of Gempipe was better than or in line with that of the other tools (Fig. 3B), particularly when using its hybrid reconstruction mode. While CarveMe was close to Gempipe in terms of mean accuracy, it was observed that its internal gap-filling algorithm, designed to “enforce network connectivity” (5), tended to maximize the number of substrates for which growth is predicted; this led not only to fewer FNs (Fig. 3A), resulting in better recall (Fig. 3B), but also to more FPs, resulting in detrimental specificity. Despite using the same CarveMe assets (gene database and reaction universes), the different implementation of Gempipe better represented the no-growth phenotypes. Benefits are even more evident when modeling genera still without reference GSMMs deposited in BiGG (32), as they cannot be represented in CarveMe’s assets, like, for example, in “02_ralstonia.” Despite the lowering of the default identity threshold and the medium-specific gap-filling step in common with the other tools (see Supplementary Information 1.2.2 at https://doi.org/10.5281/zenodo.15544430), Bactabolize-generated reaction networks were excessively gapped in the data set “02_ralstonia,” preventing any positive growth prediction (both TPs and FPs) and questioning the utility of Bactabolize-generated models for this single data set.
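The trade-off between recall and specificity discussed above follows directly from the standard definitions of these metrics. The sketch below uses made-up substrate names and growth calls (not data from this study) to show how a predictor that reports growth on every substrate reaches perfect recall at the cost of zero specificity. The function names are hypothetical.

```python
def confusion_counts(experimental, simulated):
    """Count TP, TN, FP, FN over substrates shared by both dictionaries.

    Both arguments map substrate -> bool ("can grow" / "cannot grow").
    """
    tp = tn = fp = fn = 0
    for substrate, observed in experimental.items():
        if substrate not in simulated:
            continue  # substrates that cannot be simulated are skipped
        predicted = simulated[substrate]
        if observed and predicted:
            tp += 1
        elif not observed and not predicted:
            tn += 1
        elif predicted:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def binary_metrics(tp, tn, fp, fn):
    """Standard classification metrics; undefined ratios are returned as NaN."""
    def ratio(num, den):
        return num / den if den else float("nan")
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Made-up binarized growth data for four substrates.
experimental = {"glucose": True, "xylose": False, "lactose": True, "citrate": False}
permissive = {substrate: True for substrate in experimental}  # growth everywhere
conservative = {"glucose": True, "xylose": False, "lactose": False, "citrate": False}

permissive_metrics = binary_metrics(*confusion_counts(experimental, permissive))
conservative_metrics = binary_metrics(*confusion_counts(experimental, conservative))
print(permissive_metrics)
print(conservative_metrics)
```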

When a manually curated pan-GSMM was used as the reference model instead of a strain-specific one, the performances of both Gempipe and Bactabolize substantially improved (Fig. 3). Specifically, the purely reference-based reconstruction method implemented in Bactabolize, which subsets a single, coherent set of reference reactions, led to slightly better performances compared to Gempipe. The latter, instead, has to cope with the expansion of the reference, which can introduce spurious reactions affecting simulations. While investing time and resources in the manual curation of pan-models surely pays off in the long term (23), Gempipe provides the overall best-performing option when only a strain-specific curated model is available as a reference, which is the most common case, and simultaneously provides the first draft of the pan-model, thanks to its reference-expansion capabilities, on which the manual curation can subsequently be based.

In addition to substrate usage predictions, gene essentiality predictions were also evaluated (Fig. 4; Fig. S3; Table S2 at https://doi.org/10.5281/zenodo.15544430). In this case, three additional data sets were used, providing transposon insertion sequencing outcomes used as a benchmark: “04_streptococcus,” “05_pseudomonas,” and “06_pseudomonas,” composed of 17 strains of Streptococcus pneumoniae grown on THY rich medium (40), 9 strains of Pseudomonas aeruginosa grown on LB rich medium, and the same 9 strains grown on M9 minimal medium (41), respectively (see Supplementary Information 1.2.1 at https://doi.org/10.5281/zenodo.15544430).

Fig 4.

Bar charts showing comparison between simulated and experimental gene-essentiality assays across bacterial strains in different media. It displays prediction outcomes with accuracy metrics for Streptococcus pneumoniae and Pseudomonas aeruginosa strains.

Comparison between experimental and simulated gene essentiality assays. Bar height corresponds to the mean across strains, while error bars represent standard deviations. “gempipe_rf” indicates Gempipe run without a reference. Data set composition is the following: “04_streptococcus,” 17 strains of Streptococcus pneumoniae grown on rich THY medium; “05_pseudomonas,” 9 strains of Pseudomonas aeruginosa grown on rich LB medium; “06_pseudomonas,” the same 9 strains of P. aeruginosa but grown on minimal M9 medium. (A) Comparison between modeled genes (background gray bars, with gray points representing individual strains) and modeled genes for which an experimental outcome is available (colored bars, with black points representing individual strains). (B) Outcome of the comparison considering single genes. TP, true positive; TN, true negative; FP, false positive; FN, false negative. (C) Overall metrics for the gene essentiality prediction.

In general, Gempipe modeled a higher number of genes compared to the other tools when leveraging its hybrid reconstruction mode (Fig. 4A). In the context of gene essentiality, reference-based approaches seemed to give better performance; however, while both Gempipe and Bactabolize led to higher precision, the number of modeled genes was much lower in Bactabolize, despite the lowering of the default identity threshold (see Supplementary Information 1.2.2 at https://doi.org/10.5281/zenodo.15544430). Again, the particular gap-filling strategy of CarveMe (see above) emerged: in this context, it hindered the detection of essential genes, leading to lower recall with respect to the reference-free run of Gempipe, which uses the exact same CarveMe assets but a more conservative gap-filling (Fig. 4B); this difficulty in detecting essential genes is even more accentuated when the genus is not accounted for in BiGG (32), nor consequently in CarveMe’s assets, as in “04_streptococcus.” In this scenario, Gempipe gave a clear improvement in overall accuracy; otherwise, it was in line with the other tools (Fig. 4C). Finally, when reconstruction and simulation of the same strains were based on a minimal medium instead of a rich one, the number of essential genes increased as expected (compare “05_pseudomonas” with “06_pseudomonas”).

Remaining orphan reactions

The quality of reconstructions may also be evaluated by the number of modeled metabolic reactions (not exchanges, sinks, nor demands) which have not been associated with genes and, at the same time, have not been labeled as spontaneous. These reactions, also known as “orphan” reactions (42), can be left by internal gap fillers of automated reconstruction tools to improve network connectivity (5). Ideally, the number of orphans should be minimized by manually checking them and reassociating them with the corresponding genes: a high number of orphans may indicate insufficient curation and excessive reliance on gap-filling.
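The definition of orphan reactions given above can be expressed as a small counting routine. This is a sketch under stated assumptions: the dictionary layout, the `spontaneous` flag, and the reaction entries are hypothetical, and the fraction is computed over all non-boundary reactions, which is one possible reading of “relative number.” The exclusion of reactions whose name contains “diffusion” follows the Fig. 5 legend.

```python
BOUNDARY_KINDS = {"exchange", "sink", "demand"}


def is_orphan(reaction):
    """True for a metabolic reaction with no GPR that is not spontaneous.

    reaction: dict with keys "kind", "gpr", "name" and an optional
    "spontaneous" flag (hypothetical layout, for illustration only).
    """
    if reaction["kind"] in BOUNDARY_KINDS:
        return False
    if reaction["gpr"].strip():
        return False
    if reaction.get("spontaneous", False):
        return False
    if "diffusion" in reaction["name"]:
        return False
    return True


def orphan_fraction(reactions):
    """Fraction of orphans among non-boundary reactions (assumed denominator)."""
    metabolic = [r for r in reactions if r["kind"] not in BOUNDARY_KINDS]
    if not metabolic:
        return 0.0
    return sum(is_orphan(r) for r in metabolic) / len(metabolic)


# Made-up reaction list.
reactions = [
    {"id": "PGI", "kind": "metabolic", "gpr": "g1", "name": "Glucose-6-phosphate isomerase"},
    {"id": "RXN_GAPFILLED", "kind": "metabolic", "gpr": "", "name": "Gap-filled reaction"},
    {"id": "H2Ot", "kind": "metabolic", "gpr": "", "name": "H2O transport via diffusion"},
    {"id": "SPONT_X", "kind": "metabolic", "gpr": "", "name": "Spontaneous conversion",
     "spontaneous": True},
    {"id": "EX_glc__D_e", "kind": "exchange", "gpr": "", "name": "D-Glucose exchange"},
]
orphans = [r["id"] for r in reactions if is_orphan(r)]
fraction = orphan_fraction(reactions)
print(orphans, fraction)
```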

The presence of orphan reactions was compared (Fig. 5), and Gempipe reconstructions contained the lowest fraction in every data set. In Bactabolize reconstructions, orphans were copied from the reference, but the fraction in “02_ralstonia” and “03_pseudomonas” data sets was inflated due to the gap-filling. In Gempipe, orphans are also copied from the reference, but during the reference expansion phase, their GPRs are supplemented with missing gene clusters taken from the independent reference-free reconstruction. This resulted in a fraction of orphans lower than the reference in all three data sets, and even lower with respect to the reference-free tools. With respect to CarveMe, in particular, the difference seemed to be more accentuated when the reference was not part of the BiGG collection (32), as in the “02_ralstonia” data set. When Gempipe was run without a reference, the number of orphans reached its minimum; in this reconstruction mode, remaining orphans are a consequence of the two biomass-centered gap-filling steps: the first applied to the draft pan-GSMM, and the second to the strain-specific GSMMs.

Fig 5.

Bar chart compares reactions modeled as orphans across strains. Error bars show standard deviations. Hatched segments indicate orphans shared with reference model.

Relative number of modeled reactions with no GPR (orphans). Exchange, sink, and demand reactions are excluded, as well as reactions containing the substring “diffusion” in their name. Bar height corresponds to the mean between strains, while error bars represent standard deviations. The hatched area represents orphans in common with the reference.

Metabolic biodiversity of Limosilactobacillus reuteri

Limosilactobacillus reuteri (Lr, formerly Lactobacillus reuteri) is a species of lactic acid bacteria (LAB) adapted to the gastrointestinal tract (GIT) of vertebrates and widely studied for its probiotic potential (43). Recently, six subspecies of Lr were formally proposed by Li et al. (43) and characterized both phylogenetically and phenotypically. Subspecies reflect the host: Lr subsp. reuteri is adapted to humans and herbivores; Lr subsp. kinnaridis to humans and poultry; Lr subsp. porcinus and Lr subsp. suis to pig; Lr subsp. murium and Lr subsp. rodentium to rodents (43).

Gempipe was used to explore the metabolic biodiversity of Lr at the strain level. A total of 1,056 Lr assemblies, of which 597 derived from metagenomes (MAGs), were retrieved. Contigs belonging to other species were removed from MAGs (see Supplementary Information 1.3.1 at https://doi.org/10.5281/zenodo.15544430). The 545 genomes that remained after taxonomy and quality filtering (see Table S3 at https://doi.org/10.5281/zenodo.15544430) were assigned to subspecies based on ANI thresholds reported in reference 43: 69 were classified as kinnaridis, 63 as reuteri, 52 as rodentium, 21 as suis, 2 as porcinus, and 115 as murium, while the remaining genomes were not classified (see Fig. S4 at https://doi.org/10.5281/zenodo.15544430).

A curated GSMM for Lr JCM1112 (44) was used as a reference in Gempipe (see Supplementary Information 1.3.2 at https://doi.org/10.5281/zenodo.15544430). Due to the hybrid reconstruction mode, it was expanded with new strain-specific contents, generating a draft pan-GSMM. From the latter, 545 GSMMs were derived. BFTs for metabolic reactions, auxotrophies, and growth on alternative C sources were used to build a phylometabolic tree (see Fig. S5 at https://doi.org/10.5281/zenodo.15544430). Since vitamin B12 is unrelated to the ecological niche/subspecies (45), the 24 reactions forming the B12 biosynthetic pathway were removed before generating the tree. A number of clusters equal to the number of subspecies (six) was extracted. Clusters were generally consistent with the subspecies: reuteri and porcinus were substantially contained in Cluster_3, kinnaridis in Cluster_5, suis in Cluster_1, murium and rodentium in Cluster_2 (Fig. 6A).

Fig 6.

Multi-strain analysis using the Gempipe API functions. (A) Phylometabolic tree built using presence/absence data for reactions, auxotrophies, and growth on alternative C substrates. Six clusters of metabolically coherent strains were extracted from the tree and shown alongside the subspecies attribute. Only features not constant across strains are represented. (B) Potential strain-specific production of health-related metabolites, vitamin B12 (“adeadocbl_c”), reuterin (“3hppnl_c”), and histamine (“hista_c”), and presence of urease (“UREA”). The subspecies attribute is reported.

Consumption data for 49 C substrates characterizing the six subspecies were reported by Li et al. (43) (“general” data set). These data were derived by the same authors from experiments on 12 strains, two for each subspecies (“specific” data set) (see Table S4 at https://doi.org/10.5281/zenodo.15544430). The “specific” data set was compared with strain-specific simulations, which resulted in 93% accuracy (mean computed on 11 strains, as GCA_000712565.2 was quality-filtered). The “general” data set was compared with the feature’s relative frequency within clusters (see Supplementary Information 1.3.3 at https://doi.org/10.5281/zenodo.15544430); this comparison revealed interesting discrepancies between literature and modeled phenotypes.
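
As a rough illustration of how a mean per-strain accuracy of this kind can be computed, the sketch below compares predicted and observed growth calls strain by strain and averages over the strains present in both tables. The function names, substrate IDs, and values are hypothetical and are not taken from the paper's data.

```python
def confusion_counts(predicted, observed):
    """Count (TP, TN, FP, FN) over substrates present in both dicts."""
    tp = tn = fp = fn = 0
    for substrate in predicted.keys() & observed.keys():
        p, o = predicted[substrate], observed[substrate]
        if p and o:
            tp += 1
        elif not p and not o:
            tn += 1
        elif p and not o:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(predicted, observed):
    """Fraction of shared substrates where prediction matches observation."""
    tp, tn, fp, fn = confusion_counts(predicted, observed)
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else float("nan")


def mean_accuracy(predictions, observations):
    """Mean of per-strain accuracies over strains found in both tables."""
    shared = sorted(predictions.keys() & observations.keys())
    return sum(accuracy(predictions[s], observations[s]) for s in shared) / len(shared)


# Toy data: strain_3 has experimental data but no model (e.g., filtered out).
observed = {
    "strain_1": {"glc": True, "gal": True, "xyl": False, "lac": True},
    "strain_2": {"glc": True, "gal": False, "xyl": True, "lac": True},
    "strain_3": {"glc": True, "gal": True, "xyl": True, "lac": False},
}
predicted = {
    "strain_1": {"glc": True, "gal": False, "xyl": False, "lac": True},
    "strain_2": {"glc": True, "gal": False, "xyl": True, "lac": True},
}
result = mean_accuracy(predicted, observed)
```

Averaging per strain, rather than pooling all substrate calls, gives each strain equal weight regardless of how many substrates were tested for it.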

For example, growth on galactose is reported as a species capability (43); however, ~8% of strains (n = 45), mainly belonging to the murium and rodentium subspecies, were predicted to be unable to catabolize this substrate. Upon further investigation, these strains consistently lacked the aldose 1-epimerase, possibly explaining the deficient phenotype (see Supplementary Information 2.1 at https://doi.org/10.5281/zenodo.15544430). Among compared substrates, D-xylose was particularly interesting for three reasons: (i) its catabolic route was not in the reference GSMM, but it was automatically included by Gempipe during the reference expansion phase; (ii) its utilization was predicted to have high variability across strains, ranging from ~9% in murium to ~88% in rodentium; (iii) its simulation accuracy with the “specific” data set was 100%. For instance, only ~38% of the 21 strains strictly classified as suis were predicted to grow on D-xylose, while this is reported to hold for the entire subspecies (43). This further corroborated the idea that phenotypic descriptions of species/subspecies provided in the literature are not always valid, possibly because they are based on too few strains (two in this case) for generalization, or because of the limited reliability of phenotypic testing (46). In this context, the GSMM-based analysis gave more comprehensive results than traditional phenotypic characterization.
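
The comparison between a subspecies-level literature claim and the modeled phenotypes can be sketched as a per-group relative frequency. The code below is a minimal illustration with invented strain IDs, labels, and values; it is not the procedure documented in the Supplementary Information.

```python
from collections import defaultdict


def relative_frequency(feature, groups):
    """Fraction of strains in each group for which the feature is True.

    feature: dict mapping strain ID -> bool (e.g., predicted growth).
    groups: dict mapping strain ID -> group label (subspecies or cluster).
    Strains without a prediction are ignored.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for strain, group in groups.items():
        if strain not in feature:
            continue
        totals[group] += 1
        positives[group] += bool(feature[strain])
    return {group: positives[group] / totals[group] for group in totals}


# Toy data: hypothetical predictions of growth on one substrate.
grows = {"s1": True, "s2": False, "s3": False, "s4": False, "r1": True, "r2": True}
subspecies = {
    "s1": "suis", "s2": "suis", "s3": "suis", "s4": "suis",
    "r1": "rodentium", "r2": "rodentium",
}
freq = relative_frequency(grows, subspecies)

# Groups reported as uniformly positive (hypothetical literature claim)
# but predicted positive in only part of their strains.
reported_positive = {"suis", "rodentium"}
discrepancies = sorted(g for g in reported_positive if freq[g] < 1.0)
```

A frequency well below 1 for a group described as positive flags a candidate discrepancy that still needs to be checked against the underlying genes and reactions.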

In humans, Lr is found in many sites other than the GIT, including breast milk, skin, and urinary tract. Lr is reported as a probiotic, and some of the metabolites it produces have a proven health-related effect (47, 48). For example, reuterin (a mixture of different forms of 3-hydroxypropionaldehyde) is released by several strains and exerts antimicrobial activity against Gram-negative bacteria, making Lr effective against GIT infections (47, 49). Histamine is another strain-specific metabolite, acting as an intestinal immunomodulator and anti-inflammatory agent (47, 49). Vitamin B12 (cobalamin) is an essential vitamin in humans, introduced with the diet; only four B12-producing Lr strains were clearly identified as of 2018 (49). Apart from health-related metabolite production, other metabolic features are of interest to assess adaptation strategies in different hosts; one of them is the conversion of urea to ammonium and CO2 (urease), likely involved in acid resistance (50).

Gempipe enabled a systematic evaluation of these four key metabolic features in all 545 filtered strains (Fig. 6B). The majority of rodent-associated strains (Lr subsp. murium and rodentium) were predicted to be capable of urease activity, probably needed for survival in the rodent GIT (51, 52); this is in accordance with a recent genome-wide association study (53). Moreover, most but not all of the human- and poultry-associated strains were predicted to be potential producers of reuterin and vitamin B12, the underlying genes being part of the same pdu-cbi-cob-hem gene cluster (51). Interestingly, histamine production was an exclusive feature of subsp. reuteri (43), with only a few reuteri strains lacking this trait.

DISCUSSION

The present work introduced Gempipe, a multi-purpose package for pan and multi-strain genome-scale metabolic modeling. It adopts a hybrid reconstruction approach that lies between reference-free and reference-based methods, as an optional reference is expanded with new contents coming from a universal GSMM. Together with an internal clustering of strain-specific genes, an effective in-depth gene recovery (see Supplementary Information 2.2 at https://doi.org/10.5281/zenodo.15544430), and conservative generation of GPRs, the implemented approach proved to be effective for multi-strain reconstructions, with performance better than or similar to that of established reconstruction tools when focusing on metabolic features without considering manual curation. A summary of the main features of Gempipe compared to the benchmarked tools is provided in Table S8, https://doi.org/10.5281/zenodo.15544430.

Gempipe represents a third option in the panorama of reconstruction tools. Reference-based methods use a manually curated model (or a sorted list of models, from the phylogenetically closest to the most curated) as a template, from which reactions are copied after ortholog genes are determined (39, 54, 55). Reference-free methods instead use a semi-curated universal model with a generic biomass equation, from which reactions are copied and gap-filled based on sequence homology (one-way alignment) (5, 7). From the first approach, Gempipe inherits the ortholog determination to translate reference genes into equivalent gene clusters; from the second, Gempipe inherits the homology-based insertion of new, strain-specific reactions, with GPRs already based on gene clusters.
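
The difference between ortholog determination and a one-way alignment can be made concrete with a small sketch. Bidirectional best hits (BBH) is used here as one common way to call orthologs; this is an assumption for illustration and is not stated to be the exact procedure of Gempipe or of the cited tools. Gene IDs and bit scores are invented.

```python
def best_hits(hits):
    """Return query -> best-scoring subject from (query, subject, score) tuples."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(ref_vs_strain, strain_vs_ref):
    """Keep reference->strain pairs that are each other's best hit."""
    forward = best_hits(ref_vs_strain)
    backward = best_hits(strain_vs_ref)
    return {ref: gene for ref, gene in forward.items() if backward.get(gene) == ref}


# Toy alignment results (hypothetical gene IDs and scores).
ref_vs_strain = [("refA", "g1", 200.0), ("refA", "g2", 50.0), ("refB", "g1", 120.0)]
strain_vs_ref = [("g1", "refA", 200.0), ("g1", "refB", 120.0), ("g2", "refA", 50.0)]

one_way = best_hits(ref_vs_strain)
orthologs = bidirectional_best_hits(ref_vs_strain, strain_vs_ref)
```

In this toy case the one-way search maps both reference genes onto the same strain gene, while the bidirectional criterion keeps only the reciprocal pair, which is the more conservative call.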

In the case of multi-strain reconstruction, reference-based methods are usually applied (11–18). The main concern, in this context, is that the template must be representative of the metabolic diversity of the entire species (or genus) (21); otherwise, strain-specific reactions must be added afterward to each generated model (19). This is why a comprehensive (pan) GSMM is usually curated prior to running reference-based reconstruction tools (23). However, such pan-GSMMs representing a species (or genus) are complex to build and still rare in the literature (12, 18, 23, 24), with as few as four species reported (21). Indeed, in some GSMM-based biodiversity studies, strain-specific models have been used as a reference (15–17). This is a limiting approach, as the generated models will just be a subset of a single strain's model. In this context, Gempipe was able to grasp strain-specificities better than purely reference-based methods like Bactabolize (21) when a strain-specific GSMM is used as reference instead of a pan-GSMM, which is a common case. Moreover, by inheriting contents from a manually curated reference, the models generated by Gempipe provided more accurate predictions than those built with reference-free methods like CarveMe (5) and gapseq (7). All this was achieved while minimizing the number of orphan reactions and maximizing the MEMOTE metrics (35) (see Supplementary Information 2.3 at https://doi.org/10.5281/zenodo.15544430).

Given the limited number of pan-GSMMs, Gempipe also emerges as a valuable tool to quickly build and curate pan-GSMMs to be used in biodiversity or bioprospecting studies. It must be noted, however, that the concept of pan-GSMM was also introduced in the context of metagenomics-derived models (56), which is remarkably different. In this context, indeed, pan-GSMMs are instrumental to cope with the incompleteness and contamination of MAGs, and strain-specificity (in terms of genes and reactions) is lost in favor of a consensus/mean reconstruction representing a species-level genome bin (56). Instead, in the GSMM-guided exploration of biodiversity, the context where Gempipe operates, strain-specificity must be retained and emphasized as it is needed for the subsequent generation of strain-specific GSMMs (19).

In the development of Gempipe, particular attention was paid to how metabolic features are encoded in the model. In this regard, it must be noted that metrics commonly used to evaluate and compare reconstructions (accuracy, precision, recall, specificity) (5, 7, 21) do not take into consideration the faithfulness of the reaction network in representing the organism. Indeed, behind a true positive match with the experimental data (e.g., Biolog screenings), there could be cases of (i) wrong reaction mechanisms (e.g., erroneous transporter types); (ii) presence of metabolic reactions not supported by genes (orphans); (iii) reactions with seriously impaired GPRs, lacking many components of a protein complex. Given that manual curation remains essential for obtaining truthful GSMMs, users should be aware of the reconstruction principles that each tool follows to draft a model. Gempipe includes reactions only when a protein complex is fully supported by genes; otherwise, mismatches (false negatives) will guide the manual curation in closing the gaps. On the other hand, tools providing (i) internal gap-filling procedures aimed at improving the network connectivity and (ii) overly permissive mechanisms of GPR generation and reaction inclusion could lead not only to the prediction of a higher number of growth-supporting substrates (higher recall and lower specificity) but, most importantly, to a match with the experimental data obtained through a biologically inaccurate representation of the metabolism, which can be easily overlooked during manual curation.
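
The conservative inclusion rule can be illustrated by evaluating a GPR as a boolean expression in which "and" joins the subunits of a complex and "or" joins isozymes. The sketch below is an assumption about how such a rule can be checked, not Gempipe's implementation; gene IDs are hypothetical and assumed to be valid Python identifiers, and treating an empty GPR as unsupported is a choice made only for this example.

```python
import ast


def gpr_satisfied(gpr, present_genes):
    """Evaluate a boolean GPR rule against a set of present gene IDs.

    'and' requires every subunit of a complex; 'or' accepts any isozyme.
    An empty rule (orphan reaction) is treated here as not supported.
    """
    def evaluate(node):
        if isinstance(node, ast.BoolOp):
            values = [evaluate(v) for v in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.Name):
            return node.id in present_genes
        raise ValueError(f"unsupported GPR element: {ast.dump(node)}")

    if not gpr.strip():
        return False
    return evaluate(ast.parse(gpr, mode="eval").body)


# A two-subunit complex with one alternative isozyme (hypothetical IDs).
rule = "(subA and subB) or iso1"
partial_complex = gpr_satisfied(rule, {"subA"})
full_complex = gpr_satisfied(rule, {"subA", "subB"})
isozyme_only = gpr_satisfied(rule, {"iso1"})
orphan = gpr_satisfied("", {"subA", "subB"})
```

Under this rule a reaction whose complex is only partly encoded is left out, so the resulting false negative points the curator to the missing subunit instead of being masked by a permissive match.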

Finally, Gempipe is not meant to be just a reconstruction tool, but also an analysis tool in the context of biodiversity exploration. Indeed, while the Gempipe API includes functions to curate models, it also contains functions to analyze the deck of strain-specific GSMMs in output. For example, functions are included to cluster strains according to their metabolism and to visually compare metabolic clusters with respect to other attributes, such as niche metadata or the formal species or subspecies classification. This helps users achieve goals including (i) the screening of strains for desired metabolic traits, (ii) the classification of strains according to their metabolic capabilities, and (iii) the definition of species and subspecies, and possibly other taxonomic ranks, in terms of their core metabolic potential. In this context, the case study on L. reuteri reported here also provided insights from a taxonomic point of view (see Supplementary Information 3.1 at https://doi.org/10.5281/zenodo.15544430).

Limitations of Gempipe are mostly due to the resources it relies on. The BiGG namespace (57) was adopted as it was convenient for two main reasons: (i) the availability of high-quality, manually curated reference models based on the same namespace (32); (ii) the human-readable IDs, particularly useful during manual curation, when GSMM-based metabolic maps have to be hand-drawn or interpreted (58, 59) (for example, D-glucose is “glc__D,” immediately recognizable with respect to its ModelSEED (60) equivalent “cpd00027”). While convenient, the BiGG database has imperfections due to its structure: it is not a coherent biochemical database, but rather a collection of models from which a biochemical database is derived. Depending on the model of origin, the same metabolite or reaction can be defined differently. Consequences are many: (i) the same reaction can be represented with different reversibility (e.g., “ILETA2”); (ii) the same metabolite can be represented with different IDs, leading to duplicate metabolites (e.g., “ind_c”/“indole_c”); (iii) the same reaction can use duplicate metabolites, leading to duplicate reactions (e.g., “TRPS2”/“TRPS2_1”); (iv) metabolites with the same ID can be represented with different chemical formula or charge (e.g., “fmn_c”). Therefore, the integration of a BiGG-based model with reactions coming from another BiGG-based model can lead to the introduction of unbalanced reactions or, even worse, stoichiometric inconsistencies (61). Gempipe tries to circumvent this issue by enabling users to superimpose particular metabolite charges and formulas or reaction balances during the reference expansion phase.
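
The effect of inconsistent metabolite definitions on reaction balance can be shown with a small mass and charge check. The sketch below is illustrative only: the helper names are invented, formulas are assumed to be flat (no parentheses), and the "variant" definition of glucose 6-phosphate is a hypothetical example of the same ID carrying a different protonation state, not a value taken from a specific BiGG model.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Parse a flat formula such as 'C6H12O6' into element counts."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def imbalance(stoichiometry, metabolites):
    """Net element and charge change of a reaction (products minus reactants).

    stoichiometry: metabolite ID -> coefficient (negative = consumed).
    metabolites: metabolite ID -> (formula, charge).
    Returns only the non-zero entries; an empty dict means balanced.
    """
    net = {}
    for met, coeff in stoichiometry.items():
        formula, charge = metabolites[met]
        for element, n in parse_formula(formula).items():
            net[element] = net.get(element, 0) + coeff * n
        net["charge"] = net.get("charge", 0) + coeff * charge
    return {key: value for key, value in net.items() if value != 0}


# Hexokinase-like reaction: glucose + ATP -> glucose 6-phosphate + ADP + H+.
reaction = {"glc__D_c": -1, "atp_c": -1, "g6p_c": 1, "adp_c": 1, "h_c": 1}
base = {
    "glc__D_c": ("C6H12O6", 0),
    "atp_c": ("C10H12N5O13P3", -4),
    "g6p_c": ("C6H11O9P", -2),
    "adp_c": ("C10H12N5O10P2", -3),
    "h_c": ("H", 1),
}
# Same metabolite ID, defined in its neutral form by another (hypothetical) model.
variant = dict(base, g6p_c=("C6H13O9P", 0))

balanced = imbalance(reaction, base)
unbalanced = imbalance(reaction, variant)
```

The same reaction is balanced with one set of definitions and off by two protons and two charge units with the other, which is the kind of inconsistency that can be introduced when contents from differently defined models are merged.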

Another limitation, depending on BiGG (32), is the limited representation of bacterial diversity, mainly in terms of genes. Indeed, in its current version (v1.6), BiGG contains as few as 108 GSMMs, of which 88 are prokaryotic, and only 22 do not belong to the Escherichia or Shigella genera. The set of BiGG genes, used by Gempipe and CarveMe (5), is therefore biased toward model species and does not cover much bacterial diversity, potentially leading to missed genes (and thus reactions) during the reference-free reconstruction phase. Gempipe tries to limit this problem by comparing the eggNOG-mapper (62) functional annotation of clusters’ representative sequences (see Supplementary Information 1.1.2 at https://doi.org/10.5281/zenodo.15544430). In future versions, the set of BiGG genes could be expanded by considering BiGG-compliant GSMMs stored in BioModels (63).

GSMMs are useful tools to capture and explore strain-level differences in metabolism. Secondary metabolite production, like antibiotic production, can in principle be described by GSMMs, as long as the underlying biosynthetic pathways can be dissected in terms of stoichiometric equations and involved enzyme complexes. However, to date, the coverage of secondary metabolism in GSMMs is generally poor, even in manually curated models. Indeed, aside from a few well-studied groups, such as polyketides and nonribosomal peptides, many biosynthetic pathways for strain-specific secondary metabolites present knowledge gaps even in large, general-purpose metabolic databases like MetaCyc (64) and KEGG (65). The representation of secondary metabolism is then even worse in GSMM reconstruction tools, especially in BiGG-based tools (see above). Therefore, the modeling of secondary metabolism still heavily relies on manual curation, which, however, may remain insufficient due to the current knowledge gaps. Another major challenge comes from the modeling framework itself. Plain FBA, which does not account for gene regulation, can have good predictive or descriptive potential only for growth-coupled metabolites under steady state. This means that classic GSMM/FBA approaches often fail to compute fluxes in secondary metabolism, as these pathways are typically tightly gene-regulated and triggered during the stationary phase, in stress conditions, or by environmental signals. Interested readers are referred to reference 66 for one of the most comprehensive and updated reviews on the topic.
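
For reference, plain FBA is usually written as the linear program below (the standard formulation, see reference 42). The symbols are introduced here for illustration: S is the stoichiometric matrix, v the vector of reaction fluxes, c the objective weights (typically selecting the biomass reaction), and v^lb, v^ub the flux bounds.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S\,v = 0 \\
& v^{\mathrm{lb}} \le v \le v^{\mathrm{ub}}
\end{aligned}
```

The formulation contains only the steady-state mass balance and the flux bounds, with no term for regulation. A pathway whose flux does not contribute to the objective can therefore carry zero flux in an optimal solution, which is why products that are not growth-coupled are poorly predicted.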

Further development of Gempipe will benefit from the introduction of new universes, such as one for yeasts (67), and from new API functions for the analysis of the deck of strain-specific GSMMs created.

In conclusion, Gempipe will facilitate metabolic biodiversity studies for a wide range of bacterial species, including those not having a dedicated pan-GSMM, which are currently the large majority.

ACKNOWLEDGMENTS

This work was supported by the European Union—NextGenerationEU, Mission 4, Component 2, Investment 1.1, under the PRIN PNRR 2022 call, CUP code B53D23024920001, project code P20229JMMH.

Contributor Information

Nicola Vitulo, Email: nicola.vitulo@univr.it.

Saheed Imam, LifeMine Therapeutics, Cambridge, Massachusetts, USA.

DATA AVAILABILITY

Gempipe can be easily installed using the dedicated conda package: https://anaconda.org/bioconda/gempipe. Its source code is freely available on GitHub: https://github.com/lazzarigioele/gempipe. Comprehensive documentation for the command-line programs and the API is available on ReadTheDocs, where ad hoc tutorials are also included: https://gempipe.readthedocs.io/en/latest/. Code to reproduce validation and case study is available on a separate GitHub repository: https://github.com/lazzarigioele/paper_gempipe. The source code of Cocoremover is available on GitHub: https://github.com/lazzarigioele/cocoremover. Supplementary information, supplementary figures, supplementary tables, and all the code used in this paper are available in Zenodo: https://doi.org/10.5281/zenodo.15544430.

REFERENCES

  • 1. Domingo-Sananes MR, McInerney JO. 2021. Mechanisms that shape microbial pangenomes. Trends Microbiol 29:493–503. doi: 10.1016/j.tim.2020.12.004
  • 2. Li W, Wu Q, Kwok L, Zhang H, Gan R, Sun Z. 2024. Population and functional genomics of lactic acid bacteria, an important group of food microorganism: current knowledge, challenges, and perspectives. Food Frontiers 5:3–23. doi: 10.1002/fft2.321
  • 3. O’Brien EJ, Monk JM, Palsson BO. 2015. Using genome-scale models to predict biological capabilities. Cell 161:971–987. doi: 10.1016/j.cell.2015.05.019
  • 4. Thiele I, Palsson BØ. 2010. A protocol for generating a high-quality genome-scale metabolic reconstruction. Nat Protoc 5:93–121. doi: 10.1038/nprot.2009.203
  • 5. Machado D, Andrejev S, Tramontano M, Patil KR. 2018. Fast automated reconstruction of genome-scale metabolic models for microbial species and communities. Nucleic Acids Res 46:7542–7553. doi: 10.1093/nar/gky537
  • 6. Henry CS, DeJongh M, Best AA, Frybarger PM, Linsay B, Stevens RL. 2010. High-throughput generation, optimization and analysis of genome-scale metabolic models. Nat Biotechnol 28:977–982. doi: 10.1038/nbt.1672
  • 7. Zimmermann J, Kaleta C, Waschina S. 2021. Gapseq: informed prediction of bacterial metabolic pathways and reconstruction of accurate metabolic models. Genome Biol 22:81. doi: 10.1186/s13059-021-02295-1
  • 8. Wang H, Marcišauskas S, Sánchez BJ, Domenzain I, Hermansson D, Agren R, Nielsen J, Kerkhoven EJ. 2018. RAVEN 2.0: a versatile toolbox for metabolic network reconstruction and a case study on Streptomyces coelicolor. PLoS Comput Biol 14:e1006541. doi: 10.1371/journal.pcbi.1006541
  • 9. Capela J, Lagoa D, Rodrigues R, Cunha E, Cruz F, Barbosa A, Bastos J, Lima D, Ferreira EC, Rocha M, Dias O. 2022. Merlin, an improved framework for the reconstruction of high-quality genome-scale metabolic models. Nucleic Acids Res 50:6052–6066. doi: 10.1093/nar/gkac459
  • 10. Mendoza SN, Olivier BG, Molenaar D, Teusink B. 2019. A systematic assessment of current genome-scale metabolic reconstruction tools. Genome Biol 20:158. doi: 10.1186/s13059-019-1769-1
  • 11. Monk JM, Charusanti P, Aziz RK, Lerman JA, Premyodhin N, Orth JD, Feist AM, Palsson BØ. 2013. Genome-scale metabolic reconstructions of multiple Escherichia coli strains highlight strain-specific adaptations to nutritional environments. Proc Natl Acad Sci USA 110:20338–20343. doi: 10.1073/pnas.1307797110
  • 12. Seif Y, Kavvas E, Lachance J-C, Yurkovich JT, Nuccio S-P, Fang X, Catoiu E, Raffatellu M, Palsson BO, Monk JM. 2018. Genome-scale metabolic reconstructions of multiple Salmonella strains reveal serovar-specific metabolic traits. Nat Commun 9:3771. doi: 10.1038/s41467-018-06112-5
  • 13. Bosi E, Monk JM, Aziz RK, Fondi M, Nizet V, Palsson BØ. 2016. Comparative genome-scale modelling of Staphylococcus aureus strains identifies strain-specific metabolic capabilities linked to pathogenicity. Proc Natl Acad Sci USA 113:E3801–E3809. doi: 10.1073/pnas.1523199113
  • 14. Nogales J, Mueller J, Gudmundsson S, Canalejo FJ, Duque E, Monk J, Feist AM, Ramos JL, Niu W, Palsson BO. 2020. High-quality genome-scale metabolic modelling of Pseudomonas putida highlights its broad metabolic capabilities. Environ Microbiol 22:255–269. doi: 10.1111/1462-2920.14843
  • 15. Norsigian CJ, Kavvas E, Seif Y, Palsson BO, Monk JM. 2018. iCN718, an updated and improved genome-scale metabolic network reconstruction of Acinetobacter baumannii AYE. Front Genet 9:121. doi: 10.3389/fgene.2018.00121
  • 16. Hawkey J, Vezina B, Monk JM, Judd LM, Harshegyi T, López-Fernández S, Rodrigues C, Brisse S, Holt KE, Wyres KL. 2022. A curated collection of Klebsiella metabolic models reveals variable substrate usage and gene essentiality. Genome Res 32:1004–1014. doi: 10.1101/gr.276289.121
  • 17. Blázquez B, San León D, Rojas A, Tortajada M, Nogales J. 2023. New insights on metabolic features of Bacillus subtilis based on multistrain genome-scale metabolic modeling. Int J Mol Sci 24:7091. doi: 10.3390/ijms24087091
  • 18. Neal M, Brakewood W, Betenbaugh M, Zengler K. 2024. Pan-genome-scale metabolic modeling of Bacillus subtilis reveals functionally distinct groups. mSystems 9:e0092324. doi: 10.1128/msystems.00923-24
  • 19. Norsigian CJ, Fang X, Seif Y, Monk JM, Palsson BO. 2020. A workflow for generating multi-strain genome-scale metabolic models of prokaryotes. Nat Protoc 15:1–14. doi: 10.1038/s41596-019-0254-3
  • 20. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10:421. doi: 10.1186/1471-2105-10-421
  • 21. Vezina B, Watts SC, Hawkey J, Cooper HB, Judd LM, Jenney AWJ, Monk JM, Holt KE, Wyres KL. 2023. Bactabolize is a tool for high-throughput generation of bacterial strain-specific metabolic models. eLife 12:RP87406. doi: 10.7554/eLife.87406
  • 22. Liao Y-C, Huang T-W, Chen F-C, Charusanti P, Hong JSJ, Chang H-Y, Tsai S-F, Palsson BO, Hsiung CA. 2011. An experimentally validated genome-scale metabolic reconstruction of Klebsiella pneumoniae MGH 78578, iYL1228. J Bacteriol 193:1710–1717. doi: 10.1128/JB.01218-10
  • 23. Cooper HB, Vezina B, Hawkey J, Passet V, López-Fernández S, Monk JM, Brisse S, Holt KE, Wyres KL. 2024. A validated pangenome-scale metabolic model for the Klebsiella pneumoniae species complex. Microb Genom 10:001206. doi: 10.1099/mgen.0.001206
  • 24. Monk JM. 2022. Genome-scale metabolic network reconstructions of diverse Escherichia strains reveal strain-specific adaptations. Philos Trans R Soc Lond B Biol Sci 377:20210236. doi: 10.1098/rstb.2021.0236
  • 25. Hyatt D, Chen G-L, Locascio PF, Land ML, Larimer FW, Hauser LJ. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119. doi: 10.1186/1471-2105-11-119
  • 26. Seemann T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068–2069. doi: 10.1093/bioinformatics/btu153
  • 27. Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. 2021. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol Biol Evol 38:4647–4654. doi: 10.1093/molbev/msab199
  • 28. Shen W, Le S, Li Y, Hu F. 2016. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS One 11:e0163962. doi: 10.1371/journal.pone.0163962
  • 29. Rajput A, Chauhan SM, Mohite OS, Hyun JC, Ardalani O, Jahn LJ, Sommer MO, Palsson BO. 2023. Pangenome analysis reveals the genetic basis for taxonomic classification of the Lactobacillaceae family. Food Microbiol 115:104334. doi: 10.1016/j.fm.2023.104334
  • 30. Fu L, Niu B, Zhu Z, Wu S, Li W. 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28:3150–3152. doi: 10.1093/bioinformatics/bts565
  • 31. Dimonaco NJ, Aubrey W, Kenobi K, Clare A, Creevey CJ. 2022. No one tool to rule them all: prokaryotic gene prediction tool annotations are highly dependent on the organism of study. Bioinformatics 38:1198–1207. doi: 10.1093/bioinformatics/btab827
  • 32. Norsigian CJ, Pusarla N, McConn JL, Yurkovich JT, Dräger A, Palsson BO, King Z. 2019. BiGG Models 2020: multi-strain genome-scale models and expansion across the phylogenetic tree. Nucleic Acids Res doi: 10.1093/nar/gkz1054
  • 33. Feist AM, Palsson BO. 2010. The biomass objective function. Curr Opin Microbiol 13:344–349. doi: 10.1016/j.mib.2010.03.003
  • 34. Courtot M, Juty N, Knüpfer C, Waltemath D, Zhukova A, Dräger A, Dumontier M, Finney A, Golebiewski M, Hastings J, et al. 2011. Controlled vocabularies and semantics in systems biology. Mol Syst Biol 7:543. doi: 10.1038/msb.2011.77
  • 35. Lieven C, Beber ME, Olivier BG, Bergmann FT, Ataman M, Babaei P, Bartell JA, Blank LM, Chauhan S, Correia K, et al. 2020. MEMOTE for standardized genome-scale metabolic model testing. Nat Biotechnol 38:272–276. doi: 10.1038/s41587-020-0446-y
  • 36. Moretti S, Tran VDT, Mehl F, Ibberson M, Pagni M. 2021. MetaNetX/MNXref: unified namespace for metabolites and biochemical reactions in the context of metabolic models. Nucleic Acids Res 49:D570–D574. doi: 10.1093/nar/gkaa992
  • 37. Ebrahim A, Lerman JA, Palsson BO, Hyduke DR. 2013. COBRApy: COnstraints-based reconstruction and analysis for python. BMC Syst Biol 7:74. doi: 10.1186/1752-0509-7-74
  • 38. Blin C, Passet V, Touchon M, Rocha EPC, Brisse S. 2017. Metabolic diversity of the emerging pathogenic lineages of Klebsiella pneumoniae. Environ Microbiol 19:1881–1898. doi: 10.1111/1462-2920.13689
  • 39. Baroukh C, Cottret L, Pires E, Peyraud R, Guidot A, Genin S. 2023. Insights into the metabolic specificities of pathogenic strains from the Ralstonia solanacearum species complex. mSystems 8:e0008323. doi: 10.1128/msystems.00083-23
  • 40. Poulsen BE, Yang R, Clatworthy AE, White T, Osmulski SJ, Li L, Penaranda C, Lander ES, Shoresh N, Hung DT. 2019. Defining the core essential genome of Pseudomonas aeruginosa. Proc Natl Acad Sci USA 116:10072–10080. doi: 10.1073/pnas.1900570116
  • 41. Rosconi F, Rudmann E, Li J, Surujon D, Anthony J, Frank M, Jones DS, Rock C, Rosch JW, Johnston CD, van Opijnen T. 2022. A bacterial pan-genome makes gene essentiality strain-dependent and evolvable. Nat Microbiol 7:1580–1592. doi: 10.1038/s41564-022-01208-7
  • 42. Orth JD, Thiele I, Palsson BØ. 2010. What is flux balance analysis? Nat Biotechnol 28:245–248. doi: 10.1038/nbt.1614
  • 43. Li F, Cheng CC, Zheng J, Liu J, Quevedo RM, Li J, Roos S, Gänzle MG, Walter J. 2021. Limosilactobacillus balticus sp. nov., Limosilactobacillus agrestis sp. nov., Limosilactobacillus albertensis sp. nov., Limosilactobacillus rudii sp. nov. and Limosilactobacillus fastidiosus sp. nov., five novel Limosilactobacillus species isolated from the vertebrate gastrointestinal tract, and proposal of six subspecies of Limosilactobacillus reuteri adapted to the gastrointestinal tract of specific vertebrate hosts. Int J Syst Evol Microbiol 71:004644. doi: 10.1099/ijsem.0.004644
  • 44. Kristjansdottir T, Bosma EF, Branco Dos Santos F, Özdemir E, Herrgård MJ, França L, Ferreira B, Nielsen AT, Gudmundsson S. 2019. A metabolic reconstruction of Lactobacillus reuteri JCM 1112 and analysis of its potential as a cell factory. Microb Cell Fact 18:186. doi: 10.1186/s12934-019-1229-3
  • 45. Lee J-Y, Han GG, Choi J, Jin G-D, Kang S-K, Chae BJ, Kim EB, Choi Y-J. 2017. Pan-genomic approaches in Lactobacillus reuteri as a porcine probiotic: investigation of host adaptation and antipathogenic activity. Microb Ecol 74:709–721. doi: 10.1007/s00248-017-0977-z
  • 46. Vezina B, Cooper HB, Rethoret-Pasty M, Brisse S, Monk JM, Holt KE, Wyres KL. 2024. A metabolic atlas of the Klebsiella pneumoniae species complex reveals lineage-specific metabolism that supports co-existence of diverse lineages. bioRxiv. doi: 10.1101/2024.07.24.605038
  • 47. Abuqwider J, Altamimi M, Mauriello G. 2022. Limosilactobacillus reuteri in health and disease. Microorganisms 10:522. doi: 10.3390/microorganisms10030522
  • 48. Yu Z, Chen J, Liu Y, Meng Q, Liu H, Yao Q, Song W, Ren X, Chen X. 2023. The role of potential probiotic strains Lactobacillus reuteri in various intestinal diseases: new roles for an old player. Front Microbiol 14:1095555. doi: 10.3389/fmicb.2023.1095555
  • 49. Mu Q, Tavella VJ, Luo XM. 2018. Role of Lactobacillus reuteri in human health and diseases. Front Microbiol 9:757. doi: 10.3389/fmicb.2018.00757
  • 50. Walter J, Britton RA, Roos S. 2011. Host-microbial symbiosis in the vertebrate gastrointestinal tract and the Lactobacillus reuteri paradigm. Proc Natl Acad Sci USA 108:4645–4652. doi: 10.1073/pnas.1000099107
  • 51. Frese SA, Benson AK, Tannock GW, Loach DM, Kim J, Zhang M, Oh PL, Heng NCK, Patil PB, Juge N, Mackenzie DA, Pearson BM, Lapidus A, Dalin E, Tice H, Goltsman E, Land M, Hauser L, Ivanova N, Kyrpides NC, Walter J. 2011. The evolution of host specialization in the vertebrate gut symbiont Lactobacillus reuteri. PLoS Genet 7:e1001314. doi: 10.1371/journal.pgen.1001314 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Wilson CM, Loach D, Lawley B, Bell T, Sims IM, O’Toole PW, Zomer A, Tannock GW. 2014. Lactobacillus reuteri 100-23 modulates urea hydrolysis in the murine stomach. Appl Environ Microbiol 80:6104–6113. doi: 10.1128/AEM.01876-14 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Bujdoš D, Walter J, O’Toole PW. 2025. Aurora: a machine learning gwas tool for analyzing microbial habitat adaptation. Genome Biol 26:66. doi: 10.1186/s13059-025-03524-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Notebaart RA, van Enckevort FHJ, Francke C, Siezen RJ, Teusink B. 2006. Accelerating the reconstruction of genome-scale metabolic networks. BMC Bioinformatics 7:296. doi: 10.1186/1471-2105-7-296
  • 55. Battjes J, Melkonian C, Mendoza SN, Haver A, Al-Nakeeb K, Koza A, Schrubbers L, Wagner M, Zeidan AA, Molenaar D, Teusink B. 2023. Ethanol-lactate transition of Lachancea thermotolerans is linked to nitrogen metabolism. Food Microbiol 110:104167. doi: 10.1016/j.fm.2022.104167
  • 56. De Bernardini N, Zampieri G, Campanaro S, Zimmermann J, Waschina S, Treu L. 2024. Pan-Draft: automated reconstruction of species-representative metabolic models from multiple genomes. Genome Biol 25:280. doi: 10.1186/s13059-024-03425-1
  • 57. King ZA, Lu J, Dräger A, Miller P, Federowicz S, Lerman JA, Ebrahim A, Palsson BO, Lewis NE. 2016. BiGG Models: a platform for integrating, standardizing and sharing genome-scale models. Nucleic Acids Res 44:D515–D522. doi: 10.1093/nar/gkv1049
  • 58. King ZA, Dräger A, Ebrahim A, Sonnenschein N, Lewis NE, Palsson BO. 2015. Escher: a web application for building, sharing, and embedding data-rich visualizations of biological pathways. PLoS Comput Biol 11:e1004321. doi: 10.1371/journal.pcbi.1004321
  • 59. Rowe E, Palsson BO, King ZA. 2018. Escher-FBA: a web application for interactive flux balance analysis. BMC Syst Biol 12:84. doi: 10.1186/s12918-018-0607-5
  • 60. Seaver SMD, Liu F, Zhang Q, Jeffryes J, Faria JP, Edirisinghe JN, Mundy M, Chia N, Noor E, Beber ME, Best AA, DeJongh M, Kimbrel JA, D’haeseleer P, McCorkle SR, Bolton JR, Pearson E, Canon S, Wood-Charlson EM, Cottingham RW, Arkin AP, Henry CS. 2021. The ModelSEED Biochemistry Database for the integration of metabolic annotations and the reconstruction, comparison and analysis of metabolic models for plants, fungi and microbes. Nucleic Acids Res 49:D575–D588. doi: 10.1093/nar/gkaa746
  • 61. Gevorgyan A, Poolman MG, Fell DA. 2008. Detection of stoichiometric inconsistencies in biomolecular models. Bioinformatics 24:2245–2251. doi: 10.1093/bioinformatics/btn425
  • 62. Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. 2021. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol Biol Evol 38:5825–5829. doi: 10.1093/molbev/msab293
  • 63. Malik-Sheriff RS, Glont M, Nguyen TVN, Tiwari K, Roberts MG, Xavier A, Vu MT, Men J, Maire M, Kananathan S, Fairbanks EL, Meyer JP, Arankalle C, Varusai TM, Knight-Schrijver V, Li L, Dueñas-Roca C, Dass G, Keating SM, Park YM, Buso N, Rodriguez N, Hucka M, Hermjakob H. 2019. BioModels—15 years of sharing computational models in life science. Nucleic Acids Res doi: 10.1093/nar/gkz1055
  • 64. Caspi R, Altman T, Billington R, Dreher K, Foerster H, Fulcher CA, Holland TA, Keseler IM, Kothari A, Kubo A, Krummenacker M, Latendresse M, Mueller LA, Ong Q, Paley S, Subhraveti P, Weaver DS, Weerasinghe D, Zhang P, Karp PD. 2014. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res 42:D459–D471. doi: 10.1093/nar/gkt1103
  • 65. Kanehisa M, Goto S. 2000. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28:27–30. doi: 10.1093/nar/28.1.27
  • 66. Qiu S, Yang A, Zeng H. 2023. Flux balance analysis-based metabolic modeling of microbial secondary metabolism: current status and outlook. PLoS Comput Biol 19:e1011391. doi: 10.1371/journal.pcbi.1011391
  • 67. Lu H, Kerkhoven EJ, Nielsen J. 2022. A pan-draft metabolic model reflects evolutionary diversity across 332 yeast species. Biomolecules 12:1632. doi: 10.3390/biom12111632

Data Availability Statement

Gempipe can be installed from its dedicated conda package: https://anaconda.org/bioconda/gempipe. Its source code is freely available on GitHub: https://github.com/lazzarigioele/gempipe. Comprehensive documentation for the command-line programs and the API, including ad hoc tutorials, is available on ReadTheDocs: https://gempipe.readthedocs.io/en/latest/. Code to reproduce the validation and the case study is available in a separate GitHub repository: https://github.com/lazzarigioele/paper_gempipe. The source code of Cocoremover is available on GitHub: https://github.com/lazzarigioele/cocoremover. Supplementary information, supplementary figures, supplementary tables, and all the code used in this paper are available on Zenodo: https://doi.org/10.5281/zenodo.15544430.


Articles from mSystems are provided here courtesy of American Society for Microbiology (ASM)