Abstract
Motivation
The importance and rate of development of genome-scale metabolic models have been growing for the last few years, increasing the demand for software solutions that automate several steps of this process. However, since TRIAGE’s release, software development for the automatic integration of transport reactions into models has stalled.
Results
Here, we present the Transport Systems Tracker (TranSyT). Unlike other transport systems annotation software, TranSyT does not rely on manual curation to expand its internal database, which is derived from highly curated records retrieved from the Transporters Classification Database and complemented with information from other data sources. TranSyT compiles information regarding transporter families and proteins, and derives reactions into its internal database, making it available for rapid annotation of complete genomes. All transport reactions have GPR associations and can be exported with identifiers from four different metabolite databases. TranSyT is currently available as a plugin for merlin v4.0 and an app for KBase.
Availability and implementation
TranSyT web service: https://transyt.bio.di.uminho.pt/; GitHub for the tool: https://github.com/BioSystemsUM/transyt; GitHub with examples and instructions to run TranSyT: https://github.com/ecunha1996/transyt_paper.
1 Introduction
Transport of substances across biological membranes is essential for cell maintenance and growth, allowing the uptake of nutrients (McCracken and Edinger 2013), secretion of toxic compounds and by-products (Axe and Bailey 1995, Kwon and Yun 2014), intercellular communication (Record et al. 2014), maintenance of cellular homeostasis (Park et al. 2004, Rink and Haase 2007), and distribution of metabolites across different intracellular organelles (Versaw and Garcia 2017). Transport engineering has been applied in industrial biotechnology to improve the production efficiency of desired compounds by inducing their extracellular accumulation. The most common strategies include blocking the export of intermediate metabolites (Li et al. 2019), enhancing the transport of the target compound (Doshi et al. 2013), or altering the intracellular transport of metabolites (Cardenas and Da Silva 2016). These engineering strategies are based on information regarding the transporter function and specificity, which are usually available only for model organisms, impairing the utilization of transport engineering for less-characterized organisms.
In the last decades, the number of sequenced genomes has grown exponentially. Several publicly available Bioinformatics tools for reconstructing high-quality Genome-Scale Metabolic (GSM) models were developed or improved to keep up with this progress. However, there is limited availability of reliable tools for automatic annotation of transporter systems (Hamilton and Reed 2014). Databases like the Transporter Classification Database (TCDB) (Saier et al. 2006), TransportDB 2.0 (Elbourne et al. 2017), ARAMEMNON (Schwacke et al. 2003), YTPdb (Brohée et al. 2010), and ABCdb (Quentin and Fichant 2000) contain data regarding transport proteins. TCDB was used as the main source of information by the Transport Reactions Annotation and Generation (TRIAGE) tool (Dias et al. 2017) due to its high level of curation. TCDB contains structural, functional, mechanistic, evolutionary, and medical information about transport systems. Its authors also proposed the Transport Classification (TC) system, the only classification adopted by the International Union of Biochemistry and Molecular Biology (IUBMB) for transporters until August 2018, when the Enzyme Commission (EC) class 7, Translocases, was added by the organization.
Various software approaches have been developed to address the challenges associated with the annotation of transporters. Some tools address only the functional annotation of transporters, such as Rapid Annotation using Subsystem Technology (RAST) (Aziz et al. 2008) and the Prokaryotic Genome Annotation Pipeline (PGAP), used to annotate prokaryotic genomes for RefSeq (Li et al. 2021).
Other tools also systematically identify the substrates of transport reactions. For instance, the Transporter Automatic Annotation Pipeline (TransAAP) (Elbourne et al. 2023) uses homology searches against curated databases, such as TCDB, as well as domain searches with Pfam and TIGRFAMs. This tool is used to populate the widely used TransportDB 2.0 database. The Transporter Substrate Specificity Prediction Server (TrSSP) (Mishra et al. 2014) and TranCEP (Alballa et al. 2020) use Support Vector Machine (SVM) models and UniProtKB/Swiss-Prot (Bateman et al. 2021) data in their predictions. Another approach was taken by the Transporters via Annotation Transfer by Homology (TransATH) software (Aplop and Butler 2017). This tool employs an automated version of TCDB’s protocol for annotations. It derives a general representation of the substrates that each system should transport by preprocessing the available substrates in TCDB and grouping them accordingly. Moreover, TransATH leverages manual annotations from TRIAGE’s curated database to enhance its annotations.
In the context of the automatic assembly of transport reactions, Lee et al. (2008) developed the Transport Inference Parser (TIP), a method that generates transport reactions based on an organism’s genome annotation through textual analysis techniques. It analyses the name and function of each protein to infer the transport reaction that the protein promotes.
Regarding the integration of transport information in GSM models, tools like ModelSEED (Henry et al. 2010) and Pathway Tools (Karp 2001), predict transporters based on RAST functional annotations to develop models and add spontaneous reactions to fill in pathways when necessary. Other tools, such as Reconstruction, Analysis and Visualization of Metabolic Networks (RAVEN) (Agren et al. 2013) or SuBliMinaL (Swainston et al. 2011), generate transport reactions based on information retrieved from databases. RAVEN automatically retrieves data from the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al. 2004) and the Braunschweig Enzyme Database (BRENDA) (Chang et al. 2021) or existing models of closely related organisms, based on the assumption that related organisms share metabolic capabilities. In addition, RAVEN allows performing manual integration of transport reactions and filling gaps using KEGG Orthology (KO) identifiers to guarantee a functional network. SuBliMinaL optionally provides a default set of transporters retrieved from the Biochemical, Genetic and Genomic (BiGG) (King et al. 2016) knowledgebase, not relying on genomic information. Finally, previous versions of merlin (Capela et al. 2022) used a dedicated tool for this purpose, TRIAGE, which based its annotations on a compilation of manually curated data retrieved from TCDB and predicted transmembrane α-helices.
In this study, we introduce TranSyT, a novel framework for annotating transport proteins and generating their corresponding transport reactions, with a specific emphasis on their integration into GSM models.
2 Materials and methods
TranSyT is a Java™ software composed of two main modules. The first is responsible for regularly gathering, processing, and saving information from different sources. The second module is designed for end-users: it uses the information collected and processed by the first module to identify genes that encode transport proteins in the shortest possible time.
The tool can be accessed via webservice (https://transyt.bio.di.uminho.pt/), or through a Java project (see https://github.com/ecunha1996/transyt_paper for instructions). The tool is also available as a plugin in merlin v4, and a beta app in KBase (Arkin et al. 2018). TranSyT’s workflow, described below, is represented in Fig. 1.
Figure 1.
Representation of TranSyT’s workflow. Data retrieved from TCDB, complemented by information available at ModelSEED, MetaCyc, KEGG, and BiGG, is used to associate TC numbers with transport reactions and the respective substrates. This operation is repeated each month, and the information is stored in a Neo4j database. The genome uploaded by the user is used to run a BLAST search, whose results are used to select TC families. Then, two methods are applied to select transport reactions, which are filtered by the metabolites present in the model or in the uploaded metabolite list. At the end, genes identified as transport encoding are annotated with a TC number and a transport reaction. Reactions with complete GPR rules are saved in SBML format.
2.1 Data retrieval and transport reactions assembly
TranSyT uses TCDB as the primary source of information. By actively retrieving information on transporter systems from TCDB, the tool ensures that it stays up to date with the latest data available. Simultaneously, it uses MetaCyc (Caspi et al. 2020) and KEGG to complete information regarding the entries found in TCDB. Descriptions of the family and TC numbers are also retrieved, as often the family reactions include several types of transport.
After gathering all the information, TranSyT replaces the compounds described in a generic transport reaction with the substrates mentioned in the corresponding transport systems entry. Additionally, TranSyT seeks evidence of direction (in/out) and reversibility in the TC number entry, subfamily, and family descriptions to write the reactions’ equations.
Although TCDB provides the names of the compounds participating in the transport reactions and recently added the “Substrate Search Tool,” at the time of TranSyT’s implementation it still did not provide cross-references for the substrates described on the page of each TC number. As stated by Thiele and Palsson (2010) in step 14 of the protocol for reconstructing GSM models, the identification of the compounds used in each reaction is of paramount importance, as unidentified compounds make it difficult to use these models across different tools and platforms. Thus, to avoid impairing the integration of TranSyT’s output with other software, the compounds are identified when generating the transport reactions database. BioSynth (Liu 2018), an open-source Bioinformatics software, is used to accomplish this task. TranSyT uses automatic text processing methods to cross BioSynth’s data with substrates’ information retrieved from TCDB to find matches. Each metabolite must follow the process described in Supplementary Fig. SA1 of Supplementary Data S1 to avoid conflicts in the matching process and find matching entries without direct hits. Finally, the databases’ identifiers and chemical formulas for ModelSEED, KEGG, MetaCyc, and BiGG are assigned. Reactions retrieved from MetaCyc already contain information regarding compounds’ identifiers, thus being excluded from this process.
Besides compounds, BioSynth identifies the hierarchical ontologies from MetaCyc, which are used to generate reactions for all hierarchical descendants of the metabolites retrieved from TCDB. For instance, if a TC number describes “sugar” as the transported compound, the identification of all sugars available at MetaCyc is performed, and a transport reaction is generated for each one. However, compounds such as the class “lipid” encompass thousands of descendants. When flagged in TranSyT’s configurations, the depth at which these compounds’ descendants are retrieved can be regulated.
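The depth-regulated retrieval of descendants can be illustrated with a minimal sketch. The toy ontology, the function name `descendants`, and the `max_depth` parameter are assumptions made for illustration only; they are not TranSyT's implementation nor MetaCyc's actual hierarchy.

```python
from collections import deque

# Hypothetical toy ontology: parent class -> direct children.
# The entries are illustrative and are not taken from MetaCyc.
ONTOLOGY = {
    "sugar": ["hexose", "pentose"],
    "hexose": ["glucose", "fructose"],
    "pentose": ["ribose"],
}


def descendants(ontology, root, max_depth=None):
    """Return every descendant of root, optionally limited to max_depth levels."""
    found = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand below the configured depth
        for child in ontology.get(node, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append((child, depth + 1))
    return found


# Unrestricted expansion versus expansion limited to direct children.
all_sugars = descendants(ONTOLOGY, "sugar")
shallow = descendants(ONTOLOGY, "sugar", max_depth=1)
```

In this sketch one transport reaction would be generated per returned compound, so limiting the depth directly bounds the number of reactions created for broad classes.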
TranSyT also provides a filter that accepts only substrates common to the ontologies of both the TC family and the TC number, to avoid annotating proteins with false positives. The filter crosses the hierarchical descendants of the TC family’s substrates with those of the TC number’s substrates. When a match is found, the substrate is saved, and the respective reaction is generated. The importance of this process can be shown using TC number 2.A.40.1.3 (UniProt accession P75892). The compound assigned to the TC number entry is “Pyrimidine,” and the generic TCDB family reaction describes the transport of a “Nucleobase.” Overriding the filter, TranSyT generates a total of 263 reactions, as it would create a reaction for all 263 metabolites identified by MetaCyc as descendants of a “Pyrimidine.” However, when the filter is applied, this number drops to three reactions. Although the number of descendants of “a Nucleobase” is nine in MetaCyc, both ontologies only have three shared entries: cytosine, uracil, and thymine. A quick validation using UniProt and EcoCyc records P75892 and G6517, respectively, confirms that this protein is a transporter with a high affinity for uracil and thymine. However, according to UniProt, this protein does not transport cytosine and seems to have a low affinity for xanthine (a purine), a compound not assigned by TCDB. Cytosine therefore appears to be a false positive that passed the ontologies filter. Nevertheless, the reaction describing xanthine transport can be retrieved from MetaCyc, overcoming TCDB’s misannotation.
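The ontology filter amounts to a set intersection, as the following sketch suggests. The function name and the sets are hypothetical; only the three shared compounds come from the example above, and the remaining entries are placeholders rather than MetaCyc records.

```python
def shared_substrates(family_descendants, tc_number_descendants):
    """Keep only substrates present in both ontologies (sorted for stable output)."""
    return sorted(set(family_descendants) & set(tc_number_descendants))


# Toy subsets: the shared names come from the text, the others are placeholders.
nucleobase_descendants = {
    "cytosine", "uracil", "thymine", "placeholder_base_1", "placeholder_base_2",
}
pyrimidine_descendants = {
    "cytosine", "uracil", "thymine", "placeholder_pyrimidine_1",
}

kept = shared_substrates(nucleobase_descendants, pyrimidine_descendants)
```

Under these assumptions, a reaction would be generated only for the compounds in `kept`, instead of one per descendant of the TC number's substrate.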
During the reactions’ generation process, TranSyT searches for evidence of reliability in TCDB’s data, such as descriptions of transport type in the protein’s description and the presence of generic TC family reactions. If these criteria are met, MetaCyc’s data is used when in agreement with TCDB, as several reactions from MetaCyc are automatically assigned without human validation. Otherwise, transport reactions for the compounds retrieved from TCDB are generated according to MetaCyc.
All reactions are tested for mass balance, and only balanced transport reactions are accepted as correct. However, in known cases, the reactions require balance correction due to the origin of the annotation. For instance, ATP-binding cassette (ABC) transporters with metabolites formulae retrieved from MetaCyc, BiGG, and ModelSEED will lack a proton. TranSyT generates reactions based on evidence of the TC family reaction equation and the description of the TC number, subfamily, family, and superfamily. When no evidence is found, the most straightforward mechanism is assumed (reversible uniport). If symport or antiport evidence is found, but no default reaction is available, a co-transport reaction of the metabolites described in the system together with a proton is generated.
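A mass balance test of this kind can be sketched as follows. This is an illustrative check over elemental formulas, not TranSyT's code; the formulas assume one particular protonation state and simple formulas without parentheses or charges, and all names are hypothetical.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no parentheses)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def is_mass_balanced(reactants, products):
    """Each side is a list of (formula, stoichiometric coefficient) pairs."""
    def total(side):
        atoms = Counter()
        for formula, coefficient in side:
            for element, n in parse_formula(formula).items():
                atoms[element] += coefficient * n
        return atoms
    return total(reactants) == total(products)


# Illustrative formulas (assumed protonation states).
ATP, ADP, PI, WATER, PROTON = "C10H12N5O13P3", "C10H12N5O10P2", "HO4P", "H2O", "H"
GLUCOSE = "C6H12O6"

# ABC-type transport: the transported glucose appears on both sides.
without_proton = is_mass_balanced(
    [(ATP, 1), (WATER, 1), (GLUCOSE, 1)], [(ADP, 1), (PI, 1), (GLUCOSE, 1)]
)
with_proton = is_mass_balanced(
    [(ATP, 1), (WATER, 1), (GLUCOSE, 1)],
    [(ADP, 1), (PI, 1), (PROTON, 1), (GLUCOSE, 1)],
)
```

With these formulas the ATP hydrolysis half of the reaction is one hydrogen short until a proton is added to the products, which mirrors the balance correction described above.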
After completing the reactions’ generation and respective mass balance validation, all approved reactions are assigned a persistent identifier generated by TranSyT. The reaction identifiers were designed to be intuitive, i.e. the reaction equation’s content is associated with the identifier, whose structure follows a pattern thoroughly described on page 2 in Supplementary Data S1. Not all reactions can be intuitively described by the identifier, like a symport of more than two compounds. In such cases, a sequential number is assigned. At this point, cross-references to ModelSEED reactions’ identifiers are also sought. TranSyT’s unique identifiers are currently registered under the central registry for life science data Identifiers.org (https://registry.identifiers.org/registry/transyt).
When this process is complete, all information regarding the generated reactions is stored in TranSyT’s Neo4j graph database. During the upload process, the organism’s taxonomic identifier is retrieved from the accession number, using NCBI and UniProt APIs, due to the relevance of the phylogenetic information inherent to each transporter system.
2.2 Identification of genes encoding transport systems
TranSyT’s second module uses a genome and the respective taxonomic identifier as mandatory input to start the identification process. The taxonomic identifier allows comparing the taxonomy of the organism encoding the reference proteins with the organism submitted by the user. According to Barghash and Helms (2013), it is more accurate to classify membrane transporters according to TC families than according to substrate families. Also, according to the same study, two membrane transporters with a BLAST expected value below a threshold of 1e−8 are likely to share the same TC family, though a threshold of 1e−4 could also be considered with caution. Hence, TranSyT uses BLAST to identify genes encoding transporter systems. The sequences uploaded by the user are aligned against a database built from the latest TCDB version available. In addition, amino acid sequences of proteins involved in phosphotransferase system (PTS) reactions, such as KEGG KOs K23993 and K02784, are added to this file to improve the construction of such reactions’ Gene-Protein-Reaction (GPR) associations. TranSyT’s default run configurations can be found in Supplementary Table SA1 in Supplementary Data S1.
After completing the alignments, the next step is to determine which TC family should be assigned to each reference gene by calculating family scores. Such score is defined by equation 2 in Dias et al. (2017) by constructing a rationale that considers the frequency of hits related to a TC family with the similarity score of such hits. After assigning the family with the highest score to each reference gene, the following step is the association of transport reactions to the genes identified as encoding transport proteins. TranSyT uses two different methods to accomplish this task.
The first method accepts all reactions that fulfill the following conditions:
Reactions associated with a TCDB entry must belong to the annotated TC family.
A TCDB entry hit must have an expected value lower than or equal to the automatic acceptance threshold (0 by default), or belong to the top cluster (10% by default) of the best alignments with an expected value between the acceptance threshold and a second, less stringent threshold [1e−50 by default, which is extremely conservative according to Barghash and Helms (2013)]. These parameters can be changed by the user if necessary.
This method does not consider the hits’ taxonomy, given the evidence of high similarity with the query sequence.
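The acceptance conditions of the first method can be sketched as below. How the "top cluster" is delimited is not specified here, so the sketch assumes that candidates are ranked by expected value and that the best fraction (rounded up) is kept; the function name, parameter names, and entry identifiers are hypothetical.

```python
import math


def accept_hits(hits, accept_evalue=0.0, limit_evalue=1e-50, top_fraction=0.10):
    """hits: list of (tcdb_entry, evalue) pairs. Returns accepted entry names."""
    # Hits at or below the automatic acceptance threshold are always accepted.
    accepted = [hit for hit in hits if hit[1] <= accept_evalue]
    # Remaining candidates must still be within the less stringent limit.
    candidates = sorted(
        (hit for hit in hits if accept_evalue < hit[1] <= limit_evalue),
        key=lambda hit: hit[1],
    )
    # Assumption: the top cluster is the best-ranked fraction, rounded up.
    n_top = math.ceil(top_fraction * len(candidates))
    accepted += candidates[:n_top]
    return [entry for entry, _ in accepted]


blast_hits = [
    ("entry_A", 0.0),
    ("entry_B", 1e-120),
    ("entry_C", 1e-80),
    ("entry_D", 1e-60),
    ("entry_E", 1e-20),  # above the limit threshold, never considered
]
selected = accept_hits(blast_hits)
```

Only reactions of the selected entries that also belong to the annotated TC family would then be kept, following the first condition above.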
The second method follows an approach similar to TRIAGE’s main algorithm for assigning reactions to transport candidate genes. The goal of this method is to find reactions not found by the first method. This algorithm searches high-frequency reactions among all BLAST hits taxonomically related to the study organism. Reactions with a score above the defined threshold (0.75 by default) are associated with the protein-encoding gene. Detailed descriptions of these calculations and TranSyT’s configurations’ default values can be found on pages 5 and 6 in Supplementary Data S1.
One of the main features of TranSyT is the GPR associations’ algorithm. This software can search protein complexes formed by multiple subunits. However, the process to find subunits encoded by different genes is not straightforward. TranSyT takes advantage of information already available in TCDB regarding protein complexes and the BLAST results to perform this task. Thus, the first step is finding the genes associated with each complex’s subunit, along with the respective bit score. An example of the process is described on page 7 of Supplementary Data S1.
Using this data, the association is direct in cases where the TC number is related to an isoenzyme or a promiscuous enzyme, as the reaction is associated with a single protein encoded by one gene. For an enzyme complex, the algorithm assigns the hit with the highest bit score to each subunit. The expected value is used as a tiebreaker for multiple hits with the same bit score. When a match is found, the assigned gene is removed from the other subunits’ results. If, during the process, the number of available genes becomes lower than the number of subunits, the associated gene is removed from the GPR, and the next highest-scoring gene is processed. This methodology is recursively applied until all subunits are associated with a different gene. If no solution is found, the reaction is removed, as there is no evidence of all subunits in the genome. The results are exported in the Systems Biology Markup Language (SBML) standard format, together with text files describing the TC family’s annotations.
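The subunit assignment described above can be approximated with a small backtracking search. This is a simplified sketch under stated assumptions, not TranSyT's algorithm: the data layout, the function name, and the heuristic of processing subunits with fewer candidates first are illustrative choices.

```python
def assign_subunits(candidates):
    """candidates: dict subunit -> list of (gene, bit_score, evalue).

    Returns dict subunit -> gene with each gene used once, or None when
    not every subunit can be matched to a different gene.
    """
    # Assumption: handle subunits with the fewest candidate genes first.
    subunits = sorted(candidates, key=lambda s: len(candidates[s]))

    def solve(index, used):
        if index == len(subunits):
            return {}
        subunit = subunits[index]
        # Highest bit score first; lower expected value breaks ties.
        ranked = sorted(candidates[subunit], key=lambda c: (-c[1], c[2]))
        for gene, _, _ in ranked:
            if gene in used:
                continue
            rest = solve(index + 1, used | {gene})
            if rest is not None:
                rest[subunit] = gene
                return rest
        return None  # backtrack: no free gene left for this subunit

    return solve(0, frozenset())


complex_hits = {
    "subunit_A": [("gene1", 300.0, 1e-90), ("gene2", 250.0, 1e-70)],
    "subunit_B": [("gene1", 280.0, 1e-85)],
}
gpr = assign_subunits(complex_hits)

incomplete_hits = {
    "subunit_A": [("gene1", 300.0, 1e-90)],
    "subunit_B": [("gene1", 280.0, 1e-85)],
}
no_gpr = assign_subunits(incomplete_hits)
```

In the first example gene1 is the best hit for both subunits, so subunit_A falls back to gene2; in the second there is no second gene, and the reaction would be discarded.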
All variables present in TranSyT’s calculations are parameterizable through configuration files, and the user can override input settings. When the process is finished, the results provide a simple annotation of each gene’s proteins and reactions. Otherwise, when integrating the output with a GSM model, TranSyT can use the list of compounds present in the model to filter the reactions that would not have flux.
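The metabolites filter can be pictured as a subset test, as in this sketch. It assumes that a reaction is kept only when all of its compounds occur in the model and that compounds are compared by identifier regardless of compartment; the identifiers and names are placeholders, not database entries.

```python
def filter_reactions(reactions, model_compounds):
    """Keep reactions whose compounds are all present in the model."""
    allowed = set(model_compounds)
    return {
        reaction_id: compounds
        for reaction_id, compounds in reactions.items()
        if set(compounds) <= allowed
    }


# Placeholder identifiers, used only for illustration.
generated = {
    "rxn_demo_1": ["cpd_A", "cpd_H"],
    "rxn_demo_2": ["cpd_B", "cpd_H"],
}
model_compounds = ["cpd_A", "cpd_H", "cpd_C"]

usable = filter_reactions(generated, model_compounds)
```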
2.3 TranSyT performance assessment
Four different case studies were used to assess TranSyT’s performance, in terms of functional annotation and generation of transport reactions. Despite providing support for four different databases, ModelSEED is used as the cornerstone of the internal database as the software is integrated into KBase. Thus, the results were generated using ModelSEED identifiers, unless otherwise indicated.
The first case study uses Biolog experimental data retrieved from 103 different genomes. The data specifies substrate utilization by each organism, indicating the existence or absence of transport mechanisms.
The second case study analyses RAST’s annotations for 1671 genomes, comparing, when possible, the type of transport and a sample of compounds between both annotations.
In the third case, annotations provided by TranSyT and TransAAP were compared. The analysis was performed for three organisms: Pseudomonas aeruginosa, Saccharomyces cerevisiae, and Ostreococcus lucimarinus. These organisms belong to three different kingdoms, allowing evaluating how the two tools perform for different taxonomic groups. The protein coding sequences of each organism were retrieved from NCBI Assembly database with the following accession numbers GCF_000006765.1 (P.aeruginosa), GCF_000146045.2 (S.cerevisiae), and GCF_000092065.1 (O.lucimarinus). TranSyT was executed with default parameters (see Supplementary Table SA1 of Supplementary Data S1 to see TranSyT’s default configuration) and with a set of less restrictive parameters (Supplementary Table SA2 of Supplementary Data S1). Then, S.cerevisiae’s genes identified as transport encoding by TranSyT but not by TransAAP were automatically analyzed at Swiss-Prot. For each entry, the Gene Ontology terms, and the keywords were screened. If the words “transport,” “import,” or “export” were present, or if a TCDB identifier was available, the gene/protein was assumed to be associated with a transport mechanism.
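The screening rule applied to each Swiss-Prot entry can be sketched as follows. The record layout and names are hypothetical, and the sketch uses plain substring matching, which would also match unrelated words containing these strings (for example “important”); whether the original screening guarded against this is not stated.

```python
TRANSPORT_WORDS = ("transport", "import", "export")


def is_transport_associated(go_terms, keywords, tcdb_ids):
    """Return True if a TCDB identifier exists or a transport word is found."""
    if tcdb_ids:
        return True
    annotations = [text.lower() for text in list(go_terms) + list(keywords)]
    return any(word in text for text in annotations for word in TRANSPORT_WORDS)


# Hypothetical records: (Gene Ontology terms, keywords, TCDB identifiers).
records = {
    "protein_1": (["transmembrane transport"], ["Membrane"], []),
    "protein_2": (["DNA repair"], ["Nucleus"], []),
    "protein_3": (["metabolic process"], [], ["tcdb_placeholder"]),
}
flagged = {name: is_transport_associated(*fields) for name, fields in records.items()}
```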
The fourth case study thoroughly compares the transport reactions present in the iML1515 (Monk et al. 2017) model of Escherichia coli str. K-12 substr. MG1655 with TranSyT’s annotations for the same genome.
3 Results and discussion
3.1 Growth phenotype data
Biolog provides experimental data regarding the substrates used by an organism, indicating that the organism ought to be endowed with mechanisms that transport such compounds across the external membrane. The report is composed of 103 different organisms with growth/no-growth indications for 67 (64 + 3 particular cases) different substrates. The rationale for using these cases is detailed in Supplementary Data S2.
TranSyT was executed using different sets of parameters for both methods of identification (run independently). Method one was executed using default parameters, only changing the percentage (p) of best BLAST results to accept: 0%, 5%, 10%, and 20%. The second identification method also used default parameters, except for the α value, which was set to 0.5 and 0.75.
Table 1 contains a summary of the average results for each method independently and merged with the other approach for each set of parameters. The combined results present a negligible increase in the assignment of reactions to growth compounds when M1p = 0% and M2α = 0.5. However, when increasing M2α to 0.75, the rise is >7% when M1p = 0%, though less than 1% when M1p > 0%. A careful analysis of the results shows that as expected, the largest rise occurs for genomes with smaller representation in TCDB. The most extreme case is Marinobacter adhaerens HP15, with an increase of 20%. This organism has one entry in TCDB, and its genus is not well represented, with 15 entries only. When M1p > 0% and M2α = 0.5, the same increase is not verified, with very few organisms having an increase in the assignment of reactions to growth compounds.
Table 1.
Average growth/no-growth compounds found.a
Method/parameters | Growth (independent) | Growth (merged with M2α = 0.5) | Growth (merged with M2α = 0.75) | No-growth |
---|---|---|---|---|
M1p = 0% | 63.06% | 63.19% | 70.34% | 61.71% |
M1p = 5% | 78.60% | 78.60% | 79.16% | 44.05% |
M1p = 10% | 78.52% | 78.52% | 79.08% | 43.58% |
M1p = 20% | 79.21% | 79.21% | 79.77% | 42.76% |
M2α = 0.5 | 10.00% | | | 91.24% |
M2α = 0.75 | 43.27% | | | 68.62% |
a Growth substrates contain at least one reaction. No-growth substrates had no reactions assigned.
As expected, increasing M1p and M2α reduces the number of True Negatives, i.e. nongrowth substrates without transport reactions assigned by TranSyT. Although these compounds are not associated with the organism’s growth, this does not indicate that they are not transported across the membrane to participate in other metabolic functions. Thus, M1p = 10% and M2α = 0.75 were selected for the next test cases, as they are neither over- nor under-conservative. Additional information about the results provided here can be found in Supplementary Data S2.
3.2 Analysis of a large set of phylogenetic diverse genomes
The second test case compares TranSyT with RAST’s and RefSeq’s annotations for 1671 genomes to validate TranSyT’s approach for prokaryotes. By collecting the annotations for the genes assigned as transporter encoding by TranSyT, it was possible to compare the transported compounds and the respective mechanisms. Only genes annotated as transporters by TranSyT were retrieved, regardless of other transporters that might be annotated by RAST and RefSeq, as we aim to determine whether the results of TranSyT’s automatic approach are similar to RAST’s and RefSeq’s annotations. A more detailed and exhaustive comparison of TranSyT and TransAAP, including potential missing annotations, is shown below.
A dataset of 216 890 genes was collected for analysis. By searching keywords associated with transport types in the annotation description, the following categories were assigned to the transport proteins: Simple, ABC, PTS, Cofactor, and oxidation-reduction (Redox). Simple is the most generic category, as it encompasses the simplest transport types: Uniport, Symport, and Antiport. These three were grouped in the same category as it was challenging to differentiate transporters characterized as “Major Facilitator Superfamily” or “porin.” Nevertheless, it was not possible to infer the transport type for around 27% of RAST’s annotations and 35% of RefSeq’s annotations, as it was not possible to automatically assume a transport type (e.g. Shikimate transporter). Moreover, regarding RefSeq’s dataset, almost 10% of the entries were not annotated. The filtered datasets included 150 949 and 134 116 transporters for RAST and RefSeq, respectively.
According to Fig. 2, all categories match over 92% of RAST and 89% of RefSeq’s annotations. One of the cases analyzed was regarding the gene STERM_RS12595, in which RAST’s annotation is ABC transporter. In this case, TranSyT’s BLAST results assigned the gene to TC family 9.B.20. Entries assigned in TC class 9 are usually incompletely characterized and in many instances lacking a TC family generic reaction, causing TranSyT to flag TCDB’s data as unreliable for such cases and use MetaCyc instead. In this specific case, MetaCyc’s reaction was a Uniport created without human supervision. However, the compound in transport agreed with RAST’s annotation. For the same gene, RefSeq’s annotation had no reference to the transport type in its annotation and was therefore not included in the dataset.
Figure 2.
Comparison of TranSyT’s genes transport type annotation against RAST and RefSeq filtered datasets. The bars represent the percentage of genes annotated by TranSyT with identical annotation on either RAST or RefSeq
A similar approach was used to compare the compounds in the filtered dataset (Fig. 3). However, in this case, only RAST annotations were used as the annotations are generally more descriptive, and all genes had an annotation. Cytosine, sulphate, acetate, putrescine, copper, nitrate, and cellulose were selected for the study.
Figure 3.
Comparison of genes associated with the transport of seven randomly selected metabolites between TranSyT and RAST. These metabolites were selected from the shared pool of metabolites obtained with the previous assessments. The bars represent the percentage of genes annotated by TranSyT with identical annotation on RAST for each compound.
As shown in Fig. 3, cytosine, copper, and nitrate had the lowest match rates, whereas the remaining compounds scored above 86% in agreement with RAST’s annotation. A more detailed analysis of TranSyT’s annotations for cytosine shows that 24.6% of the cases annotated as transporting this compound had the description “Cytosine/purine/uracil/thiamine/allantoin permease family protein” assigned by RAST and TC number 2.A.39.3.4 by TranSyT. For this TC number, TCDB only describes allantoin as the transported compound. MetaCyc’s reaction is also in agreement with TCDB, describing the homolog as “allantoin permease.”
A second case was analyzed for cytosine in which RAST had the same annotation as before. However, TranSyT assigned the TC number 2.A.39.3.14, for adenine, guanine, uracil, and allantoin transport, though in this case, MetaCyc provided no information. Nevertheless, as TCDB provides cross-references to literature sources, it was possible to find in Schein et al. (2013) that this protein cannot transport cytosine and uridine. This situation was observed in 6.6% of genes.
Regarding copper annotations, in 15% of the cases, RAST assigned the genes as “Copper-translocating P-type ATPase (EC 3.6.3.4); Lead, cadmium, zinc, and mercury transporting ATPase (EC 3.6.3.3) (EC 3.6.3.5).” With the advent of EC number class 7, the EC numbers were transferred to EC 7.2.2.9, EC 7.2.2.21, and EC 7.2.2.12, respectively. Running a BLAST against such genes and combining the results with UniProt data, it was possible to determine that the TC numbers with the best results in the alignment are associated with EC numbers 7.2.2.21 and 7.2.2.12. These TC numbers do not contain copper in their descriptions. A relaxation of the parameters would eventually allow TranSyT to include TC numbers with lower similarity related to EC number 7.2.2.9, and consequently, copper reactions.
A third case was analyzed for nitrate, where it was found that for 10% of the entries, RAST annotated these genes as “Nitrate/nitrite transporter.” When running a BLAST of the same genes on NCBI, it was possible to confirm TranSyT’s annotations of “putative tartrate transporter” genes.
3.3 Comparative analysis of TranSyT and TransAAP
Genes identified as transporter-encoding by TranSyT and TransAAP and the respective annotations were compared. The analysis was performed for three organisms: P.aeruginosa, S.cerevisiae, and O.lucimarinus, with two different TranSyT’s configurations. Figure 4 shows Venn diagrams identifying the overlap of genes annotated as transporters by both tools.
Figure 4.
Venn diagrams with genes identified as transporter-encoding by TransAAP and TranSyT, using TranSyT’s default parameters (A) and relaxed parameters (B)
Using default parameters, TranSyT predicted around 55%, 33%, and 77% of the genes identified as transporter encoding by TransAAP for P.aeruginosa, O.lucimarinus, and S.cerevisiae, respectively. When the TranSyT’s parameters are less restrictive, such values increased to 76%, 66%, and 84%, respectively. In this case, the number of genes identified becomes higher than the TransAAP ones for all organisms.
Genes from S.cerevisiae identified by TranSyT but not by TransAAP were further analyzed to validate the transporter classification by checking the respective records at Swiss-Prot. O.lucimarinus and P.aeruginosa were not included in this analysis, as the number of entries for these organisms in this database is relatively low. With default parameters, of the 138 genes identified by TranSyT and not by TransAAP, 122 are associated with some transport mechanism at Swiss-Prot (88.4%), while for relaxed parameters this percentage decreases to 80.3%.
These results can be explained by the number of transport systems annotated at TCDB for each organism and related species. At the moment, there are 137 systems of P.aeruginosa in this database, while O.lucimarinus has only six, and S.cerevisiae, a model organism for yeast, presents 415 transport systems. Hence, it is recommended to use less restrictive parameters when annotating genomes of less-studied organisms. In addition, TransAAP has a better performance with prokaryotes than with eukaryotes, as claimed by the authors.
3.4 Comparison with published model
TranSyT’s annotations for E.coli str. K-12 substr. MG1655 were assessed against the iML1515 GSM model. This model contains 1516 genes, 2712 reactions, and 1877 metabolites, using BiGG identifiers for its reactions and compounds. However, for consistency with the previous studies, ModelSEED was used in TranSyT’s annotations.
Transport reaction information per gene was retrieved from the model using the COBRA Toolbox (Schellenberger et al. 2011). All reactions with metabolites in more than one compartment were considered transport reactions. Under this criterion, the model contains 833 transport reactions, 780 of which are distributed across 499 genes, while 53 transport reactions lack a valid GPR.
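The compartment criterion described above can be illustrated with a short sketch. It uses plain dictionaries rather than any COBRA API, and the reaction, metabolite, and gene identifiers are toy values (assumptions), not entries of iML1515.

```python
def is_transport(reaction):
    """A reaction counts as transport if its metabolites span more than one compartment."""
    return len(set(reaction["metabolites"].values())) > 1


def summarize_transport(reactions):
    """Count transport reactions, split by whether they carry gene associations."""
    transport = [r for r in reactions if is_transport(r)]
    with_genes = [r for r in transport if r["genes"]]
    genes = set()
    for reaction in with_genes:
        genes.update(reaction["genes"])
    return {
        "transport": len(transport),
        "with_gpr": len(with_genes),
        "without_gpr": len(transport) - len(with_genes),
        "genes": len(genes),
    }


# Toy model (assumption): each metabolite id maps to its compartment.
toy_reactions = [
    {"id": "R1", "metabolites": {"m1_e": "e", "m1_c": "c"}, "genes": {"geneA"}},
    {"id": "R2", "metabolites": {"m1_c": "c", "m2_c": "c"}, "genes": {"geneB"}},
    {"id": "R3", "metabolites": {"m3_e": "e", "m3_c": "c"}, "genes": set()},
    {"id": "R4", "metabolites": {"m4_p": "p", "m4_c": "c"}, "genes": {"geneA", "geneC"}},
]

stats = summarize_transport(toy_reactions)
print(stats)
```

In the toy model, `R2` is excluded because all of its metabolites are in the same compartment, and `R3` is a transport reaction without gene associations, analogous to the reactions without a valid GPR mentioned above.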
To evaluate the significance of providing the set of compounds present in the model as input to TranSyT, two scenarios were evaluated. The first scenario was run without the metabolites filter, which led to TranSyT annotating 670 genes as encoding transport proteins and generating 10 164 transport reactions. Among the genes linked to transporters in the model, TranSyT did not provide results for 69 genes. However, it identified 240 new genes that were either not included in the model or were assigned nontransport functions there.
The second scenario was run using the filter and created 1626 transport reactions, 84% fewer than the first scenario. The number of genes with transport reactions was also lower (572); unexpectedly, only two more genes (71 in total) were annotated as transporters in iML1515 but not by TranSyT. A detailed analysis of the results obtained is available in Fig. 5.
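The effect of the metabolites filter can be sketched as follows. This is only an illustration under a stated assumption: a candidate reaction is kept when all of its compounds belong to the supplied set. How TranSyT applies the filter internally is not described here, and the function names and toy identifiers are hypothetical. Only the two reaction counts (10 164 and 1626) come from the text.

```python
def filter_by_compounds(reactions, allowed_compounds):
    """Keep reactions whose compounds are all in the allowed set (assumed rule)."""
    allowed = set(allowed_compounds)
    return {rid: cpds for rid, cpds in reactions.items() if set(cpds) <= allowed}


def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return round(100 * (before - after) / before, 1)


# Toy candidate reactions (assumption): reaction id -> compounds involved.
candidates = {
    "T1": {"cpdA"},
    "T2": {"cpdA", "cpdB"},
    "T3": {"cpdC"},
    "T4": {"cpdB", "cpdD"},
}
model_compounds = {"cpdA", "cpdB"}

kept = filter_by_compounds(candidates, model_compounds)
print(sorted(kept))

# Reaction counts reported for the two scenarios.
print(percent_reduction(10164, 1626))
```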
Figure 5.
TranSyT results for E.coli str. K-12 substr. MG1655 using only compounds present in the iML1515 model. The figure represents the percentage of gene annotations present only in the model, in TranSyT, and both, as well as the comparison of the common transport reactions
TranSyT and iML1515 share a set of genes (447) for which both have transport reactions. For 73% of these genes, the transport reactions match perfectly. A partial match was obtained for 16%, in which TranSyT has reactions for a significant portion of the compounds transported in the model. In these cases, there is usually no clear evidence in the reference databases that the model’s transport reactions should be associated with such genes. Regarding the 4% of gene annotations assessed as incomplete, the compounds are mentioned in the reference databases but are not described in TCDB’s substrates or in MetaCyc. A further 6% of the gene annotations are completely different in iML1515 and TranSyT. In these situations, TranSyT is usually supported by the evidence found in the reference databases. Finally, there are misannotations in both TCDB and MetaCyc that lead to the creation of incorrect transport reactions. Such genes represented 1% of the annotations and were assessed as in “conflict,” where the references have contradictory annotations.
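A simplified way to automate the first pass of such a comparison is sketched below. It only looks at the sets of transported compounds per gene. The 0.5 threshold for a partial match and the three labels are assumptions made for illustration, and the “incomplete” and “conflict” categories are not covered because they depend on evidence in the reference databases.

```python
def classify_match(model_compounds, predicted_compounds, partial_threshold=0.5):
    """Label the agreement between the compounds a gene transports in two sources."""
    model = set(model_compounds)
    predicted = set(predicted_compounds)
    if not model:
        raise ValueError("the model lists no transported compounds for this gene")
    if model <= predicted:
        return "perfect"  # every model compound is also predicted
    coverage = len(model & predicted) / len(model)
    if coverage >= partial_threshold:
        return "partial"  # a large share of the model compounds is covered
    return "different"


# Toy compound sets (assumption).
print(classify_match({"a", "b"}, {"a", "b", "c"}))
print(classify_match({"a", "b", "c"}, {"a", "b"}))
print(classify_match({"a", "b", "c"}, {"d"}))
```

A fuller comparison would also check the transport type and reversibility of each reaction, which this sketch ignores.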
TranSyT assigned reactions to 125 genes that are not available in iML1515. A sample (33 genes) was manually curated, revealing that the reference databases support TranSyT’s classification. The complete annotations are available in sheet C2 of Supplementary Data 3.
A critical feature of TranSyT is the assignment of GPR rules to protein complexes. An example is iML1515’s reaction R_NADH16pp, in which the GPR includes 13 subunits. The same result was obtained in TranSyT, which annotated the reaction with TC number 3.D.1.1.1 and the same set of genes. A second example is the PTS reaction MANptspp, associated with five subunits in the model and three subunits in TCDB (TC number 4.A.6.1.1). In such cases, TranSyT uses KEGG to find the missing subunits. In this example, KEGG KOs K23993 and K02784 are related to TC families 8.A.7 (The Phosphotransferase System Enzyme I Family) and 8.A.8 (The Phosphotransferase System HPr Family), respectively. In addition, TranSyT found a second alternative for the same reaction, for TC number 4.A.1.1.14, with three subunits.
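The structure of such GPR rules can be illustrated with a short sketch: the subunits of one complex are joined with “and”, and alternative complexes for the same reaction are joined with “or”. The gene names below are placeholders and the helper functions are hypothetical; they do not reproduce TranSyT’s implementation.

```python
def build_gpr(complexes):
    """Build a GPR string from a list of complexes (each a list of subunit genes)."""
    terms = []
    for subunits in complexes:
        joined = " and ".join(sorted(subunits))
        # Parentheses are only needed when a complex has several subunits.
        terms.append(f"({joined})" if len(subunits) > 1 else joined)
    return " or ".join(terms)


def is_supported(complexes, present_genes):
    """A reaction is supported if all subunits of at least one complex are present."""
    present = set(present_genes)
    return any(set(subunits) <= present for subunits in complexes)


# Placeholder genes (assumption): a five-subunit complex and a three-subunit alternative.
alternatives = [
    ["geneA", "geneB", "geneC", "geneD", "geneE"],
    ["geneX", "geneY", "geneZ"],
]

print(build_gpr(alternatives))
print(is_supported(alternatives, {"geneX", "geneY", "geneZ"}))
print(is_supported(alternatives, {"geneA", "geneB"}))
```

With this representation, a subunit missing from one source (for instance, one found only through KEGG) is simply appended to the corresponding inner list before the rule is built.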
Finally, the model analysis showed that the iML1515 model does not provide gene rules for 53 transport reactions. TranSyT matched six of these reactions perfectly and generated gene rules for each one. A more thorough analysis showed that TranSyT partially matched 12 of these reactions, for which the difference is either the transport type or the reversibility. The remaining 35 reactions were not available in TranSyT’s output. Detailed information about these reactions is available in sheet C5 of Supplementary Data S3.
4 Conclusions
Taken together, the case studies show that TranSyT’s main limitation is the restricted number of different organisms available in TCDB; it performs best for organisms taxonomically close to E.coli, S.cerevisiae, and Homo sapiens. TranSyT uses the second classification method to mitigate this issue; this method should never replace the primary classifier but should instead complement it. It is recommended to use less restrictive parameters for less-studied organisms. Nevertheless, additional highly curated sources of transport data should be integrated in the future. Also, tools such as TrSSP and TranCEP could be integrated into TranSyT, as these approaches use other algorithms and databases, which would extend TranSyT’s coverage and improve its performance.
Without the metabolites filter, TranSyT returns an excessive number of possible transport reactions. Nevertheless, this output can be a useful resource when investigating an organism’s capability to transport a specific metabolite across its membranes. Since TranSyT’s reactions already provide information regarding the direction of the reactions, the integration of membrane localization information can also be easily achieved using third-party software.
Acknowledgements
The submitted manuscript has been created by UChicago Argonne, LLC as Operator of Argonne National Laboratory (‘Argonne’) under Contract No. DE-AC02-06CH11357 with the U.S. Department of Energy. The U.S. Government retains for itself, and others acting on its behalf, a paid-up, nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan. The authors would like to acknowledge project 22231/01/SAICT/2016: “Biodata.pt—Infraestrutura Portuguesa de Dados Biológicos,” supported by Lisboa Portugal Regional Operational Programme (Lisboa2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF). Oscar Dias acknowledges FCT for the Assistant Research contract obtained under CEEC Individual 2018. The authors would also like to acknowledge the Portuguese Foundation for Science and Technology (FCT) for providing a PhD scholarship to E. Cunha (DFA/BD/8076/2020).
Contributor Information
Emanuel Cunha, Centre of Biological Engineering, University of Minho, Braga 4704-553, Portugal.
Davide Lagoa, Centre of Biological Engineering, University of Minho, Braga 4704-553, Portugal; Computing, Environment, and Life Sciences Division, Argonne National Laboratory, Lemont, IL 60439, United States.
José P Faria, Computing, Environment, and Life Sciences Division, Argonne National Laboratory, Lemont, IL 60439, United States.
Filipe Liu, Computing, Environment, and Life Sciences Division, Argonne National Laboratory, Lemont, IL 60439, United States.
Christopher S Henry, Computing, Environment, and Life Sciences Division, Argonne National Laboratory, Lemont, IL 60439, United States.
Oscar Dias, Centre of Biological Engineering, University of Minho, Braga 4704-553, Portugal; LABBELS—Associate Laboratory, Braga/Guimarães, Portugal.
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest
None declared.
Funding
This study was supported by the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UIDB/04469/2020 unit. This work is supported by the Office of Biological and Environmental Research’s Genomic Science program within the US Department of Energy Office of Science, under award numbers DE-AC02-05CH11231, DE-AC02-06CH11357, DE-AC05-00OR22725, and DE-AC02-98CH10886. E. Cunha was funded by the Portuguese Foundation for Science and Technology, under grant number DFA/BD/8076/2020.
References
- Agren R, Liu L, Shoaie S et al. The RAVEN toolbox and its use for generating a genome-scale metabolic model for Penicillium chrysogenum. PLoS Comput Biol 2013;9:e1002980.
- Alballa M, Aplop F, Butler G. TranCEP: predicting the substrate class of transmembrane transport proteins using compositional, evolutionary, and positional information. PLoS One 2020;15:e0227683.
- Aplop F, Butler G. TransATH: transporter prediction via annotation transfer by homology. ARPN J Eng Appl Sci 2017;12:317–24.
- Arkin AP, Cottingham RW, Henry CS et al. KBase: the United States Department of Energy systems biology knowledgebase. Nat Biotechnol 2018;36:566–9.
- Axe DD, Bailey JE. Transport of lactate and acetate through the energized cytoplasmic membrane of Escherichia coli. Biotechnol Bioeng 1995;47:8–19.
- Aziz RK, Bartels D, Best AA et al. The RAST server: rapid annotations using subsystems technology. BMC Genomics 2008;9:75.
- Barghash A, Helms V. Transferring functional annotations of membrane transporters on the basis of sequence similarity and sequence motifs. BMC Bioinf 2013;14:1–11.
- Bateman A, Martin MJ, Orchard S et al. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 2021;49:D480–9.
- Brohée S, Barriot R, Moreau Y et al. YTPdb: a wiki database of yeast membrane transporters. Biochim Biophys Acta 2010;1798:1908–12.
- Capela J, Lagoa D, Rodrigues R et al. Merlin, an improved framework for the reconstruction of high-quality genome-scale metabolic models. Nucleic Acids Res 2022;50:6052–66.
- Cardenas J, Da Silva NA. Engineering cofactor and transport mechanisms in Saccharomyces cerevisiae for enhanced acetyl-CoA and polyketide biosynthesis. Metab Eng 2016;36:80–9.
- Caspi R, Billington R, Keseler IM et al. The MetaCyc database of metabolic pathways and enzymes – a 2019 update. Nucleic Acids Res 2020;48:D445–53.
- Chang A, Jeske L, Ulbrich S et al. BRENDA, the ELIXIR core data resource in 2021: new developments and updates. Nucleic Acids Res 2021;49:D498–508.
- Dias O, Gomes D, Vilaca P et al. Genome-wide semi-automated annotation of transporter systems. IEEE/ACM Trans Comput Biol Bioinform 2017;14:443–56.
- Doshi R, Nguyen T, Chang G. Transporter-mediated biofuel secretion. Proc Natl Acad Sci USA 2013;110:7642–7.
- Elbourne LDH, Tetu SG, Hassan KA et al. TransportDB 2.0: a database for exploring membrane transporters in sequenced genomes from all domains of life. Nucleic Acids Res 2017;45:D320–4.
- Elbourne LDH, Wilson-Mortier B, Ren Q et al. TransAAP: an automated annotation pipeline for membrane transporter prediction in bacterial genomes. Microb Genomics 2023;9:000927.
- Hamilton JJ, Reed JL. Software platforms to facilitate reconstructing genome-scale metabolic networks. Environ Microbiol 2014;16:49–59.
- Henry CS, Dejongh M, Best AA et al. High-throughput generation, optimization and analysis of genome-scale metabolic models. Nat Biotechnol 2010;28:977–82.
- Kanehisa M, Goto S, Kawashima S et al. The KEGG resource for deciphering the genome. Nucleic Acids Res 2004;32:D277–80.
- Karp PD. Pathway databases: a case study in computational symbolic theories. Science 2001;293:2040–4.
- King ZA, Lu J, Dräger A et al. BiGG models: a platform for integrating, standardizing and sharing genome-scale models. Nucleic Acids Res 2016;44:D515–22.
- Kwon C, Yun HS. Plant exocytic secretion of toxic compounds for defense. Toxicol Res 2014;30:77–81.
- Lee TJ, Paulsen I, Karp P. Annotation-based inference of transporter function. Bioinformatics 2008;24:i259–67.
- Li W, Ma L, Shen X et al. Targeting metabolic driving and intermediate influx in lysine catabolism for high-level glutarate production. Nat Commun 2019;10:3337.
- Li W, O'Neill KR, Haft DH et al. RefSeq: expanding the prokaryotic genome annotation pipeline reach with protein family model curation. Nucleic Acids Res 2021;49:D1020–8.
- Liu F. Evaluation and development of algorithms and computational tools for metabolic pathway optimization. Ph.D. Thesis, University of Minho School of Engineering, 2018.
- McCracken AN, Edinger AL. Nutrient transporters: the Achilles’ heel of anabolism. Trends Endocrinol Metab 2013;24:200–8.
- Mishra NK, Chang J, Zhao PX. Prediction of membrane transport proteins and their substrate specificities using primary sequence information. PLoS One 2014;9:e100278.
- Monk JM, Lloyd CJ, Brunk E et al. iML1515, a knowledgebase that computes Escherichia coli traits. Nat Biotechnol 2017;35:904–8.
- Park M, Li Q, Shcheynikov N et al. NaBC1 is a ubiquitous electrogenic Na+-coupled borate transporter essential for cellular boron homeostasis and cell growth and proliferation. Mol Cell 2004;16:331–41.
- Quentin Y, Fichant G. ABCdb: an ABC transporter database. J Mol Microbiol Biotechnol 2000;2:501–4.
- Record M, Carayon K, Poirot M et al. Exosomes as new vesicular lipid transporters involved in cell–cell communication and various pathophysiologies. Biochim Biophys Acta 2014;1841:108–20.
- Rink L, Haase H. Zinc homeostasis and immunity. Trends Immunol 2007;28:1–4.
- Saier MH, Tran CV, Barabote RD. TCDB: the transporter classification database for membrane transport protein analyses and information. Nucleic Acids Res 2006;34:D181–6.
- Schein JR, Hunt KA, Minton JA et al. The nucleobase cation symporter 1 of Chlamydomonas reinhardtii and that of the evolutionarily distant Arabidopsis thaliana display parallel function and establish a plant-specific solute transport profile. Plant Physiol Biochem 2013;70:52–60.
- Schellenberger J, Que R, Fleming RMT et al. Quantitative prediction of cellular metabolism with constraint-based models: the COBRA toolbox v2.0. Nat Protoc 2011;6:1290–307.
- Schwacke R, Schneider A, van der Graaff E et al. ARAMEMNON, a novel database for Arabidopsis integral membrane proteins. Plant Physiol 2003;131:16–26.
- Swainston N, Smallbone K, Mendes P et al. The SuBliMinaL toolbox: automating steps in the reconstruction of metabolic networks. J Integr Bioinform 2011;8:186.
- Thiele I, Palsson BØ. A protocol for generating a high-quality genome-scale metabolic reconstruction. Nat Protoc 2010;5:93–121.
- Versaw WK, Garcia LR. Intracellular transport and compartmentation of phosphate in plants. Curr Opin Plant Biol 2017;39:25–30.