Abstract
Motivation: Misannotation in sequence databases is an important obstacle for automated tools for gene function annotation, which rely extensively on comparison with sequences with known function. To improve current annotations and prevent future propagation of errors, sequence-independent tools are, therefore, needed to assist in the identification of misannotated gene products. In the case of enzymatic functions, each functional assignment implies the existence of a reaction within the organism’s metabolic network; a first approximation to a genome-scale metabolic model can be obtained directly from an automated genome annotation. Any obvious problems in the network, such as dead end or disconnected reactions, can, therefore, be strong indications of misannotation.
Results: We demonstrate that a machine-learning approach using only network topological features can successfully predict the validity of enzyme annotations. The predictions are tested at three different levels. A random forest using topological features of the metabolic network and trained on curated sets of correct and incorrect enzyme assignments was found to have an accuracy of up to 86% in 5-fold cross-validation experiments. Further cross-validation against unseen enzyme superfamilies indicates that this classifier can successfully extrapolate beyond the classes of enzyme present in the training data. The random forest model was applied to several automated genome annotations, achieving an accuracy of in most cases when validated against recent genome-scale metabolic models. We also observe that when applied to draft metabolic networks for multiple species, a clear negative correlation is observed between predicted annotation quality and phylogenetic distance to the major model organism for biochemistry (Escherichia coli for prokaryotes and Homo sapiens for eukaryotes).
Contact: j.pinney@imperial.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Misannotation in sequence databases has been a recognized problem for more than a decade. Early studies reported the emergence of this issue (Brenner et al., 1999; Galperin et al., 1998) and estimated that up to 30% of proteins were misannotated in public databases (Devos and Valencia, 2001). More recent studies have confirmed that this problem is still a reality (Jones et al., 2007) and some even suggest that it has been getting worse over time (Schnoes et al., 2009), identifying overprediction and error propagation as the main sources of error. As experimental verification of gene function is expected to remain a highly time-consuming process, it is unlikely that it will be able to keep pace with the increasing amount of genome sequence data being deposited in public databases. More accurate computational methods for functional annotation and assessment of confidence in gene annotations are, therefore, increasingly necessary.
In the area of automated functional annotation, several approaches moving beyond basic sequence similarity are now available (Jones et al., 2007). Some recent annotation software will classify proteins based on locally conserved sequence patterns that are normally related with function (Forslund and Sonnhammer, 2008). Other approaches take into account the evolutionary relationships between proteins by integrating evidence across phylogenetic trees (Engelhardt et al., 2009) or use additional information, such as protein–protein interaction data (Ta and Holm, 2009) or genomic correlations (Hsiao et al., 2010).
However, functional annotation is still mainly based on sequence similarity. Given this fact, the accuracy of existing annotations has a crucial impact on that of future annotations (Jones et al., 2007). This dependency can lead to error propagation and a consequent increase in the number of annotation errors (Gilks et al., 2002). Moreover, as information on the origin of annotation is often scarce, this error propagation does not have an easy solution. The problem becomes even clearer when we note that the proportion of manually annotated proteins is <5% and continues to decrease (Frishman, 2007).
Any evidence that is independent of sequence may, therefore, be useful for discriminating between true and false functional annotations. The concept of gene function implies interaction with some part of the cell or the environment, and almost all functions of interest are the result of interactions among several components (Hartwell et al., 1999). Modelling these interactions by means of networks and studying their topological properties is, therefore, one way to understand the context of these molecular functions.
One easily accessible example of a well-defined molecular network derived from a set of gene annotations is a draft metabolic network, such as those available in the KEGG database (Kanehisa and Goto, 2000; Kanehisa et al., 2006, 2008). The topological properties of these networks have been studied previously in the contexts of network evolution (Wagner and Fell, 2001) and drug target discovery (Yeh et al., 2004). For example, the metabolic networks of parasitic species are known to be distinguishable from non-parasitic species on the basis of their topology (Borenstein and Feldman 2009; Nerima et al., 2010). Intuitively, any problems in such a network, for example, dead ends or disconnected components, could be an indication of misannotation (Poolman et al., 2006). However, each individual type of evidence can be relatively weak (e.g. dead ends may also be due to the uptake of nutrients from the environment) and difficult to discern by manual inspection.
In this work, we propose a supervised machine-learning methodology to assess the accuracy of assigned molecular functions, based on simple topological properties of an organism’s draft metabolic network. We show that our approach is able to separate correct annotations from incorrect ones with accuracy of up to 86%. Being entirely independent of sequence properties, it can be used to complement existing approaches and, hence, contribute to the detection and correction of errors in functional annotation.
2 METHODS
2.1 Metabolic networks
Bipartite (reaction and compound) graphs were used to represent metabolic networks, generated using the KEGG LIGAND database (Kanehisa et al., 2008). To reconstruct the metabolic network for each species, all gene functions annotated for that species were collected. The reactions mapped to each function were then retrieved. Finally, the compounds attached to each reaction were added to produce a bipartite metabolic network for each species. All reactions were considered as being reversible. Network topological properties were calculated using the NetworkX library in Python.
2.2 Training data
Schnoes et al. (2009) previously examined the annotation errors in four large public protein databases (KEGG, GenBank NR, UniprotKB/TrEMBL and UniProtKB/SwissProt). From their correct and incorrect annotation data, only the annotations with EC number were considered. In total, there were 834 correct and 477 incorrect annotations from six different superfamilies. For each annotation, the dataset presents the species, KEGG KO group, EC number and the part of the protocol that the annotation failed to pass. Each annotated function was mapped to a reaction according to KEGG. Where an EC function was mapped to more than one reaction, one of these was chosen at random. To evaluate the topological properties of each of the annotations, KEGG species networks were used.
2.3 Machine learning
The approach used to separate correct from incorrect annotations was the random forest. A random forest is an ensemble of decision trees. During the training process, to achieve a variety of different decision trees, a random subset of parameters is selected for each node. Afterwards, as in a standard decision tree, the parameter chosen at each node is the one that most increases the entropy. To predict the label of an entry, the entry is assessed by every tree of the ensemble. The distribution of label votes returned is the random forest prediction. In our case, the probability of an annotation being correct is taken as the proportion of trees that labelled it as correct.
The random forest used was the one implemented in the randomForest R package (Liaw and Wiener, 2002). The algorithm implemented is as described in Breiman (2001). The parameters used in both the randomForest and predict functions were the default ones. For building the receiver–operator characteristic (ROC) curves, the type = ‘prob’ option in the predict function was used.
2.3.1 The 5-fold cross-validation
The cross-validation process used was to start with the original data (D) and divide it in five equal sets. Each of the sets was used as an independent test set (Dtest). The random forest algorithm considering all available features was applied to the remaining four sets (Dtrain). The random forest predictor built was then tested on Dtest.
2.3.2 Inter-superfamily cross-validation
The training data were grouped by enzyme superfamily. Because of the paucity of data in most superfamilies, only the four most populated superfamilies were taken forwards to cross-validation. Each superfamily in turn was removed from the balanced dataset SF to form the test set SFtest. The random forest algorithm was applied to the remainder (SFtrain). The model built was then tested on SFtest.
2.3.3 Final classifier
The random forest was trained on the whole of the original data using all the features. The importance function from the randomForest R package was used to assess each feature’s individual performance after training the model with the full training set.
2.4 Comparison against curated models
To further validate the classifier, it was applied to 24 KEGG metabolic networks, and the results were compared with curated genome-scale metabolic models for these species (Table 4). The species used were the set with whole-genome models listed in Feist et al. (2009) for which functions were labelled with EC numbers. For each KEGG model considered, each annotated function was mapped to a reaction according to KEGG. Where an EC function was mapped to more than one reaction, one of these was chosen at random. The classifier was applied to these KEGG data, and the results were compared with the curated models, verifying the presence or absence in the curated models of the functions assigned in the KEGG models.
Table 4.
KEGG ID | Species name | AUC | Citation |
---|---|---|---|
ani | Aspergillus nidulans | 0.56 | David et al., 2008 |
ath | Arabidopsis thaliana | 0.57 | de Oliveira Dal’Molin et al., 2010 |
bsu | Bacillus subtilis | 0.61 | Oh et al., 2007 |
buc | Buchnera aphidicola | 0.68 | Thomas et al., 2009 |
det | Dehalococcoides ethenogenes | 0.60 | Islam et al., 2010 |
eco | E.coli K-12 | 0.55 | Reed et al., 2003 |
hsl | Halobacterium salinarum | 0.60 | Gonzalez et al., 2008 |
lpl | Lactobacillus plantarum | 0.64 | Teusink et al., 2006 |
mge | M.genitalium | 0.43 | Suthers et al., 2009 |
nme | Neisseria meningitidis | 0.58 | Baart et al., 2007 |
nph | Natronomonas pharaonis | 0.60 | Gonzalez et al., 2010 |
pfa | P.falciparum | 0.59 | Plata et al., 2010 |
pgi | Porphyromonas gingivalis | 0.60 | Mazumdar et al., 2009 |
pic | Pichia stipitis | 0.48 | Caspeta et al., 2012 |
sau | Staphylococcus aureus | 0.52 | Lee et al., 2009 |
sce | S.cerevisiae | 0.56 | Herrgård et al., 2008 |
sce | S.cerevisiae | 0.53 | Förster et al., 2003 |
sco | Streptomyces coelicolor | 0.64 | Borodina et al., 2005 |
sco | S.coelicolor | 0.63 | Alam et al., 2010 |
son | Shewanella oneidensis | 0.55 | Pinchuk et al., 2010 |
syn | Synechocystis PCC6803 | 0.57 | Nogales et al., 2012 |
vvu | Vibrio vulnificus | 0.52 | Kim et al., 2011 |
ypm | Yersinia pestis | 0.55 | Navid and Almaas, 2009 |
zmo | Zymomonas mobilis | 0.61 | Widiastuti et al., 2011 |
Note: The final classifier was applied to KEGG metabolic models, and the results were compared with curated genome-scale metabolic models for these species.
2.5 Tree of life analysis
Ciccarelli et al. (2006) have reconstructed a highly resolved tree of life. Their species tree is built from a concatenation of 31 unambiguous orthologues present in 191 species. This tree and the multiple alignments used to build it were downloaded from iTOL (Letunic and Bork, 2007, 2011). iTOL also provides other types of data related to these species, including genome sizes, domains per genome and publication dates. The multiple alignment was used to calculate the distances between the species using protdist from PHYLIP (Felsenstein, 1993), a package of programs for inferring phylogenies. The classifier was applied to the metabolic networks present in KEGG for each species included in the iTOL phylogeny.
3 RESULTS AND DISCUSSION
In this study, metabolic networks are represented by bipartite digraphs (with nodes for each reaction and compound). A network was built for each organism in the study, based on a template taken from the KEGG LIGAND database (Kanehisa et al., 2008).
As with any supervised machine-learning task, it is necessary to choose a machine-learning method and a set of features from which to learn. The random forest (Breiman, 2001) was found to be a suitable machine-learning approach for our aims. The advantages of using random forests in this work are their ability to process both numerical and categorical data and the interpretability of their output (a so-called ‘white box’ model). In contrast to other machine-learning methods, such as neural networks or support vector machines, random forests can provide insights into the signals that are useful for classification.
Training and testing data sets were taken from the work of Schnoes et al. (2009), which provide gold-standard sets of correct and incorrect EC number assignments within 331 species in KEGG, across six enzyme superfamilies. In addition to sequence similarity approaches at the superfamily and family levels, the authors used information on functionally important residues to infer misannotations, making this one of the most reliable data sources suitable for our purposes.
3.1 Features
In total, 22 different network topological features were considered in training the classifier. These features can be placed into three broad groups: local, semi-local and global features (Table 1).
Table 1.
Group | Feature | Definition |
---|---|---|
1 | m | Number of compounds connected to >2 reactions. |
u | Number of unpaired compounds. | |
t | Reaction type: 1—unpaired compounds on both sides of the reaction, 2—unpaired compounds on only one side, 3—no unpaired compounds. | |
h | Number of chokepoint compounds. | |
c | Number of compounds. | |
Number of compounds connected to >2 and <10 reactions. | ||
Number of compounds connected to 10–50 reactions | ||
Number of compounds connected to >50 reactions. | ||
R | Number of other reactions sharing a compound with this reaction. | |
Mean number of other reactions connected to each compound. | ||
r1 | Number of connections of the least connected compound. | |
r2 | Number of connections of the second least connected compound. | |
r3 | Number of connections of the third least connected compound. | |
r4 | Number of connections of the fourth least connected compound. | |
2 | e | Eccentricity using unweighted edges, |
Normalized eccentricity using unweighted edges. | ||
ew | Eccentricity using weighted edges | |
Normalized eccentricity using weighted edges | ||
b | Betweenness using unweighted edges | |
bw | Betweenness using weighted edges | |
N | Number of reactions in the connected component. | |
3 | Fraction of reactions of type 1 or 2 in the network. | |
4 | G | Domain: 1—Bacteria, 2—Eukaryota, 3—Archaea. |
D | 1—species is related to disease, 0—species is not related to disease. |
Note: The features chosen were divided into four groups as shown: 1—local, 2—semi-local, 3—global and 4—non-topological.
Local topological features capture the properties of the immediate neighbourhood of each reaction. Several of these features are related to the compounds involved in the reaction, each of which can be classified according to its connectivity (degree) as an unpaired, chokepoint or ‘normal’ metabolite (Supplementary Fig. S1). Based on this classification, several integer attributes were defined for each reaction. We noticed that the connectivity of compounds involved in a reaction tends to vary depending on enzyme class; therefore, four additional features were defined to capture this variation. These features correspond to the ranked connectivities of the reaction’s four least-connected compounds.
The semi-local topological features describe the position of each reaction within the network. These features are based on the graph theoretical concepts of betweenness centrality and eccentricity. The betweenness of a node is the fraction of shortest paths (geodesics) between all pairs of nodes in the network that include that node, whereas the eccentricity of a node is the length of the longest geodesic between the node and all other nodes in the network. In both cases, these values were also calculated including weights on the edges of the networks. Weighted metabolic networks have previously proved useful in the automatic identification of biologically meaningful pathways within a metabolic network (Croes et al., 2006). This is a simple way to exclude spurious links via highly connected compounds, such as water or adenosine triphosphate. Here, we place a weight on each compound equal to its connectivity. To take variations in network size into account, we also considered a variant of eccentricity that is normalized by dividing by the diameter of the connected component to which the reaction belongs.
In addition to these reaction-based features, some global topological features of the network may be relevant, for example, if the amount of human curation varies between species. We use the proportion of reactions that have a dead-end compound on one or both sides as a proxy for the overall reliability of the network.
Two non-topological features were also considered: taxonomic domain (Archaea, Bacteria or Eukaryota) and whether the organism is implicated in a disease. The reason for including these two features was to allow for potential topological differences between different domains and between pathogens and non-pathogens. It has previously been shown that metabolic network topology can be affected by variations in the selection pressures experienced during evolution (Borenstein and Feldman, 2009; Kreimer et al., 2008; Parter et al., 2007).
To gain intuition of which features may have a bigger influence on the results, the performance of each individual feature was evaluated independently. Histograms of the correct and incorrect annotation data provide a visual summary (Fig. 1 and Supplementary Fig. S2). A quantitative evaluation of each feature’s performance was also obtained using the importance function from the randomForest package (Liaw and Wiener, 2002). This function evaluates the accuracy decrease and the entropy decrease when each feature is left out, with results shown in Supplementary Figure S3.
All metrics show a similar ranking between the features, with those based on the concepts of betweenness and eccentricity seen to be the most highly predictive. It is possible that these semi-local features are able to capture relevant differences in relative network position (e.g. higher eccentricity indicates reactions that lie towards the periphery of the network). The weighted network factor seems to improve the performance of both eccentricity and betweenness features, although it is more clearly seen in the case of eccentricity.
The taxonomic domain is the least informative feature. This may imply that the features already considered, such as the connected component size, may already be capturing any differences between species from different domains. The same might be happening with the disease-related feature. For example, parasitic species may be expected to have a larger number of unpaired compounds and smaller connected components, making this feature less informative. However, both features still show some predictive power.
3.2 Cross-validation
The performance of the classifier on unseen data was assessed using two types of cross-validation. In 5-fold cross-validation experiments (Table 2), the model obtained has an accuracy of ∼86%. Supplementary Figure S4 shows the ROC curves obtained for each of the cross-validation folds. The mean area under the ROC curve (AUROC) was 0.92%. Another important aspect of performance is how well the predictor would be expected to perform on enzymes from unseen superfamilies. To this end, a second cross-validation was performed using the four most represented superfamilies present in the Schnoes et al. dataset: Enolase, Vicinal Oxygen Chelate, Haloacid Dehalogenase and Amidohydrolase. The cross-validation used the enzymes from three out of the four superfamilies as a training set and tested on the enzymes from the fourth (Table 3). In this experiment, with the exception of the Vicinal Oxygen Chelate superfamily, the accuracy of the predictor was consistently >60%. Supplementary Figure S5 shows the ROC curves for each superfamily. The area under the curve varied between 0.59 and 0.68. These results suggest that the functional classes covered in the training data do have an effect on the rules obtained. For example, enzyme classes may occupy topologically distinct positions in the network, and/or be subject to particular types of misannotation. However, these results indicate that the classifier trained on the entire available data set should still be informative when applied more generally.
Table 2.
Mean (SD) | |
---|---|
Accuracy | 0.86 (0.005) |
Precision | 0.91 (0.009) |
Recall | 0.88 (0.011) |
AUC | 0.92 (0.007) |
Note: The predictive model performance was assessed by a 5-fold cross-validation. The table shows the accuracy, precision, recall and AUC of this analysis and their standard deviations.
Table 3.
Superfamily | Accuracy | Precision | Recall | AUC |
---|---|---|---|---|
Enolase | 0.60 | 0.57 | 0.97 | 0.60 |
Vicinal oxygen chelate | 0.52 | 0.86 | 0.51 | 0.59 |
Haloacid dehalogenase | 0.60 | 0.77 | 0.46 | 0.67 |
Amidohydrolase | 0.66 | 0.69 | 0.74 | 0.68 |
Note: To test performance on unseen enzyme classes, the classifier was assessed in a leave-one-out cross-validation at the superfamily level. The table shows the accuracy, precision, recall and the AUC of each analysis, where each superfamily in turn was used as the test dataset.
3.3 Comparison to a manually curated network
To assess the performance of the model, the classifier was applied to the 24 KEGG genome annotations. These results were compared with recent manually curated genome-scale metabolic models as gold standards (Table 4 and Supplementary Fig. S6). The species used were the whole-genome models listed in Feist et al. (2009) for which enzyme functions were labelled with EC numbers. The AUC results were consistently >0.5, showing a performance better than random. In fact, in almost half of the species tested, the classifier produced an AUC of ≥0.6. There were only two cases where AUC was found to be <0.5. The worst result was found with Mycoplasma genitalium, perhaps related to the fact that this is the smallest prokaryote genome sequenced.
3.4 Case study: an atypical orthologue
An interesting example of the successful identification of an unexpected enzyme function is given by Dittrich et al. (2008). This work was based on the idea that an evolving enzyme has more chance to acquire the function of structurally similar enzymes. A bioinformatic protocol was followed to draw up a shortlist of candidate functional analogues of a missing enzyme (dihydroneopterin aldolase, DHNA) in the Plasmodium falciparum folate biosynthesis pathway.
During the process, the authors found two candidates for filling the role of the missing enzyme. Both enzymes already had an assigned function in KEGG: PFF1360w is annotated as a putative 6-pyruvoyl tetrahydropterin synthase (PTPS) and PFL1155w as GTP cyclohydrolase I (GTPCH-I). Although PFF1360w was subsequently experimentally validated as performing the missing DHNA function, KEGG has not yet updated this annotation. This enables us to apply the classifier to the KEGG P.falciparum metabolic network to study this case.
Taking a closer look at the two annotated reactions in their network context (Supplementary Fig. S14), it can be seen that the PTPS reaction seems to be a dead end, indicating that this annotation is unlikely to be correct. In contrast, the GTPCH-I enzyme not only has its reactants produced and its products consumed, as seen in the figure, but is also assigned to four chokepoint reactions.
Applying our classifier to these two enzymatic functions, it returned a probability of 0.94 for the GTPCH-I reaction, indicating that this function seems to make biological sense within its network context. On the other hand, the PTPS reaction scores only a probability of 0.21 to be a correct annotation. This simple case study shows that the classifier has successfully captured the same network topological features that provided evidence for an incorrect annotation in the published manual analysis of this enzyme.
3.5 Comparison of predicted annotation quality across multiple species
To investigate how annotation quality varies between species, the classifier was applied to the KEGG metabolic networks of the species present in the tree of life provided by iTOL (Letunic and Bork, 2007, 2011). The proportion of enzymatic functions predicted to be correctly annotated in the network of each species (i.e. the predicted precision of the set of enzymatic functions reported by KEGG for that organism) was taken as a measure of annotation quality. Figure 2 shows the prokaryote phylogenetic tree and quality scores for each of the species. The Escherichia coli strains and the most closely related species produce the highest scores, indicating their higher levels of curation. With the exception of Chlamydiae/Verrucomicrobia and the Cyanobacteria, all phyla show a wide variety of quality scores.
The number of eukaryotic species provided by iTOL is much smaller than the number of prokaryotes. Supplementary Figure S7 shows the eukaryote phylogenetic tree and the quality scores of the KEGG metabolic networks for each of the species. The vertebrates and plants produce higher scores than the other species. An unexpected result is the relatively low scores reported for Saccharomyces cerevisiae and Drosophila melanogaster (both 0.73), especially when compared with those achieved by the vertebrates. However, this most probably reflects the massive amount of study that human biochemistry has received relative to any other eukaryote, including these two important model organisms.
It is reasonable to expect that the quality of a draft metabolic network should be better for species that are closely related to organisms with well characterized biochemistry. Figure 3 shows that this is indeed the case: there is a clear negative correlation () between the predicted annotation quality in prokaryotes and the phylogenetic distance to E.coli and an even stronger negative correlation () between the predicted annotation quality in eukaryotes and the phylogenetic distance to Homo sapiens.
To check for any dependency between annotation quality and genome size, a similar scatter plot was drawn (Fig. 4). Although a positive correlation is present, this may be partially explained by other factors. In particular, the intracellular obligate species (highlighted in green in Fig. 4) and the well-curated species (highlighted in orange), constituted by the E.coli strains and closely related species (Salmonella and Yersinia), have distinctly low- and high-quality scores, respectively. As intracellular obligate species will tend to have lost many genes that are necessary for free-living organisms (Ochman and Moran, 2001), their genomes are smaller than average: intracellular obligates are almost exclusively at the bottom left of the plot. The low-quality scores for this group of species (Fig. 4) may indicate either an increased difficulty in reconstructing their metabolic networks by automatic methods or simply the known general topological differences between their metabolic networks and those of the other prokaryotes (Ochman and Moran, 2001). These two groups of species tend to enhance the correlation between predicted annotation quality and genome size. Without these species, the correlation becomes slightly weaker (changing from to ).
In addition to the intracellular obligates and well-studied bacteria, the box plots in Figure 5 show the predicted annotation quality for two further sets of species: those with available manually curated genome-scale reconstructions (GENREs) (Price et al., 2004) and those that are facultatively intracellular. We can clearly see the low-quality scores in the obligate (although not the facultative) intracellular species (P = 1.16e-08) and the high accuracy scores in the well-studied species set (P = 3.06e-06). However, the extra curation possibly provided by the existence of a GENRE is not seen to be reflected in the semi-automated annotations within KEGG.
For prokaryotes, possible dependencies on other species attributes were also considered: motility, phylum, pathogenicity, oxygen requirement and habitat (Supplementary Figs S8–S12). The quality scores do not seem to depend on these attributes, with the exception of habitat: the species living in specialized habitats have lower accuracy scores compared with all other species (P = 4.33e-08). As stated earlier in the text, specialized environments may be responsible for differences in selective pressures that could result in detectable differences in metabolic network topologies.
The possible link between annotation quality and genome size was also checked in eukaryotes. As show in Figure 4, a positive correlation is present. However, closer inspection shows that there are two well-defined groups that contribute to this correlation. Towards the bottom left (small genomes, low-annotation quality) are the protists and the fungal species, and at the top right are a group of animals (mostly vertebrates) and plants. Taken together with the fact that the number of species present is small, there does not seem to be strong evidence for a direct link between genome size and annotation quality in eukaryotes.
For both eukaryotes and prokaryotes, other possible dependencies were studied, including the number of publications found in PubMed for each species and the year that the genomes considered were published. However, no significant correlations were found between the quality of the model and these factors (Supplementary Figs S15–S17).
4 CONCLUSION
Our results have demonstrated that simple topological features can be used to predict incorrect functional annotations within metabolic networks. The random forest classifier has not only achieved high overall cross-validation accuracy but has also been shown to be informative when applied to enzymes belonging to superfamilies that were not used in training. This approach is entirely independent of sequence properties; hence, it could be used to support automated metabolic reconstruction pipelines, as well as helping to identify incorrectly annotated enzymes within public databases. Subsequent improvements in the accuracy of the genome-scale metabolic models obtained will be of benefit in their downstream analysis, for example, using constraint-based methods, such as flux balance analysis (Oberhardt et al., 2009).
For both prokaryotes and eukaryotes, it seems that the quality of automated metabolic reconstruction decreases with phylogenetic distance to the major model organism for biochemistry, E.coli and human, respectively. However, differences in network topology between free-living organisms and obligate intracellular species may make the classifier less accurate when applied to the latter group of species. Given a larger amount of training data, it should be possible to produce separate classifiers for each of these two groups.
Funding: R.L. is funded by a studentship from the Fundação para a Ciência e a Tecnologia, Portugal. J.W.P. is funded by a University Research Fellowship from the Royal Society.
Conflict of Interest: none declared.
Supplementary Material
REFERENCES
- Alam M, et al. Metabolic modeling and analysis of the metabolic switch in Streptomyces coelicolor. BMC Genomics. 2010;11:202. doi: 10.1186/1471-2164-11-202. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Baart G, et al. Modeling Neisseria meningitidis metabolism: from genome to metabolic fluxes. Genome Biol. 2007;8:R136. doi: 10.1186/gb-2007-8-7-r136. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Borenstein E, Feldman MW. Topological signatures of species interactions in metabolic networks. J. Comput. Biol. 2009;16:191–200. doi: 10.1089/cmb.2008.06TT. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Borodina I, et al. Genome-scale analysis of Streptomyces coelicolor a3 (2) metabolism. Genome Res. 2005;15:820–829. doi: 10.1101/gr.3364705. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Breiman L. Random forests. Mach. Learn. 2001;45:5–32. [Google Scholar]
- Brenner S, et al. Errors in genome annotation. Trends Genet. 1999;15:132–133. doi: 10.1016/s0168-9525(99)01706-0. [DOI] [PubMed] [Google Scholar]
- Caspeta L, et al. Genome-scale metabolic reconstructions of Pichia stipitis and Pichia pastoris and in silico evaluation of their potentials. BMC Syst. Biol. 2012;6:24. doi: 10.1186/1752-0509-6-24. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ciccarelli FD, et al. Toward automatic reconstruction of a highly resolved tree of life. Science. 2006;311:1283–1287. doi: 10.1126/science.1123061. [DOI] [PubMed] [Google Scholar]
- Croes D, et al. Inferring meaningful pathways in weighted metabolic networks. J. Mol. Biol. 2006;356:222–236. doi: 10.1016/j.jmb.2005.09.079. [DOI] [PubMed] [Google Scholar]
- David H, et al. Analysis of Aspergillus nidulans metabolism at the genome-scale. BMC Genomics. 2008;9:163. doi: 10.1186/1471-2164-9-163. [DOI] [PMC free article] [PubMed] [Google Scholar]
- de Oliveira Dal’Molin C, et al. Aragem, a genome-scale reconstruction of the primary metabolic network in Arabidopsis. Plant Physiol. 2010;152:579–589. doi: 10.1104/pp.109.148817. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Devos D, Valencia A. Intrinsic errors in genome annotation. Trends Genet. 2001;17:429–431. doi: 10.1016/s0168-9525(01)02348-4. [DOI] [PubMed] [Google Scholar]
- Dittrich S, et al. An atypical orthologue of 6-pyruvoyltetrahydropterin synthase can provide the missing link in the folate biosynthesis pathway of malaria parasites. Mol. Microbiol. 2008;67:609–618. doi: 10.1111/j.1365-2958.2007.06073.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Engelhardt BE, et al. Phylogenetic molecular function annotation. J. Phys. 2009;180:12024. doi: 10.1088/1742-6596/180/1/012024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Feist AM, et al. Reconstruction of biochemical networks in microorganisms. Nat Rev. Microbiol. 2009;7:129–143. doi: 10.1038/nrmicro1949. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Felsenstein J. PHYLIP - Phylogeny Inference Package (Version 3.5) 1993. Distributed by the author. Department of Genetics, University of Washington, Seattle. [Google Scholar]
- Forslund K, Sonnhammer EL. Predicting protein function from domain content. Bioinformatics. 2008;24:1681–1687. doi: 10.1093/bioinformatics/btn312. [DOI] [PubMed] [Google Scholar]
- Förster J, et al. Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome Res. 2003;13:244–253. doi: 10.1101/gr.234503. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Frishman D. Protein annotation at genomic scale: the current status. Chem. Rev. 2007;107:3448–3466. doi: 10.1021/cr068303k. [DOI] [PubMed] [Google Scholar]
- Galperin M, et al. Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biol. 1998;1:55–67. [PubMed] [Google Scholar]
- Gilks WR, et al. Modeling the percolation of annotation errors in a database of protein sequences. Bioinformatics. 2002;18:1641–1649. doi: 10.1093/bioinformatics/18.12.1641. [DOI] [PubMed] [Google Scholar]
- Gonzalez O, et al. Reconstruction, modeling & analysis of Halobacterium salinarum r-1 metabolism. Mol. BioSyst. 2008;4:148–159. doi: 10.1039/b715203e. [DOI] [PubMed] [Google Scholar]
- Gonzalez O, et al. Characterization of growth and metabolism of the haloalkaliphile Natronomonas pharaonis. PLoS Comput. Biol. 2010;6:e1000799. doi: 10.1371/journal.pcbi.1000799. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hartwell LH, et al. From molecular to modular cell biology. Nature. 1999;402(Suppl. 6761):C47–C52. doi: 10.1038/35011540. [DOI] [PubMed] [Google Scholar]
- Herrgård M, et al. A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nat. Biotechnol. 2008;26:1155–1160. doi: 10.1038/nbt1492. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hsiao TL, et al. Automatic policing of biochemical annotations using genomic correlations. Nat. Chem. Biol. 2010;6:34–40. doi: 10.1038/nchembio.266. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Islam M, et al. Characterizing the metabolism of Dehalococcoides with a constraint-based model. PLoS Comput. Biol. 2010;6:e1000887. doi: 10.1371/journal.pcbi.1000887. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jones CE, et al. Estimating the annotation error rate of curated go database sequence annotations. BMC Bioinform. 2007;8:170. doi: 10.1186/1471-2105-8-170. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kanehisa M, et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006;34:D354–D357. doi: 10.1093/nar/gkj102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kanehisa M, et al. KEGG for linking genomes to life and the environment. Nucleic Acids Res. 2008;36:D480–D484. doi: 10.1093/nar/gkm882. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30. doi: 10.1093/nar/28.1.27. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kim H, et al. Integrative genome-scale metabolic analysis of Vibrio vulnificus for drug targeting and discovery. Mol. Syst. Biol. 2011;7:460. doi: 10.1038/msb.2010.115. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kreimer A, et al. The evolution of modularity in bacterial metabolic networks. Proc. Natl Acad. Sci. USA. 2008;105:6976. doi: 10.1073/pnas.0712149105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lee D, et al. Comparative genome-scale metabolic reconstruction and flux balance analysis of multiple Staphylococcus aureus genomes identify novel antimicrobial drug targets. J. Bacteriol. 2009;191:4015–4024. doi: 10.1128/JB.01743-08. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Letunic I, Bork P. Interactive tree of life (itol): an online tool for phylogenetic tree display and annotation. Bioinformatics. 2007;23:127–128. doi: 10.1093/bioinformatics/btl529. [DOI] [PubMed] [Google Scholar]
- Letunic I, Bork P. Interactive tree of life v2: online annotation and display of phylogenetic trees made easy. Nucleic Acids Res. 2011;39:W475–W478. doi: 10.1093/nar/gkr201. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liaw A, Wiener M. Classification and regression by randomforest. R News. 2002;2:18–22. [Google Scholar]
- Mazumdar V, et al. Metabolic network model of a human oral pathogen. J. Bacteriol. 2009;191:74–90. doi: 10.1128/JB.01123-08. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Navid A, Almaas E. Genome-scale reconstruction of the metabolic network in Yersinia pestis, strain 91001. Mol. BioSyst. 2009;5:368–375. doi: 10.1039/b818710j. [DOI] [PubMed] [Google Scholar]
- Nerima B, et al. Comparative genomics of metabolic networks of free-living and parasitic eukaryotes. BMC Genomics. 2010;11:217. doi: 10.1186/1471-2164-11-217. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nogales J, et al. Detailing the optimality of photosynthesis in Cyanobacteria through systems biology analysis. Proc. Natl Acad. Sci. USA. 2012;109:2678–2683. doi: 10.1073/pnas.1117907109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Oberhardt MA, et al. Applications of genome-scale metabolic reconstructions. Mol. Syst. Biol. 2009;5:320. doi: 10.1038/msb.2009.77. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ochman H, Moran NA. Genes lost and genes found: evolution of bacterial pathogenesis and symbiosis. Science. 2001;292:1096–1099. doi: 10.1126/science.1058543. [DOI] [PubMed] [Google Scholar]
- Oh Y, et al. Genome-scale reconstruction of metabolic network in Bacillus subtilis based on high-throughput phenotyping and gene essentiality data. J. Biol. Chem. 2007;282:28791–28799. doi: 10.1074/jbc.M703759200. [DOI] [PubMed] [Google Scholar]
- Parter M, et al. Environmental variability and modularity of bacterial metabolic networks. BMC Evol. Biol. 2007;7:169. doi: 10.1186/1471-2148-7-169. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pinchuk G, et al. Constraint-based model of Shewanella oneidensis mr-1 metabolism: a tool for data analysis and hypothesis generation. PLoS Comput. Biol. 2010;6:e1000822. doi: 10.1371/journal.pcbi.1000822. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Plata G, et al. Reconstruction and flux-balance analysis of the Plasmodium falciparum metabolic network. Mol. Syst. Biol. 2010;6:408. doi: 10.1038/msb.2010.60. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Poolman MG, et al. Challenges to be faced in the reconstruction of metabolic networks from public databases. Syst. Biol. 2006;153:379–384. doi: 10.1049/ip-syb:20060012. [DOI] [PubMed] [Google Scholar]
- Price ND, et al. Genome-scale models of microbial cells: evaluating the consequences of constraints. Nat. Rev. Microbiol. 2004;2:886–897. doi: 10.1038/nrmicro1023. [DOI] [PubMed] [Google Scholar]
- Reed J, et al. An expanded genome-scale model of Escherichia coli K-12 (ijr904 gsm/gpr) Genome Biol. 2003;4:R54. doi: 10.1186/gb-2003-4-9-r54. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schnoes AM, et al. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput. Biol. 2009;5:e1000605. doi: 10.1371/journal.pcbi.1000605. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Suthers P, et al. A genome-scale metabolic reconstruction of Mycoplasma genitalium, ips189. PLoS Comput. Biol. 2009;5:e1000285. doi: 10.1371/journal.pcbi.1000285. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ta HX, Holm L. Evaluation of different domain-based methods in protein interaction prediction. Biochem. Biophys. Res. Commun. 2009;390:357–362. doi: 10.1016/j.bbrc.2009.09.130. [DOI] [PubMed] [Google Scholar]
- Teusink B, et al. Analysis of growth of Lactobacillus plantarum WCFS1 on a complex medium using a genome-scale metabolic model. J. Biol. Chem. 2006;281:40041–40048. doi: 10.1074/jbc.M606263200. [DOI] [PubMed] [Google Scholar]
- Thomas G, et al. A fragile metabolic network adapted for cooperation in the symbiotic bacterium Buchnera aphidicola. BMC Syst. Biol. 2009;3:24. doi: 10.1186/1752-0509-3-24. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wagner A, Fell DA. The small world inside large metabolic networks. Proc. Biol. Sci. 2001;268:1803–1810. doi: 10.1098/rspb.2001.1711. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Widiastuti H, et al. Genome-scale modeling and in silico analysis of ethanologenic bacteria Zymomonas mobilis. Biotechnol. Bioeng. 2011;108:655–665. doi: 10.1002/bit.22965. [DOI] [PubMed] [Google Scholar]
- Yeh I, et al. Computational analysis of Plasmodium falciparum metabolism: organizing genomic information to facilitate drug discovery. Genome Res. 2004;14:917–924. doi: 10.1101/gr.2050304. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.