Significance
Systems biology involves the development of large computational models of biological systems. The radical improvement of systems biology models will necessarily involve the automation of model improvement cycles. We present here a general approach to automating systems biology model improvement. Humans are eukaryotic organisms, and the yeast Saccharomyces cerevisiae is widely used in biology as a “model” for eukaryotic cells. The yeast diauxic shift is the most studied cellular transformation. We combined multiple software tools with integrated laboratory robotics to execute three semiautomated cycles of diauxic shift model improvement. All the experiments were formalized and communicated to a cloud laboratory automation system (Eve) for execution. The resulting improved model is relevant to understanding cancer, the immune system, and aging.
Keywords: artificial intelligence, machine learning, diauxic shift
Abstract
One of the most challenging tasks in modern science is the development of systems biology models: Existing models are often very complex but generally have low predictive performance. The construction of high-fidelity models will require hundreds/thousands of cycles of model improvement, yet few current systems biology research studies complete even a single cycle. We combined multiple software tools with integrated laboratory robotics to execute three cycles of model improvement of the prototypical eukaryotic cellular transformation, the yeast (Saccharomyces cerevisiae) diauxic shift. In the first cycle, a model outperforming the best previous diauxic shift model was developed using bioinformatic and systems biology tools. In the second cycle, the model was further improved using automatically planned experiments. In the third cycle, hypothesis-led experiments improved the model to a greater extent than achieved using high-throughput experiments. All of the experiments were formalized and communicated to a cloud laboratory automation system (Eve) for automatic execution, and the results stored on the semantic web for reuse. The final model adds a substantial amount of knowledge about the yeast diauxic shift: 92 genes (+45%), and 1,048 interactions (+147%). This knowledge is also relevant to understanding cancer, the immune system, and aging. We conclude that systems biology software tools can be combined and integrated with laboratory robots in closed-loop cycles.
Systems biology presents an extreme challenge to the traditional human-based scientific method (1, 2). The fundamental difficulty is the high degree of complexity of biological systems, where even simple “model” systems such as Escherichia coli and Saccharomyces cerevisiae have thousands of genes, proteins, and small molecules all interacting together in complicated spatial-temporal ways. This biological complexity implies a need for a similar complexity, probably beyond human intuitive understanding, in the corresponding systems biology models.
In the development of systems biology models, biological knowledge is integrated to form a model, experiments are planned and executed to test the model, the experimental results are used to refine the model, new biological knowledge is generated, and the cycle repeated (1). To radically improve existing system biology models, it will be necessary to execute hundreds/thousands of such cycles of model improvement. However, little current research completes even a single cycle. We therefore argue that greater automation is required, which will in turn require the combination and integration of multiple systems biology software tools into closed-loop cycles with laboratory robotics.
To evaluate the integration of software tools and laboratory robotics for systems biology we selected as a test case the diauxic shift of the yeast S. cerevisiae. This is the standard model system for understanding eukaryotic cellular transformation, and it is relevant to understanding cancer (Warburg effect), the immune system, and aging. In S. cerevisiae growing in batch culture on glucose with aeration a diauxic shift is commonly observed: During the first growth phase, yeast metabolizes glucose using the fermentative Embden–Meyerhof pathway to produce ethanol (3); when the glucose is exhausted, it switches to a fully respiratory metabolism utilizing the tricarboxylic acid cycle and oxidative phosphorylation in the mitochondria (3). This transition requires the large-scale remodeling of the metabolic apparatus (4). However, despite being one of the most studied of all eukaryotic cellular transformations, the diauxic shift is still very poorly understood, and existing systems biology models of this transformation could be greatly improved.
We combined system biology software for data analysis, model formation, experiment generation, experiment execution, model refinement, systems biology modeling, bioinformatics, laboratory robotic control, and semantic web techniques to execute three cycles of diauxic shift model improvement (Fig. 1A). The wide range of software and tools required to achieve this are shown in Fig. 2. (CoRegNet and CoRegFlux are available in bioconductor. All of the other software is available on request at LIPN GitLab).
Results
Modeling of the yeast diauxic shift is especially challenging because of the complexity of the biology involved, and the need to include subsystems operating at different time scales, and serving different purposes (5). The modeling requires integration of (1) a model of control of metabolism (cell signaling), and (2) a genome scale model of metabolism (1). The key difference between gene regulatory/signaling and metabolic networks is that the former carry signal flows, whereas metabolic pathways generate mass flows. We modeled the metabolic network as a biochemical (mechanistic) network based on the stoichiometry and reversibility of the reactions involved. Specifically, we chose the iMM904 model (6), (SI Appendix 1). This model is widely used, its structure is suitable for integration with signaling, and it is the most accurate model available for predicting growth phenotypes (7).
For cell-signaling modeling we used a two-time slice dynamic Bayesian network (DBN) with conditional linear Gaussian parameters (Fig. 3A). We selected this form of model because: it belongs to a well-studied family of continuous models, is easily interpretable in terms of activation and repression effects, and they enable the inference of gene states from known states in a versatile way. Each node in the model corresponds to either a regulatory protein or an enzyme, the former being the only type of node allowed to have children in the network, (SI Appendix 2 and 3). The starting point for our cell-signaling model was the model of Geistlinger et al. (8), which was assembled by compiling the findings of hundreds of scientific articles. We extracted the regulatory part of the model (Mz) and integrated this with iMM904m (Fig. 3A). Mz is of high quality in terms of dependencies recall (small number of false positive links), but it is relatively incomplete (missing links). Mz is also optimistic in that it predicts the occurrence of diauxic shifts for almost all strains with gene/protein deletions in the model.
In the first cycle of model improvement, the initial step was application of the bioinformatic program CoRegNet (9, 10) to identify genes potentially involved in control of the diauxic shift. CoRegNet integrates information from microarray experiments, regulatory interactions from the YEASTRACT database, and the S. cerevisiae Kinase and Phosphatase Interactome resource. CoRegNet uses a cooperative network based on shared transcription factor targets to identify coregulatory relationships from gene expression data (SI Appendix 4). We then applied a two-step model refinement process to its output (Fig. 3B): (i) We applied the ensemble network inference algorithm ELSA (Ensemble Learning of Spanning Arborescences) (11) to the Brauer microarray dataset (4), with Mz as a learning prior on the model space composed of the union of the Mz regulatory genes, the top 40 transcription factors identified by CoRegNet, and the top 40 kinases identified by CoRegNet (SI Appendix 5); (ii) We then applied a forward selection step to add to Mz edges that improve gene state predictions on the Brauer microarray dataset, using leave 1 out cross-validation. This generated model M1 (Fig. 1B).
At the start of the second cycle, we used tools to design experiments to provide the maximum amount of information to optimize the improvement of Mz to form M1, see Fig. 4A. We developed two tools for this task. The first tool is AdactiveFB (active learning based), which compares estimated protein/gene states (forward simulation) with the most likely protein/gene states consistent with the observed growth and metabolite state (backward simulation) (SI Appendix 6). In forward simulation a standard simulation from genes to phenotypes is performed, using both regulatory and metabolic simulators (SI Appendix 3). This produces an estimated time series of states for each gene in the DBN—as Gaussian distributions means and SDs. These forward simulations are compared with backward ones, i.e., simulations using phenotypes evidence to infer gene states (SI Appendix 6). Due to the unavailability of inferred states for several genes, the method used for backward simulation is designed to deal with partial evidence. The result of backward simulation is a set of backward time series for all of the genes in the regulatory model—also as Gaussian distributed means and SDs. Kullback–Leibler divergence is then calculated between the forward and backward Gaussian distributions (using their means and SDs) for each gene and each time point. This generates a divergence value for each (gene, time) pair. The genes selected for knockout experiment are those with the highest node divergence values. The strength of the AdactiveFB approach is that it focuses directly on optimizing the model rather than using a proxy. Its current main weakness is that the observed growth curve is the only phenotype used to inform backward simulation. Growth curve experiments are relatively robust (12), but they are not highly informative. In the future we plan to include many more phenotypic experiments.
The second tool, CoRegMine, initially uses CoRegNet (10) to infer a graph in which the vertices are coregulators labeled according to their influence profile (13), and the edges relate predicted coregulators. This graph is then processed by the graph mining tool MinerLC* (14) to extract subgraphs, each consisting of coregulators with similar influence profiles. Subgraph pairs were selected with a) antagonistic influence profiles, and b) edges relating coregulators from the two subgraphs, suggesting differential regulation of their common targets during the diauxic shift. The coregulators identified in this manner were then selected for use in experiments (SI Appendix 7). The strength of this approach is its use of background knowledge. Its weakness is that it does not directly focus on improving the model.
We used the cloud laboratory robotics system Eve (15) to execute the experiments selected by AdactiveFB and CoRegMine. Eve executed two complementary types of automated experiment on selected yeast deletion strains: determinations of growth curves (Fig. 5A) and glucose consumption curves (Fig. 5B). The observed growth curves were preprocessed, normalized, and descriptive parameters calculated (12) (SI Appendix 8). Periodic colorimetric resorafin-based assays were used to track glucose levels in the culture medium.
The next step in the cycle of model improvement is to refine the model based on the obtained new experimental results. We developed the tool Adarev for this task. As Eve’s experiments do not directly observe the time-series of protein/gene states, these need to be inferred from observations of growth and limited metabolite states. The approach used by Adarev is built on the identification of a local curve error reduction improvements to an input model based on simulation vs. real growth curves. Computational model simulation is very costly in terms of computer time. A prior scoring step was therefore included to predict local changes to the model likely to be interesting. Model refinement was restricted to selecting the edges to be removed from the cell-signaling submodels, although the addition of edges is also possible with the algorithm. The main steps in the model refinement algorithm are shown in Fig. 4B. (SI Appendix 9 and 10).
In total, three closed-loop system biology cycles were executed. In the first cycle, the model Mz was semiautomatically improved using bioinformatic data to form M1. To assess the utility of cycle 1, we compared Mz and M1’s predictions with the empirical growth curves observed by Eve using a set of yeast gene deletant strains not used to form M1 (SI Appendix 9). The 192 strains selected for the experiments were taken from genes identified by CoRegNet as potentially involved in the diauxic shift (10) and randomly selected regulatory genes (kinases and transcription factors) (SI Appendix 12). The experimental results demonstrate that M1 is significantly better than Mz (Fig. 6).
The second and third closed-loop cycles differed from the first in including new planned experiments (Fig. 1A). In the second closed loop, inference tools were run to generate experiments, the experiments were executed, and the models were refined. Two sets of experiments were generated to improve M1: set (a) of 80 hypothesis-led experiments designed with our tools (AdactiveFB and CoRegMine), and set (b), consisting of 80 randomly selected experiments. Eve executed both sets of experiments. Model M1-smart was refined from M1 based on the results of hypothesis-led experiments, and model M1-random was refined from M1 based from the random experiments (Fig. 1 A and B). The motivation for generating two separate sets of experiments was to test the belief that hypothesis-led experiments (experiments designed to improve/test models) are more efficient in systems biology model development than random/high-throughput experiments (16). The M1-smart model has 298 nodes and 1,760 edges (Fig. 1B). We compared M1-smart and M1 using their predictions for 281 test strains. The results show that M1-smart is significantly more accurate at prediction than M1 (Fig. 6). We ensured the maximal improvement of M1-random and made the comparison between M1-random and M1-smart as rigorous as possible by selecting the 80 randomly selected genes from known yeast regulators (kinases and transcription factors) (SI Appendix 13). The M1-random model has 298 nodes and 1,778 edges (Fig. 1B). To compare M1 and M1-random, we applied their predictions on the same 281 test strains. M1-random was significantly better at prediction than M1 (Fig. 6).
In the third cycle, the M1-smart and M1-random models were compared by generation and execution of experiments. To generate the “crucial” experiments used to compare M1-smart and M1-random we applied the tool Adana to select 81 deletant strains with the largest predicted postshift growth rate disagreement between M1-smart and M1-random. We found that M1-smart was significantly better than M1-random (Fig. 6). We therefore concluded, as expected, that hypothesis-led experiments are more efficient at improving systems biology models than high-throughput/random experiments.
An essential part of systems biology is the analysis of new models to provide biological insight (1). Our most accurate model, M1-smart adds a substantial amount of knowledge about the yeast diauxic shift: 92 extra genes (+45%) and 1,048 interactions (+147%). We used the Adana tool to rank the genes in terms of relative importance in the M1-smart model. To evaluate the biological insight possible from these additions, and to illustrate the biological utility of the knowledge generated by the system, we selected two genes highly ranked in M1-smart, but absent from Mz: MRK1 and TIS11 (SI Appendix 14). MRK1 (YDL079C) is homologous to human protein kinase glycogen synthase kinase-3 (GSK-3). Fig. 7A shows the fragment of M1-smart incorporating MRK1. GSK-3 genes are highly conserved and ubiquitous in eukaryotes and involved in differentiation, cell fate determination, and spatial patterning (17). These two highly homologous isoforms have been implicated in type II diabetes (Diabetes mellitus type 2), Alzheimer’s disease, inflammation, cancer, and bipolar disorder (17).
TIS11 is a member of the 12-O-tetradecanoylphorbol-13-acetate inducible sequence 11 family. TIS11 genes are involved in posttranscriptional gene regulation by micro-RNA (miRNA) and short interfering RNA (siRNA) (18, 19). Note that RNA processing is not explicitly included in M1-smart, and TIS11 was automatically incorporated as a putative transcription factor based on its zinc finger motif. This illustrates a strength of automating systems biology modeling: a human biologist would have excluded TIS11, yet its inclusion proved interesting, highlighting a possibly important role for RNA processing in the diauxic shift. Fig. 7B shows the fragment of M1-smart incorporating TIS11. In humans, changes in TIS11 expression have been associated with both the suppression and promotion of cancer, and with autoimmune diseases (18).
Formal languages promote the reproducibility and reusability of results, and the exchange of information between human scientists and computer systems. We developed a suite of complementary ontologies to support the application of systems biology tools and their integration: (1) AdaLab-meta, an ontology for the description of metadata about datasets; (2) AdaLab, a domain ontology to represent relevant biological entities in systems biology; and (3) Eve-CV, a controlled vocabulary that defines typical Eve experiments and experimental conditions (SI Appendix 15). When combined these ontologies consist of ∼20,000 RDF (Resource Description Framework) triples. We collected and formalized in RDF all of the bioinformatic data used for this study to form a knowledge base of 1,301,017 RDF triples grouped in five separate RDF graphs: imported genes, genes annotations, genes expressions, Eve strains, and relevant metadata. The data are accessible via the linked data web interface (SI Appendix 15). We developed a dedicated communication mechanism SciCom (Scientific Communication) to communicate information about experiments to Eve. The requests for experiments and experimental results are stored in an RDF triple store in Manchester that consists of 10,187,417 RDF triples combined in two graphs.
Discussion
The fundamental motivation for studying the diauxic shift in yeast (S. cerevisiae) is that it serves as a model for transformation in human cellular systems. It is therefore important to consider how well the methods can be scaled up for use in mammalian systems. This scaling up entails two main challenges: ensuring the same experimental reproducibility as is achievable in yeast, and scaling up the computational methods. We consider experimental reproducibility to be the most difficult of these challenges (20). For the scaling up of computational methods, the different parts of the software pipeline have different sensitivities to an increase in input network size (SI Appendix 16), but all of the methods scale polynomially, implying that the increase in size and complexity associated with the move to mammalian systems should be tractable with our approach.
We have successfully combined multiple systems biology software tools and laboratory robotics to execute three cycles of improvement for a model of the yeast diauxic shift. The cycles were not fully automated, as in the Robot Scientists Adam (12) and Eve (15), as the automation of systems biology is very much more complicated. However, full automation will be necessary to execute the hundreds or thousands of model improvement cycles required. The achievement of this full automation will require the software tools to be more robust and more modularly designed. Many of the software tools we have used are based on techniques originating in artificial intelligence (AI), especially machine learning: CoRegNet, CoRegMine, ELSA, ActiveFB, MinerLC*, Adarev, and Adana (Fig. 2). However, more advanced ideas from AI will be required to improve performance (21). For example, the tools have no high-level understanding of what they are doing, just as chess programs do not known that they are playing chess. One approach to providing them with such an understanding would be to give the system high-level goals to achieve, along with a higher-level planning ability. Another fundamental enhancement would be to give the AI tools the ability to communicate goals and intentions to human scientists.
In conclusion, we foresee a future in which combinations of software tools, laboratory automation, and human scientists will work together to create systems biology models that fully reflect and predict the underlying biology.
Supplementary Material
Acknowledgments
We received support from the CHIST-ERA AdaLab project: The Engineering and Physical Sciences Research Council (EPSRC), UK (EP/M015661/1), ANR-14-CHR2-0001-01.
Footnotes
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1900548116/-/DCSupplemental.
References
- 1.Alberghina L., Westerhoff H. V., Systems Biology: Definitions and Perspectives (Springer-Verlag, 2005). [Google Scholar]
- 2.Omenn G. S., Grand challenges and great opportunities in science, technology, and public policy. Science 314, 1696–1704 (2006). [Google Scholar]
- 3.Dickinson J. R., “Carbon metabolism” in The Metabolism and Molecular Physiology of Saccharomyces cerevisiae, Dickinson J. R., Schweizer M., Eds. (Taylor & Francis, Philadelphia, PA, 1999), pp. 42–103. [Google Scholar]
- 4.Brauer M. J., et al. , Coordination of growth rate, cell cycle, stress response, and metabolic activity in yeast. Mol. Biol. Cell 19, 352–367 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Banos D. T., Trébulle P., Elati M., Integrating transcriptional activity in genome-scale models of metabolism. BMC Syst. Biol. 11 (suppl. 7), 134 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Mo M. L., Palsson B. O., Herrgård M. J., Connecting extracellular metabolomic measurements to intracellular flux states in yeast. BMC Syst. Biol. 3, 37 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Heavner B. D., Price N. D., Comparative analysis of yeast metabolic network models highlights progress, opportunities for metabolic reconstruction. PLoS Comput. Biol. 11, e1004530 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Geistlinger L., Csaba G., Dirmeier S., Küffner R., Zimmer R., A comprehensive gene regulatory network for the diauxic shift in Saccharomyces cerevisiae. Nucleic Acids Res. 41, 8452–8463 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Elati M., et al. , LICORN: Learning cooperative regulation networks from gene expression data. Bioinformatics 23, 2407–2414 (2007). [DOI] [PubMed] [Google Scholar]
- 10.Nicolle R., Radvanyi F., Elati M., CoRegNet: Reconstruction and integrated analysis of co-regulatory networks. Bioinformatics 31, 3066–3068 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Coutant A., Rouveirol C., “Network inference of dynamic models by the combination of spanning arborescences” in Journées Ouvertes en Biologie, Informatique et Mathématiques (JOBIM 2017). https://cdjemiel.github.io/pdf/2017-jobim-actes.pdf. Accessed 12 August 2019.
- 12.King R. D., et al. , The automation of science. Science 324, 85–89 (2009). [DOI] [PubMed] [Google Scholar]
- 13.Nicolle R., Elati M., Radvanyi F., “Network transformation of gene expression for feature extraction” in 11th International Conference on Machine Learning and Applications (ICMLA). https://www.researchgate.net/publication/235898031_Network_Transformation_of_Gene_Expression_for_Feature_Extraction. Accessed 12 August 2019.
- 14.Soldano H., Santini G., Bouthinon D., “Formal concept analysis of attributed networks” in Formal Concept Analysis in Social Network Analysis, Lecture Notes in Social Networks, Missaoui R., Obiedkov S., Kuznetsov S., Eds. (Springer, 2017), pp. 143–170. [Google Scholar]
- 15.Williams K., et al. , Cheaper faster drug development validated by the repositioning of drugs against neglected tropical diseases. J. R. Soc. Interface 12, 20141289 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Medawar P. B., Advice to a Young Scientist (Basic Books, 1981). [Google Scholar]
- 17.Ali A., Hoeflich K. P., Woodgett J. R., Glycogen synthase kinase-3: Properties, functions, and regulation. Chem. Rev. 101, 2527–2540 (2001). [DOI] [PubMed] [Google Scholar]
- 18.Baou M., Jewell A., Murphy J. J., TIS11 family proteins and their roles in posttranscriptional gene regulation. J. Biomed. Biotechnol. 2009, 634520 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Ma Q., Herschman H. R., The yeast homologue YTIS11, of the mammalian TIS11 gene family is a non-essential, glucose repressible gene. Oncogene 10, 487–494 (1995). [PubMed] [Google Scholar]
- 20.Baker M., 1,500 scientists lift the lid on reproducibility Nature 533, 452–454 (2016). [DOI] [PubMed] [Google Scholar]
- 21.Kitano H., Artificial intelligence to win the Nobel Prize and beyond: Creating the engine of scientific discovery. AI Mag. 37, 39–49 (2016). [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.