PhenoMapping: a protocol to map cellular phenotypes to metabolic bottlenecks, identify conditional essentiality, and curate metabolic models

Anush Chiappino-Pepe; Vassily Hatzimanikatis

doi:10.1016/j.xpro.2020.100280

2021 Jan 22;2(1):100280. doi: 10.1016/j.xpro.2020.100280


PMCID: PMC7829271 PMID: 33532729

Summary

Targeted identification of cellular processes responsible for a phenotype is of major importance in guiding efforts in bioengineering and medicine. Genome-scale metabolic models (GEMs) are widely used to integrate various types of omics data and study the cellular physiology under different conditions. Here, we present PhenoMapping, a protocol that uses GEMs, omics, and phenotypic data to map cellular processes and observed phenotypes. PhenoMapping also classifies genes as conditionally and unconditionally essential and guides a comprehensive curation of GEMs.

For complete details on the use and execution of this protocol, please refer to Stanway et al. (2019) and Krishnan et al. (2020).

Subject areas: Bioinformatics, Metabolism

Highlights

• Systematic identification of cellular processes causing phenotypes
• Decoding nutrient usage from genetic screens, as shown in two parasites
• Curation of two genome-scale models leads to 80% accuracy in essentiality predictions
• Classification of conditional essentiality will guide drug targeting strategies

Before you begin

In this first section, we present a brief relation of PhenoMapping to prior art and the preparatory steps to perform a PhenoMapping analysis (Figure 1). We discuss a set of decisions to make and how they impact the subsequent PhenoMapping analysis. In addition, we introduce the input data needed and how to set up the GEM and software. To adapt to all ranges of expertise in metabolic modeling, we provide links to the troubleshooting section, where we describe technical details to perform the related steps. We present as an example the application of these preparatory steps to Plasmodium berghei in the section Expected outcomes. These steps were applied similarly in Toxoplasma gondii and are generalizable to any organism and study case. For examples of studies validating the results and insights obtained following the PhenoMapping protocol, please refer to Stanway et al., 2019 and Krishnan et al., 2020.

Figure 1. Preparatory steps for a PhenoMapping analysis

The color code is consistent with the related steps in the main PhenoMapping workflow. We include an approximate assessment of the time each step takes.

Relation of method to prior art

Identifying cellular processes responsible for a phenotype is especially complex and relevant in biological systems. Genome-scale models (GEMs) are widely used to integrate all available biochemical information of an organism and various types of omics data to study metabolic function under different conditions. The protocol described here builds on three decades of method development to construct and analyze GEMs. It complements available protocols (Thiele and Palsson, 2010) and tools (Agren et al., 2013; Devoid et al., 2013; Heirendt et al., 2019; Lieven et al., 2020; Machado et al., 2018; Salvy et al., 2018; Wang et al., 2018) for high-quality reconstruction and analysis of GEMs. This protocol provides a systematic guideline to identify cellular bottlenecks underlying phenotypes. It also describes how to use the knowledge about metabolic bottlenecks toward the understanding of conditional essentiality and curation of GEMs, as recently shown (Krishnan et al., 2020; Stanway et al., 2019).

Alternative methods to increase the predictive accuracy of GEMs include automated approaches like AMMEDEUS (Medlock and Papin, 2020), GrowMatch (Kumar and Maranas, 2009), or GlobalFit (Hartleb et al., 2016), and less automated ones like RING (Sohn et al., 2012). The solutions provided by these methods may remain limited to the physico-chemical constraints integrated into the GEM. This protocol suggests a systematic classification and evaluation of such physico-chemical constraints for the identification and curation of a broader range of knowledge gaps in GEMs. Moreover, through the systematic classification and analysis of bottlenecks defined in this protocol, one gains important biological insights, such as substrates linked to conditional gene essentiality, that had remained rather unexplored in silico so far. Currently, tools like COBRA (Heirendt et al., 2019), RAVEN (Wang et al., 2018), modelSEED (Devoid et al., 2013), KBase (US Department of Energy Systems Biology Knowledgebase, http://kbase.us), CarveMe (Machado et al., 2018), and TFA (Salvy et al., 2018) are widely used to construct and analyze GEMs. PhenoMapping, as defined in this protocol and accompanying GitHub repository (www.github.com/EPFL-LCSB/phenomapping), can be applied in combination with any of those tools. This protocol provides rigorous details for a PhenoMapping study design, integrative analysis using omics and phenotypic data, and comprehensive evaluation of results.

Organism and cellular state choice

Timing: 1–10 min

1.
Select an organism and strain or cell line of interest.
2.
Select the cellular state(s) of interest.
- a.
  Select a life-stage (if applicable).
- b.
  Select a specific time point in the life-stage.

CRITICAL: the cellular state selected will determine the metabolic state of your organism or cell, which further restricts the gathering of data (see section on Phenotypic, media, and omics data collection) and selection of a cellular objective (see section Cellular objective definition) in the PhenoMapping analysis. We recommend the “safe and easy” selection of a highly metabolically active state for which the cellular objective can be represented as a “desire to maximize growth.”

Metabolic model selection

Timing: ~1 day

Note: the time spent to select a GEM varies dramatically depending both on the experience of the user with the organism of study and analysis of GEMs, and on the availability and quality of the GEMs.

3.
Search in databases like BiGG (King et al., 2015), modelSEED (Henry et al., 2010), KBase (US Department of Energy Systems Biology Knowledgebase, http://kbase.us), LCSB database (LCSB, 2020), etc. or publications for a GEM of your organism of interest.

CRITICAL: if no GEM exists for your organism and strain, you may want to follow standard protocols to construct one; Troubleshooting 1. If multiple GEMs exist, you need to select one for the subsequent analysis; Troubleshooting 2. Any additional analysis and evaluation of the GEMs before PhenoMapping is optional and extends the time required.

4.
Select a GEM for your organism and strain of interest.

Note: in the subsequent steps the GEM will be thermodynamically curated, prepared, and initialized to be ready for a PhenoMapping analysis.

Thermodynamic curation

Timing: ~1 day

Note: the time spent to thermodynamically curate a GEM varies considerably depending on the tools used to perform this task. In addition, the time will vary based on the extent to which the user wants to a posteriori evaluate and curate the performance of the GEM under thermodynamic constraints. If no systematic tool is used to curate the GEM thermodynamically, variables like GEM size will affect the timing too.

5.
Include all thermodynamic data in the GEM as necessary within the TFA framework (Henry et al., 2006, 2007; Jankowski et al., 2008; Salvy et al., 2018) to perform a thermodynamically consistent flux balance analysis; Troubleshooting 3.
6.
Verify the GEM has all fields required to be thermodynamically curated in a systematic fashion following the steps defined in the section Software setup.

Phenotypic, media, and omics data collection

Timing: 5–8 h

Note: the time spent in a literature search and mapping of the data to the GEM greatly varies depending on the organism of study, amount of data available, and automatization of the mapping.

7.
Get phenotypic data for the organism and cellular state to study.
- a.
  List all genes in the GEM and map the collected data of in vivo phenotypes (e.g., essential or dispensable upon single gene knockout).
8.
Gather information about the media composition at the cellular state.
- a.
  List all extracellular metabolites in the GEM and map the available information about the availability of each metabolite at the cellular state.
9.
Assemble available metabolomics data for the organism and cellular state to study.
- a.
  List the set of metabolites included in the GEM and map the corresponding concentration ranges (minimum and maximum absolute values).
10.
Assemble available RNA-seq or proteomics datasets for the organism and cellular state to study.
- a.
  List all genes in the GEM and map the corresponding and unique RNA or protein level.

Medium definition

Timing: 1–10 min

11.
Select a media composition for the PhenoMapping analyses.

CRITICAL: we recommend selecting a rich medium at this point. We define a medium as rich when a broad range of metabolites (if possible, all extracellular metabolites) can be taken up by the GEM. Selecting a rich medium at this stage allows PhenoMapping to map substrate availability to essentiality (conditional essentiality). PhenoMapping can only map conditionally essential genes, and the substrates responsible for their essentiality, if those substrates are made available (can be taken up) at this step (see section Metabolic model contextualization).

12.
Define the medium composition in the GEM with the desired maximum uptake rates allowed; Troubleshooting 4.
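As a rough illustration of steps 11 and 12, the sketch below turns media annotations into uptake bounds for a rich medium. It is written in Python for brevity and is not the PhenoMapping (MATLAB) interface. The function name, the reaction identifiers, the default maximum uptake rate, and the convention that uptake is a negative flux through an exchange reaction are all assumptions made for this example.

```python
# Illustrative sketch only; this is not the PhenoMapping (MATLAB) interface.
# Assumption: uptake is a negative flux through an exchange reaction, so the
# lower bound of each exchange reaction caps the uptake rate.

LABELS = ("available", "non-available", "unknown")


def rich_medium_bounds(availability, max_uptake=10.0, block_absent=False):
    """Return {exchange_reaction: lower_bound} for a rich medium.

    availability maps each exchange reaction to one of LABELS.
    max_uptake is a hypothetical maximum uptake rate (e.g., mmol/gDW/h).
    With block_absent=True, metabolites labeled 'non-available' are closed.
    """
    bounds = {}
    for rxn, label in availability.items():
        if label not in LABELS:
            raise ValueError(f"unexpected label for {rxn}: {label!r}")
        closed = block_absent and label == "non-available"
        bounds[rxn] = 0.0 if closed else -abs(max_uptake)
    return bounds


# Hypothetical exchange reactions and media annotations.
media = {"EX_glc": "available", "EX_hb": "non-available", "EX_chol": "unknown"}
bounds = rich_medium_bounds(media)
```

By default every extracellular metabolite stays open, which matches the recommendation to start from a rich medium so that substrates can later be mapped to conditional essentiality.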

Genetic background and essentiality definition

Timing: 1–5 min

13.
Select the type of essentiality analysis to perform.
- a.
  Single gene knockout.
- b.
  Single reaction knockout.
- c.
  Multiple gene knockout, or multiple reaction knockout.

Note: by default, PhenoMapping performs single gene knockout, because the phenotypic data available are normally single gene knockout data. If a single gene knockout analysis is selected, PhenoMapping will map bottlenecks to individual essential genes. A single gene knockout analysis is normally preferred to a multiple knockout analysis because it is faster (given the currently available methods to perform unbiased double or multiple gene knockout analysis in silico). In this protocol, we refer to single essential genes. The same concepts apply to any set of essential genes or reactions identified in silico as decided at this step.

14.
Decide the genetic background of the in silico organism (GEM) on which the PhenoMapping analysis will be performed.
- a.
  Keep the wild-type genetic background to perform PhenoMapping analyses of single essential genes.
- b.
  Define in silico deletion strains, i.e., with a deleted gene or reaction or multiple deleted genes or reactions (even if that does not match the genetic background of the organism chosen) to efficiently identify bottlenecks of sets of redundant genes.

Note: a scenario with a knockout background might be desired when one knows that a gene is part of a synthetic lethal pair and one desires to map bottlenecks to the synthetic lethal pair. This strategy will not require a double knockout analysis within PhenoMapping, which is more time consuming and computationally expensive.

15.
Define an essentiality threshold or a percentage of the optimal value of the objective function. The knockout that renders a value of objective function below this threshold is considered as lethal (see Essentiality prediction section).
- a.
  PhenoMapping uses by default an essentiality threshold of 0.1, which indicates that every gene whose knockout leads to a growth reduction of 90% or more with respect to a reference value (normally the wild-type growth) is considered essential.
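The threshold rule in step 15 can be sketched as follows. This is a minimal Python illustration, not the PhenoMapping implementation. The function name and the example growth values are hypothetical, and counting a knockout that sits exactly at the threshold as essential is an assumption based on the "90% or more" wording.

```python
def classify_knockouts(ko_growth, reference_growth, threshold=0.1):
    """Label each knockout as 'essential' or 'dispensable'.

    ko_growth maps a gene to the optimal objective value after its knockout.
    reference_growth is the reference value (normally wild-type growth).
    A knockout is essential when growth drops to threshold * reference or
    less, i.e., a reduction of 90% or more for the default threshold of 0.1.
    """
    if reference_growth <= 0:
        raise ValueError("reference growth must be positive")
    labels = {}
    for gene, growth in ko_growth.items():
        ratio = growth / reference_growth
        labels[gene] = "essential" if ratio <= threshold else "dispensable"
    return labels


# Hypothetical knockout growth rates relative to a wild-type growth of 1.0.
predicted = classify_knockouts(
    {"geneA": 0.0, "geneB": 0.05, "geneC": 0.5, "geneD": 1.0},
    reference_growth=1.0,
)
```

Here geneA and geneB fall below 10% of the reference growth and are labeled essential, while geneC and geneD are labeled dispensable.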

Cellular objective definition

Timing: 1–5 min

Note: the time spent to define the objective function varies considerably depending on three main factors: the type of objective function chosen; the feasibility of the GEM for the given objective function under the defined conditions; and the experience of the user to accurately formulate the desired objective function. Difficulties in any of these points can expand the timing to days and weeks.

16.
Select and define in the GEM an objective function that represents the cellular objective at the state of study (Schuetz et al., 2007). By default, PhenoMapping uses maximization of growth as the objective function (Feist and Palsson, 2010); Troubleshooting 5.
17.
Verify that it is possible to obtain a solution for the objective function selected in the medium and genetic background defined; Troubleshooting 6.

Accuracy metric definition

Timing: 1–5 min

18.
Familiarize yourself with the description of knockouts based on the predicted outcome using the GEM:
- a.
  Positives: the GEM predicts little or no effect on wild-type growth (positive growth) upon knockout of the gene.
- b.
  Negatives: the GEM predicts a negative effect on wild-type growth upon knockout of the gene.
19.
Familiarize yourself with a contingency matrix for the comparison of predictions and data, which includes the following definitions:
- a.
  True positives (TP): dispensable both in silico and in vivo.
- b.
  True negatives (TN): essential both in silico and in vivo.
- c.
  False positives (FP): dispensable in silico and essential in vivo.
- d.
  False negatives (FN): essential in silico and dispensable in vivo.

20.
Select a metric to assess the accuracy of your GEM in the essentiality prediction. These metrics can be systematically computed within PhenoMapping (see section Expected outcomes):
- a.
  Matthews correlation coefficient (MCC):
  $\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
- b.
  Overall accuracy:
  $\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$
- c.
  Negative prediction rate (NPR):
  $\mathrm{NPR} = \frac{TN}{TN + FN}$
- d.
  Positive prediction rate (PPR):
  $\mathrm{PPR} = \frac{TP}{TP + FP}$
- e.
  Sensitivity:
  $\mathrm{Sensitivity} = \frac{TP}{TP + FN}$
- f.
  Specificity:
  $\mathrm{Specificity} = \frac{TN}{TN + FP}$

Note: the MCC and overall accuracy tend to be used as the main metrics of accuracy assessment. However, they assume that incorrect predictions, i.e., FPs and FNs, are equally “bad,” and this is arguably not the case. Given a reliable and high-quality dataset of phenotypes, one would normally prefer to work with a GEM that has fewer FNs than FPs. Such a GEM is not over-constrained, so it does not incorrectly predict essential genes, but it lacks a set of constraints needed to increase the number of true essentiality predictions.

For example, when one uses a highly curated life-stage agnostic GEM, one expects that it already contains all biochemical information available about the organism. To make it life-stage specific, we should a priori only add physico-chemical constraints that are context-specific. These constraints should increase the number of correctly identified conditionally essential genes, while not increasing the incorrect essentiality predictions (FNs). This means, to generate context-specific GEMs we may try to increase the positive prediction rate while keeping the negative prediction rate constant.
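The contingency definitions of step 19 and the metrics of step 20 can be sketched in a few lines. This is an illustrative Python version, not the PhenoMapping code. The function names and the five example genes are hypothetical, and "slow" phenotypes are assumed to have been mapped to one of the two labels beforehand.

```python
import math


def contingency(predicted, observed):
    """Count TP, TN, FP, FN for genes present in both dictionaries.

    predicted holds in silico labels and observed holds in vivo labels;
    every label must be 'essential' or 'dispensable'.
    """
    key = {
        ("dispensable", "dispensable"): "TP",
        ("essential", "essential"): "TN",
        ("dispensable", "essential"): "FP",
        ("essential", "dispensable"): "FN",
    }
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for gene in predicted.keys() & observed.keys():
        pair = (predicted[gene], observed[gene])
        if pair not in key:
            raise ValueError(f"unexpected labels for {gene}: {pair}")
        counts[key[pair]] += 1
    return counts


def ratio(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def accuracy_metrics(c):
    """Compute the six metrics of step 20 from contingency counts."""
    tp, tn, fp, fn = c["TP"], c["TN"], c["FP"], c["FN"]
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "NPR": ratio(tn, tn + fn),
        "PPR": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Hypothetical predictions (in silico) and phenotypes (in vivo).
in_silico = {"g1": "dispensable", "g2": "dispensable", "g3": "essential",
             "g4": "essential", "g5": "dispensable"}
in_vivo = {"g1": "dispensable", "g2": "essential", "g3": "essential",
           "g4": "dispensable", "g5": "dispensable"}
counts = contingency(in_silico, in_vivo)
scores = accuracy_metrics(counts)
```

For this toy dataset the counts are TP = 2, TN = 1, FP = 1, and FN = 1, giving an overall accuracy of 0.6 and an MCC of 1/6.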

Data setup

Timing: 1 h

This section includes a suggestion on the setup of data within the PhenoMapping repository. The data were set up in this format for the example scripts and PhenoMapping analyses of the Plasmodium berghei metabolic model (iPbe) (Stanway et al., 2019) and Toxoplasma gondii metabolic model (iTgo) (Krishnan et al., 2020).

21.
GEM setup
- a.
  Save the GEM in MATLAB (.mat) format in the folder “models” of the PhenoMapping directory.
- b.
  Generate a folder with your model name in the PhenoMapping subfolder “tests/ref.”
22.
Phenotypic data setup
- a.
  Format the phenotypic data in a 2-column csv file: gene names (column 1) and observed phenotype upon knockout (column 2).
- b.
  Store the csv file with the phenotypic data in the PhenoMapping subfolder tests/ref/modelname.

CRITICAL: verify gene identifiers match those included in the model in the field “genes.” If not available in the GEM, a warning will state that these genes are not in the GEM and the data will not be considered for further analysis.

23.
Media data setup
- a.
  Format the media data in a 2-column csv file: metabolite names or identifiers (column 1) and information about their availability or uptake (column 2).
- b.
  Store the csv file with media data in the PhenoMapping subfolder “tests/ref/modelname.”

CRITICAL: verify metabolite identifiers match those included in the model in the field “mets” or “metNames.” If not available in the GEM, a warning will state that these metabolites are not present in the GEM and the data will not be integrated or considered for further analysis.

24.
Metabolomics data setup
- a.
  Format the metabolomics data as a 3-column csv file: names of metabolites for which concentration data are available (column 1) and concentration data formatted as explained below in columns 2 and 3.
- b.
  Compute the minimum (average minus standard deviation) and maximum (average plus standard deviation) values of concentration measured.
- c.
  Convert the units of the metabolomics data into mol/L_cell.
- d.
  Add the concentration values to the csv file: minimum concentration values (column 2) and maximum concentration values (column 3).
- e.
  Store the csv file with metabolomics data in the PhenoMapping subfolder tests/ref/modelname.

CRITICAL: verify metabolite identifiers match those included in the model in the field “mets” or “metNames.” If not available in the GEM, a warning will state that these metabolites are not present in the GEM and the data will not be integrated or considered for further analysis.

25.
Transcriptomics or proteomics data setup
- a.
  Format the transcriptomics or proteomics data in a 2-column csv file: gene names (column 1) and a unique value of measured RNA or protein level (column 2).
  Note: the units of the RNA-seq or proteomics measurements are not relevant at this point as long as all RNAs or proteins measured share the same units. This is because the GEMs used in PhenoMapping do not integrate the concentration of RNAs or proteins as variables. PhenoMapping will evaluate the distribution of RNA or protein levels across all genes in the GEM and will discretize this distribution into three groups using TEX-FBA (Pandey et al., 2019): lowly expressed, medium expressed, and highly expressed (see the section describing the transcriptomics analysis and TEX-FBA parameters definition).
- b.
  Store the csv file with transcriptomics or proteomics data in the PhenoMapping subfolder tests/ref/modelname.
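The identifier check described in the CRITICAL notes and the concentration ranges of step 24 can be prototyped before running the settings script. The sketch below is plain Python and independent of the PhenoMapping repository. The metabolite identifiers, the measured values, and the function names are hypothetical, and the input is assumed to hold a mean and a standard deviation per metabolite, already converted to mol/L_cell.

```python
import csv
import io


def split_by_model_ids(rows, model_ids):
    """Separate rows whose first column is a known model identifier."""
    known = set(model_ids)
    kept = [row for row in rows if row[0] in known]
    missing = [row[0] for row in rows if row[0] not in known]
    return kept, missing


def concentration_range(mean, sd):
    """Minimum and maximum concentration as mean minus/plus one sd."""
    return mean - sd, mean + sd


# Hypothetical measurements: metabolite, mean, sd (mol/L_cell).
raw = "atp_c,3.0e-3,0.5e-3\nxyz_c,1.0e-4,0.2e-4\n"
model_mets = ["atp_c", "adp_c"]  # stand-in for the model field "mets"

rows = list(csv.reader(io.StringIO(raw)))
kept, missing = split_by_model_ids(rows, model_mets)
for name in missing:
    print(f"warning: {name} is not in the model and will be ignored")

# Build the 3-column table: metabolite, minimum, maximum.
out = io.StringIO()
writer = csv.writer(out)
for name, mean, sd in kept:
    lo, hi = concentration_range(float(mean), float(sd))
    writer.writerow([name, lo, hi])
table = out.getvalue()
```

The same identifier check applies to the gene column of the phenotypic and expression files, using the model field “genes” instead of “mets.” If a standard deviation exceeds its mean, the computed minimum becomes negative and needs separate handling before it is used as a concentration bound.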

Software setup

Timing: 5–30 min

This section includes a suggestion on the setup of paths and preprocessing of the GEM for a PhenoMapping analysis using a sample script. There are many alternatives, and some are discussed in more detail in the tutorials script within the PhenoMapping repository and in this protocol in the Troubleshooting section.

Note: PhenoMapping requires MATLAB, CPLEX, and the GitHub repositories matTFA, TEX-FBA, and PhenoMapping. Links to these have been included in the Materials and equipment section.

26.
Prepare a settings script using as reference the templates provided in the PhenoMapping subfolder tests, i.e., settings_ipbeblood.m, settings_ipbeliver.m, settings_itgo.m.
- a.
  Provide paths, file names, and variable names for the GEM and data to be used in PhenoMapping.
- b.
  Select whether the GEM should be thermodynamically curated. This is an input of the initTestPhenoMappingModel function.
27.
Run the settings script to (1) verify that all paths to matTFA, TEX-FBA, and CPLEX are found, (2) check that all data files are found, and (3) preprocess the GEM for PhenoMapping analysis.

CRITICAL: this step will highlight any problem to add paths to CPLEX, matTFA, or TEX-FBA. This step will also spot any missing information or field in the GEM as required for PhenoMapping; Troubleshooting 7.

Key resources table

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited data

P. berghei relative growth rate phenotypes in blood stages	Bushell et al., 2017	https://doi.org/10.1016/j.cell.2017.06.030
P. berghei relative growth rate phenotypes in liver stages	Stanway et al., 2019	https://doi.org/10.1016/j.cell.2019.10.030
P. berghei RNA-seq data in blood stages	Otto et al., 2014	https://doi.org/10.1186/s12915-014-0086-0
P. berghei RNA-seq data in liver stages	Caldelari et al., 2019	https://doi.org/10.1186/s12936-019-2968-7
Compiled metabolomics dataset from P. falciparum	Chiappino-Pepe et al., 2017	https://doi.org/10.1371/journal.pcbi.1005397
GEM of P. berghei iPbe	Stanway et al., 2019	https://doi.org/10.1016/j.cell.2019.10.030
T. gondii relative growth rate phenotypes in tachyzoites	(Sidik et al., 2016)	https://doi.org/10.1016/j.cell.2016.08.019
T. gondii tachyzoite RNA-seq data	(Hehl et al., 2015); ToxoDB v45	www.toxodb.org
GEM of T. gondii iTgo	Krishnan et al., 2020	https://doi.org/10.1016/j.chom.2020.01.002

Software and algorithms

PhenoMapping	www.github.com/EPFL-LCSB/phenomapping	1.0
TEX-FBA	www.github.com/EPFL-LCSB/texfba	1.0
matTFA	www.github.com/EPFL-LCSB/matTFA	1.0
MATLAB	Mathworks (https://www.mathworks.com/products/matlab.html)	R2016a - R2019a
CPLEX	https://www.ibm.com/analytics/cplex-optimizer	12.8
COBRA Toolbox (used updated version within matTFA)	www.github.com/EPFL-LCSB/matTFA	1.0


Materials and equipment

Software

MATLAB (MathWorks: https://www.mathworks.com/products/matlab.html)

Alternatives: While the current implementation of PhenoMapping is in MATLAB, the PhenoMapping rationale and workflow are extendable to any other programming language, such as Python.

CPLEX (IBM: https://www.ibm.com/analytics/cplex-optimizer.html)

Alternatives: While the current implementation of PhenoMapping uses CPLEX, other available solvers like Gurobi could be implemented and used.

matTFA (GitHub: www.github.com/EPFL-LCSB/matTFA)

Alternatives: A Python implementation of matTFA exists and is called pyTFA (Salvy et al., 2018).

TEX-FBA (GitHub: www.github.com/EPFL-LCSB/texfba)

Alternatives: While the current implementation of TEX-FBA is in MATLAB, the TEX-FBA formulation is extendable to any other programming language, such as Python.

PhenoMapping (GitHub: www.github.com/EPFL-LCSB/phenomapping)

CRITICAL: PhenoMapping only requires CPLEX, matTFA, and TEX-FBA to be on the MATLAB path for optimization analysis. Further instructions on how to optimally handle paths in PhenoMapping are available in Troubleshooting 7.

Data

The data used in PhenoMapping are collected from separate studies. Here, we summarize data types for which a PhenoMapping analysis is currently implemented. We define two types of data depending on their use in PhenoMapping: data type 1 involves phenotypic data used for a comparison with the GEM predictions to assess the accuracy of the GEM; and data type 2 involves other datasets like omics data and media data integrated into the GEM to contextualize it. None of the datasets but the GEM is truly essential since one can perform a purely in silico analysis of phenotypes and bottlenecks with a metabolic model in PhenoMapping. However, broader biological insights are achieved when experimental data are integrated into the PhenoMapping pipeline. We define how essential each dataset is (++++, necessary; +++, strong; ++, medium; +, low) for a PhenoMapping analysis, and the suggested labels or values for the data.

Data (PhenoMapping data type)	Degree of requirement	Suggested labels
GEM	++++	MATLAB format with standard fields defined in constraint-based modeling
Phenotypic data (data type 1)	+++	essential; non-essential; (slow^a)
Media data (data type 2)	+	available; non-available; (unknown)
Thermodynamic data (data type 2)	++	Thermodynamically curated GEM (Salvy et al., 2018)
Metabolomics data (data type 2)	++	Absolute values (mol/L_cell). Minimum and maximum measured or allowed concentration values per metabolite
RNA-seq data (data type 2)	++	TPMs or absolute values. Unique RNA level per gene


^a Some experimental datasets like those obtained for the blood and liver stages of the Plasmodium development (Bushell et al., 2017; Stanway et al., 2019) may include “slow” phenotypes. These genes might be considered as essential or dispensable in PhenoMapping depending on the GEM context, as explained in the next sections.

CRITICAL: PhenoMapping identifies metabolic bottlenecks responsible for the essentiality of a gene. PhenoMapping can map bottlenecks to in silico essential genes but stronger biological insights can be obtained when experimentally observed essential genes or phenotypic data (data type 1) are used in the pipeline.

Alternatives: To study context-specific functions of the cell, PhenoMapping integrates omics and media data (data type 2) into a GEM. If no omics data are available, PhenoMapping will only map context-specific essential genes to substrate availability. If omics data are available, PhenoMapping will identify which minimum alternative sets of concentration levels (as measured in the omics datasets) can explain an observed gene essentiality or phenotype (data type 1).

Step-by-step method details

To enable identification of cellular processes underlying phenotypes, PhenoMapping leverages all available biochemical information of an organism as integrated into a GEM, as well as omics (e.g., metabolomic, transcriptomic) and phenotypic data in one or more conditions or life stages. These measurements are used along with the metabolic model of the organism of interest to study context-specific metabolic function and essentiality, identify sets of conditions that explain phenotypes, and, if necessary, further curate the GEM. Comparing essentiality predictions with phenotypic data allows assessment of the accuracy of the GEM. Some of the steps have been extensively described previously, and some were recently first introduced (Chiappino-Pepe et al., 2017; Stanway et al., 2019). Here, we present a comprehensive protocol describing the proper and practical integration of all relevant PhenoMapping steps, as well as advice on checks and troubleshooting, to allow efficient and accurate analysis of the origin of phenotypes and curation of GEMs (Figure 2). We define both setup and analysis steps. In a setup step, we conceptualize a study or perform changes in the GEM that do not involve any analysis. These steps are GEM- and case-specific and require mental or manual work. In an analysis step, we perform actual analysis on the GEM. All analysis steps are automated within PhenoMapping. We emphasize the applications of this protocol to study eukaryotic pathogens like malaria (Plasmodium) and toxoplasma (Toxoplasma) parasites, for which there is a higher uncertainty in the metabolism and growing conditions. This protocol can be easily adjusted for other complex eukaryotic organisms like human cells and also for prokaryotic systems.

Figure 2. The PhenoMapping workflow showing steps (colored boxes) and input data (boxes marked with databases)

A GEM is a necessary input (solid arrow) to the workflow, and phenotypic and omics data are optional inputs (dashed arrows). When phenotypic data are not available, a purely in silico analysis of predicted phenotypes and bottlenecks is performed. The PhenoMapping workflow involves five steps to map phenotypes to bottlenecks: PhenoMapping study design (blue), metabolic model contextualization (lilac), essentiality prediction (pink), accuracy assessment (red), and bottleneck identification (green). An additional step can be added to curate the metabolic model if needed (yellow). The PhenoMapping workflow is often iterative (feedback loop). We include an approximate assessment of the time each step takes.

PhenoMapping study design

Timing: 1–30 min

The PhenoMapping workflow is summarized in Figure 2. The first step is a setup step and involves the design of a PhenoMapping analysis. PhenoMapping classifies the information included in a GEM in two classes: organism-specific and context-specific information. Each class has also its layers of information that correspond to constraints in a GEM with different types of physico-chemical meaning and hierarchy (Figure 3). In the PhenoMapping study design, we use the knowledge of these layers to select pieces of information and data (among all datasets collected in the Before you begin section) to integrate into a GEM for the next analyses.

Note: the hierarchy of the organism-specific layers in PhenoMapping makes it possible to distinguish between two classes of incorrect essentiality predictions or false negatives: those arising due to a lack of information in the model, like missing genome annotations, and those arising due to an incorrectly defined pre-assumed transport/reaction directionality or enzymatic irreversibility (ad hoc constraints). PhenoMapping suggests adding first all possible missing gene annotations and metabolite transports, and later introducing ad hoc irreversibility constraints when needed. The hierarchy of the context-specific layers in PhenoMapping is suggested based on the uncertainty of the data and methodology to integrate such data into the GEM. For example, data on media composition tend to be more reliable than measures of RNA levels. In addition, simulating the effect of a lack of substrate in the medium is more straightforward than simulating the effect of an RNA level on the cellular physiology using a GEM.

1.
Select the type of PhenoMapping analysis to perform:
- a.
  Organism-specific PhenoMapping analysis to identify unconditionally essential genes and curate a generic GEM. This analysis also maps phenotypes to the following layers of information: (1) metabolic functions annotated to the genome, (2) enzyme localization, (3) transportability of metabolites between intracellular compartments, (4) enzymatic irreversibility, and (5) a set of metabolic tasks related to biomass production (Figure 3).
- b.
  Context-specific PhenoMapping analysis to study conditional essentiality and generate a context-specific GEM. This analysis also maps phenotypes to the following layers of information: (6) media composition or uptakes, (7) thermodynamic feasibility at some given intracellular conditions including metabolite concentrations, (8) gene expression, and (9) transcriptional regulation or regulation of expression between isoenzymes (Figure 3).

2.
Within an organism-specific or context-specific analysis, select the layer(s) of information to keep within the GEM:
- a.
  Biochemistry layer. Analysis of the biochemistry layer serves to identify essential genes as defined uniquely by the genome annotation and metabolic capabilities included in the GEM. This analysis does not account for any physiological constraint in cellular metabolism and hence allows identification of metabolic gaps purely due to missing functional annotations.
- b.
  Localization layer (in eukaryotes). A comparative analysis between the biochemistry and localization layers will identify genes that become essential due to compartmentalization of enzymes and metabolic pathways. The localization layer includes (on top of the biochemistry layer) localization of enzymes and metabolites, and allows transport between the cytosol and other compartments of all metabolites that lack a phosphate, acyl-carrier protein (ACP), or CoA moiety. If there is experimental evidence of a transporter for a metabolite with a phosphate, ACP, or CoA moiety, this transport should be allowed.
  Note: there is arguably some uncertainty in transport mechanisms and annotated transporters in cells and cellular organisms. In PhenoMapping, as done before (Chiappino-Pepe et al., 2017; Krishnan et al., 2020; Stanway et al., 2019; Tymoshenko et al., 2015), we assume that any metabolite that contains a phosphate, acyl-carrier protein (ACP), or CoA moiety does not easily cross lipid bilayer membranes by simple diffusion and requires a specialized transporter or transport mechanism. The exception is free phosphate, which cells normally take up.
- c.
  Intracellular transportability layer (in eukaryotes). A comparative analysis between the localization and intracellular transportability layers will highlight genes that become essential when there are constraints on intracellular transportability. Besides the localization information, this analysis integrates ad hoc directionalities for transporters or blocked transports.
- d.
  Enzymatic irreversibility layer. Analysis of the enzymatic irreversibility layer can be compared to the biochemistry (in prokaryotes) or intracellular transportability (in eukaryotes). Such comparison suggests genes that become essential due to irreversible biotransformations (Ataman and Hatzimanikatis, 2015). This layer includes ad hoc and pre-assigned reaction directionalities that should be applicable in every growing context and life-stage.
  Note: many GEMs tend to include pre-assumed reaction directionalities as ad hoc reaction bounds. While this approach might increase the prediction accuracy at a specific growth condition, it can also limit the usage of such GEMs in a different scenario and the identification of actual bottlenecks responsible for a phenotype. For example, there might be a set of metabolites whose concentrations are responsible for those reaction directionalities (see Metabolomics layer); we would not identify these bottleneck metabolites when the source of the reaction directionality is not TFA but ad hoc reaction bounds. We recommend defining generic GEMs with the minimum information on a context such that they serve as platforms for integration of media composition and omics data and generation of context-specific GEMs. Such a strategy was followed before (Krishnan et al., 2020; Stanway et al., 2019) with the generation of generic GEMs like iPbe and iTgo and context-specific GEMs like iPbe-blood, iPbe-liver, and iTgo-tachy.
- e.
  Metabolic tasks. Metabolic tasks serve to evaluate in a modular way how metabolism works, as described before (Agren et al., 2013; Carey et al., 2017; Chiappino-Pepe et al., 2017; Richelle et al., 2019; Tymoshenko et al., 2015; Wang et al., 2018). In an analysis of metabolic tasks, we define a set of input molecules (extracellular nutrients or intracellular precursors) and expected output molecules (biomass precursors or expected end products of a metabolic pathway). Next, we evaluate whether the task is feasible. A task is feasible when it is possible to produce all output molecules using the input molecules. We next evaluate what metabolic pathway was used and which genes are essential to fulfill a task.
- f.
  Media layer. Analysis of in silico minimal media allows one to study nutritional requirements, evaluate substrate substitutability, and identify genes that become essential upon substrate inaccessibility. The minimum sets of substrates that rescue essentiality when added to an in silico minimal medium are bottleneck substrates (Stanway et al., 2019). The media layer study comprises a systematic analysis of in silico minimal media, essentiality at each minimal medium, and identification of bottleneck substrates.
  Note: media analysis with PhenoMapping is especially useful in organisms for which the growing conditions (media composition) and nutritional requirements are uncertain. This is the case in intracellular parasites (Chiappino-Pepe et al., 2017; Krishnan et al., 2020; Stanway et al., 2019; Tymoshenko et al., 2015).
- g.
  Metabolomics layer. Thermodynamics-based flux analysis will pinpoint genes that become essential due to a set of reaction directionalities imposed by thermodynamic constraints. It is possible to identify sets of metabolites whose concentration ranges determine such reaction directionalities and these are called bottleneck metabolites (Chiappino-Pepe et al., 2017). The metabolomics layer study involves a systematic integration of metabolomics data within the TFA framework (Salvy et al., 2018), thermodynamically consistent essentiality analysis with or without metabolomics, and identification of bottleneck metabolites.
- h.
  Transcriptomics layer. Integrative analysis of RNA-seq data helps to identify genes that become essential due to gene expression constraints. TEX-FBA (Pandey et al., 2019) tries to maximize consistency between RNA levels and metabolic reaction fluxes. Genes fall into three classes: highly, medium, and lowly expressed (defined by TEX-FBA parameters; Troubleshooting 8). For reactions linked to highly expressed genes, TEX-FBA tries to increase metabolic flux. For reactions uniquely linked to lowly expressed genes, TEX-FBA tries to minimize metabolic flux. The number of such agreements defines a consistency score, which TEX-FBA maximizes. Reaction fluxes linked to genes with medium expression are free to vary. PhenoMapping uses TEX-FBA to integrate RNA-seq data and calculate a maximum consistency score. It next performs essentiality analysis at the maximum consistency score and identifies the metabolic fluxes that are responsible for a gene essentiality, also called bottleneck reactions (Stanway et al., 2019).
- i.
  Regulation layer. Transcriptomics data analysis also allows identifying isoenzymes that become essential due to lack of transcriptional regulation of counterpart isoenzymes, also called bottleneck isoenzymes. The PhenoMapping regulation analysis includes a systematic integration of transcriptomics data within the TEX-FBA framework (Pandey et al., 2019), essentiality analysis with transcriptomics considering lack of regulation between isoenzymes, and identification of bottleneck isoenzymes.
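The three expression classes described for the transcriptomics layer can be illustrated with a minimal Python sketch. It only shows the idea of percentile-based cutoffs; the function names, the interpolation rule, and the 25th/75th percentile defaults are assumptions made for illustration and are not taken from TEX-FBA, whose actual parameters are discussed in Troubleshooting 8.

```python
def percentile(sorted_values, p):
    """Linearly interpolated percentile (0 <= p <= 100) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("no expression values given")
    rank = (len(sorted_values) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    span = sorted_values[upper] - sorted_values[lower]
    return sorted_values[lower] + span * (rank - lower)


def classify_expression(expression, low_pct=25.0, high_pct=75.0):
    """Map each gene to 'low', 'medium', or 'high'.

    low_pct and high_pct are hypothetical defaults, not TEX-FBA defaults.
    """
    values = sorted(expression.values())
    low_cut = percentile(values, low_pct)
    high_cut = percentile(values, high_pct)
    classes = {}
    for gene, level in expression.items():
        if level <= low_cut:
            classes[gene] = "low"
        elif level >= high_cut:
            classes[gene] = "high"
        else:
            classes[gene] = "medium"
    return classes


# Toy RNA levels for five hypothetical genes
rna = {"g1": 0.5, "g2": 3.0, "g3": 12.0, "g4": 40.0, "g5": 250.0}
classes = classify_expression(rna)
```

Genes in the "medium" class would carry no expression constraint, which matches the statement that their reaction fluxes are free to vary.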

Layers of information or physico-chemical constraint types in a GEM and their hierarchy as suggested within PhenoMapping

Metabolic model contextualization

Timing: 10–30 min

The second step in the PhenoMapping workflow (Figure 2) is a setup step and involves the contextualization of the GEM as designed in the first step. Predictions from a GEM are the consequence of the biochemical information and physico-chemical constraints integrated into the GEM (Figure 3). Here, we describe how to define each layer of information in the GEM. We assume that the initial GEM already includes all the corresponding organism-specific information and putatively context-specific information like a medium composition.

3.
Select one of the following steps to perform a PhenoMapping analysis at each iteration.
- a.
  Generate a GEM with the ground biochemistry layer.
  - ▪
    Remove all constraints related to omics data integrated into the GEM.
  - ▪
    Define a rich medium and allow uptake and secretion of all metabolites in the medium.
  - ▪
    Remove ad hoc reaction (intracellular reactions and transports) directionalities in the GEM.
  - ▪
    If applicable (eukaryotic organism), remove compartmentalization. This is done by defining all reactions in the cytosol or by allowing all metabolites to be transported and present in all intracellular compartments.
- b.
  Generate a GEM with localization layer (in eukaryotes).
  - ▪
    Remove all constraints related to omics data integrated into the GEM.
  - ▪
    Define a rich medium and allow uptake and secretion of all metabolites in the medium.
  - ▪
    Remove ad hoc reaction (intracellular reactions and transports) directionalities in the GEM.
  - ▪
    If applicable (eukaryotic organism), allow all metabolites without a phosphate, acyl-carrier protein (ACP), or CoA moiety to be transported between the cytosol and other compartments. If there is experimental evidence of a transporter for a metabolite with a phosphate, ACP, or CoA moiety, this transport should be allowed.
- c.
  Generate a GEM with intracellular transportability layer (in eukaryotes).
  - ▪
    Remove all constraints related to omics data integrated into the GEM.
  - ▪
    Define a rich medium and allow uptake and secretion of all metabolites in the medium.
  - ▪
    Remove ad hoc directionalities only for intracellular reactions in the GEM.
    Note: do not modify the directionalities of intracellular metabolite transportability from the initial GEM.
- d.
  Generate a GEM with enzymatic irreversibility layer.
  - ▪
    Remove all constraints related to omics data integrated into the GEM.
  - ▪
    Define a rich medium and allow uptake and secretion of all metabolites in the medium.
- e.
  Prepare a GEM for metabolic tasks analysis.
  - ▪
    Decide whether an analysis with or without thermodynamic constraints should be performed.
  - ▪
    Select the sets of metabolites whose production you want to test. By default, this will be all biomass building blocks.
  - ▪
    Define an essentiality threshold (see section Essentiality definition).
- f.
  Generate a thermodynamically curated GEM for subsequent context-specific PhenoMapping analysis.
  - ▪
    Remove all constraints related to omics data integrated into the GEM.
  - ▪
    Define a rich medium and allow uptake and secretion of all metabolites in the medium.
  - ▪
    Define thermodynamically relevant information for each intracellular compartment: pH, generic metabolite concentrations (minimum and maximum allowed values), generic ionic strength (unique value), membrane potential.
  - ▪
    Curate the GEM thermodynamically within the TFA framework.
- g.
  Generate a GEM for a media analysis.
  - ▪
    Select whether analysis of uptakes or secretions should be performed.
  - ▪
    Select the sets of transports of extracellular metabolites among which the minimal uptake or secretion analysis will be performed; Troubleshooting 4. By default, all substrates in the media will be selected. We indicate below two types of analysis (targeted or untargeted) that one can perform depending on the substrates made available and the uptakes selected for media analysis.
    CRITICAL: it is important to note that the algorithm will not unblock uptakes or secretions in the GEM. If uptakes and secretions were blocked in the GEM input to the PhenoMapping media analysis, they will remain blocked in the media analysis. Hence, it is important to properly define a media composition before one begins the PhenoMapping analysis (see section Medium definition).
    
    Recommended: For an untargeted analysis of in silico minimal media, one should have defined a rich medium in the GEM (see the Medium definition section; Troubleshooting 4). At this stage, all uptakes should be selected for the media analysis. Following this setup, one will identify all alternative minimal sets of molecules required for in silico growth in the correct combination. It was shown before (Chiappino-Pepe et al., 2017) that such an analysis provides further understanding of the molecular substructures or backbone moieties that a cell needs to scavenge. The requirement to scavenge such moieties occurs when the biochemical information and further physio-chemical constraints defined in the GEM (that probably represent the metabolic function of the organism) do not allow the biosynthesis of such backbone moieties.
    
    Alternatives: For a targeted analysis of in silico minimal media, one should have defined a specific medium in the GEM before beginning the PhenoMapping analysis. In this medium the GEM should be feasible. At this stage, one defines the subset of substrates of interest for the media analysis. This analysis identifies within the subset of substrates, the minimum number of substrates required to achieve a minimum value of the objective (e.g., growth).
  - ▪
    Perform the in silico minimal medium analysis.
  - ▪
    Define in the GEM a combined minimal medium comprising all substrates identified across all alternative in silico minimal media. We use minimal media alternatives that contain the same number of substrates.
    CRITICAL: defining here a combined minimal medium simplifies the process to infer the medium of a context-specific GEM. Check the section Quantification and statistical analysis (Results of a PhenoMapping analysis of bottleneck substrates) for more details on the importance of this last step to optimally guide the definition of the media in the context-specific GEM.
- h.
  Generate a GEM for a metabolomics analysis.
  - ▪
    Integrate the metabolomics dataset into the GEM.
  - ▪
    Verify that the GEM is feasible within TFA when metabolomics data are integrated; Troubleshooting 9.
- i.
  Generate a GEM for a transcriptomics and regulation analysis.
  - ▪
    Decide whether or not to plot the distribution of gene expression values.
  - ▪
    Select the TEX-FBA parameters defining the percentile of lowly and highly expressed genes in the distribution of gene expression values; Troubleshooting 8.
  - ▪
    Select the TEX-FBA parameters defining the bounds assigned to lowly and highly expressed reactions; Troubleshooting 8.
  - ▪
    Select the reactions for which gene expression constraints should not be defined.
  - ▪
    Decide which transcriptomics profile an output GEM should include.
    Note: upon integration of transcriptomics data, TEX-FBA will identify all alternative transcriptomic profiles that render a maximum consistency score between gene and reaction levels. One can select one specific transcriptomic profile for the subsequent analysis. Alternatively, one can also select a combined expression profile, which will account uniquely for the expression constraints common to all transcriptomic profiles.
  - ▪
    Integrate the transcriptomics dataset into the GEM.
- j.
  Define the following inputs common to any context-specific PhenoMapping analysis.
  - ▪
    Remove all constraints related to omics data integrated into the GEM. This is not necessary if one uses a generic GEM.
  - ▪
    Define the expected value of the selected objective function (normally growth) at the conditions to study.
  - ▪
    Define the selected essentiality threshold (see section Genetic background and essentiality definition).
    Note: a value of the selected objective function and essentiality threshold will be used to identify a minimum required objective value. For example, in the media analysis, we first identify the in silico minimal media or the minimum number of substrates required to achieve at least a required value of the objective. Such value is given by the input values of the objective function and essentiality threshold.
  - ▪
    Select a time limit (in seconds) for the CPLEX solver. By default, none.
  - ▪
    Define whether one wants to identify uniquely alternatives for the optimal value of the objective function (preferred). Alternatively, one can look for suboptimal solutions. For example, one can find an in silico minimal medium with 19 substrates and identify all alternative combinations of 19 substrates that allow growth. One can also identify alternative combinations with 20 or more substrates.
  - ▪
    Select a maximum number of alternative solutions to obtain. This is applicable every time a mixed integer formulation is defined. For example, for the identification of alternative in silico minimal media and alternative bottleneck substrates.
    Note: it is preferred to select a high number of alternatives like 5,000; check Troubleshooting 10 for suggestions when the optimization crashes or the number of alternatives selected was not enough.
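Two of the setup computations in step 3 can be sketched in Python. The first assumes that the minimum required objective value is the product of the expected objective value and the essentiality threshold, which is consistent with step 8b but is not stated explicitly for step 3j. The second builds the combined minimal medium of step 3g as the union of all alternative minimal media that contain the smallest number of substrates. Function names and substrate names are hypothetical.

```python
def minimum_required_objective(expected_objective, essentiality_threshold):
    """Assumed rule: required value = expected objective x essentiality threshold."""
    if not 0.0 <= essentiality_threshold <= 1.0:
        raise ValueError("essentiality threshold must lie between 0 and 1")
    return expected_objective * essentiality_threshold


def combined_minimal_medium(alternatives):
    """Union of substrates over the alternatives with the fewest substrates."""
    if not alternatives:
        raise ValueError("no alternative minimal media given")
    smallest = min(len(medium) for medium in alternatives)
    combined = set()
    for medium in alternatives:
        if len(medium) == smallest:  # keep only same-size (minimal) alternatives
            combined |= set(medium)
    return combined


# Toy alternatives: two media of size 2 and one suboptimal medium of size 3
alternatives = [
    {"glucose", "hypoxanthine"},
    {"glucose", "adenosine"},
    {"glucose", "inosine", "choline"},
]
combined = combined_minimal_medium(alternatives)
required = minimum_required_objective(0.2, 0.1)
```

In this toy case the suboptimal three-substrate alternative is ignored, so the combined medium contains glucose, hypoxanthine, and adenosine only.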

Essentiality prediction

Timing: 1–10 min

The third step is an analysis step and involves a prediction of gene essentiality with the contextualized GEM. If one performs a context-specific PhenoMapping analysis, the set of essential genes will be compared with the unconditionally essential genes or genes predicted as essential with the general GEM input to the PhenoMapping workflow. If one performs an organism-specific PhenoMapping analysis, one might compare the set of essential genes with the ones obtained in the immediate previous layer of information. Predictions of gene essentiality are expected to vary between the contextualized GEMs. Here, we define the suggested steps to identify essential genes in any GEM within PhenoMapping.
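The two comparisons described above reduce to set differences between lists of essential genes. The following Python sketch is an illustration only; the layer order follows the organism-specific hierarchy of step 2, and all function and gene names are hypothetical.

```python
# Organism-specific layers in the order suggested in step 2 (a to d)
LAYERS = [
    "biochemistry",
    "localization",
    "intracellular transportability",
    "enzymatic irreversibility",
]


def newly_essential(essential_by_layer, order=LAYERS):
    """Genes that become essential at each layer relative to the previous analyzed layer."""
    gained = {}
    previous = None
    for layer in order:
        if layer not in essential_by_layer:
            continue  # layer not analyzed, e.g., localization in prokaryotes
        current = set(essential_by_layer[layer])
        if previous is not None:
            gained[layer] = current - previous
        previous = current
    return gained


def conditionally_essential(context_essential, generic_essential):
    """Genes essential in the context-specific GEM but not in the generic GEM."""
    return set(context_essential) - set(generic_essential)


results = {
    "biochemistry": {"geneA"},
    "localization": {"geneA", "geneB"},
    "enzymatic irreversibility": {"geneA", "geneB", "geneC"},
}
gained = newly_essential(results)
conditional = conditionally_essential({"geneA", "geneD"}, {"geneA"})
```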

Note: check Troubleshooting 11 if infeasibilities arise when calculating essentialities.

Optional: In a transcriptomic analysis, one can perform two types of essentiality analysis consistent with transcriptomics: (a) fix a unique transcriptomic profile; this is done by fixing the integer variables linked to all up and down reaction levels; (b) fix an assembly of transcriptomic profiles that satisfy a maximum consistency score. The maximum consistency score is a variable within TEX-FBA and defines the number of reactions that can carry low or high fluxes and are consistent with the classification of lowly and highly expressed genes, respectively. There may exist multiple alternative transcriptomic profiles that share the same maximum consistency score. These alternatives form an assembly of transcriptomic profiles. To fix such an assembly, we set the lower bound of the maximum consistency score to a value that is some decimals below the optimal objective value. This avoids problems with the precision of the solver; Troubleshooting 12. To provide some flexibility around the transcriptomic profiles, one can further relax the lower bound of the maximum consistency score by some integers. Option (b) only differs from option (a) if there is more than one alternative transcriptomic profile for the maximum consistency score within TEX-FBA.

4.
Define the objective function chosen for the GEM; Troubleshooting 5.
5.
Double check that the contextualized GEM is feasible; Troubleshooting 6.
6.
Perform the in silico essentiality study.

Optional: An essentiality analysis per growth associated metabolic task can be performed. This will identify which biomass building block is responsible for the observed essentiality (Chiappino-Pepe et al., 2017) (Figure 4).

Schema of growth simulation using a GEM

(A) The GEM uses a set of substrates or nutrients to produce molecules required for growth or biomass building blocks in the stoichiometrically required amounts (η_i). Biomass building blocks are monomers of macromolecules required for the cellular function. The stoichiometric coefficients of the biomass building blocks satisfy the concentration of macromolecules in the cell (v, w, x, y, z).

(B) Predicted growth upon single knockout of each gene in the blood-stage-specific *P. berghei* GEM (iPbe-blood). We classify genes based on the essentiality threshold (dashed line bottom) and growth reduction threshold (dashed line top) into essential, growth reducing, and dispensable. The essentiality threshold (here 10%) and growth reducing threshold (here 90%) define which genes are essential and growth reducing, respectively, based on the predicted growth upon knockout (KO growth) compared to the predicted wild-type growth (WT growth). iPbe-blood predicts 146 essential genes, 9 growth reducing genes, and 273 dispensable genes (with solver version CPLEX 12.8.1 or above). Acronyms: AAs, amino acids; PG, phosphatidylglycerol; PE, phosphatidylethanolamine; PS, phosphatidylserine; PC, phosphatidylcholine; free FA, free fatty acids; TAG, triacylglycerol; DAG, diacylglycerol.

7.
Select an output to identify essential genes:
- a.
  Ratio of optimal value of the objective function between the input (normally wild-type) GEM and the single gene knockout GEM.
- b.
  Absolute value of the objective function in the single gene knockout GEM.

Note: both approaches, if compared with the corresponding reference values (as described in the next point), ultimately result in the same set of essential genes. However, selecting the absolute value of the objective function to identify essential genes allows one to define an arbitrary value of the objective function as reference. This latter option might be more appropriate in a situation of high uptake rates (very unconstrained model) and high growth.

8.
Identify all knockouts rendering a value below the essentiality threshold chosen (Figure 4).
- a.
  If the ratio is below the essentiality threshold the gene is considered as essential.
- b.
  If the predicted value of the objective function is below the product of the essentiality threshold and the initial value of the objective function the gene is considered as essential.

Note: by default, PhenoMapping considers a gene essential when its knockout renders an infeasible solution (NaN). However, a lack of convergence in the optimization and problems with the solver might also render infeasible solutions; Troubleshooting 12.
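The classification in steps 7 and 8 can be sketched in Python as follows. This is a minimal illustration, not the PhenoMapping implementation: the function name is hypothetical, the default thresholds (10% and 90%) are the values quoted in the Figure 4 caption, and knockout growth values are assumed to be already computed with the GEM. Infeasible knockouts (NaN or missing values) are treated as essential, following the default described above.

```python
import math


def classify_knockouts(ko_growth, wt_growth,
                       essentiality_threshold=0.1, reduction_threshold=0.9):
    """Classify genes from predicted knockout growth (step 8a, ratio-based)."""
    if wt_growth <= 0:
        raise ValueError("wild-type growth must be positive")
    classes = {}
    for gene, growth in ko_growth.items():
        # Infeasible knockouts are reported as NaN (or None here) and count as essential
        if growth is None or math.isnan(growth):
            classes[gene] = "essential"
            continue
        ratio = growth / wt_growth
        if ratio < essentiality_threshold:
            classes[gene] = "essential"
        elif ratio < reduction_threshold:
            classes[gene] = "growth reducing"
        else:
            classes[gene] = "dispensable"
    return classes


# Toy knockout growth values for a wild-type growth of 1.0
ko = {"a": 0.0, "b": 0.5, "c": 1.0, "d": float("nan"), "e": 0.05}
phenotypes = classify_knockouts(ko, wt_growth=1.0)
```

Step 8b is the same test written on absolute values: a knockout growth below `essentiality_threshold * wt_growth` is equivalent to a ratio below `essentiality_threshold` whenever the reference growth is positive.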

Accuracy assessment

Timing: 1–5 min

The fourth step is an analysis step and involves an accuracy assessment of the gene essentiality prediction with the contextualized GEM. This step is possible when there are phenotypic data available in the PhenoMapping workflow (Figure 2). In this step, a contingency matrix is generated (Figure 5) and the set of correct and incorrect predictions is identified.

9.
Compare the list of in silico essential and non-essential genes with the experimentally observed (in vivo) phenotypes.
10.
Classify all compared genes in the GEM in four groups: TPs, TNs, FPs, and FNs.

Note: if “slow” phenotypes are available, one should decide how to treat them, i.e., as essential or dispensable. This decision might be determined by the layer of information analyzed within PhenoMapping. For example, during a PhenoMapping analysis of the biochemistry layer, slow phenotypes might be better considered as dispensable. This is because one expects that slow phenotypes arise due to the presence of a redundant and non-optimal function that can partially compensate for the loss of the slow-phenotype gene. However, during a PhenoMapping analysis of the transcriptomics layer, slow phenotypes might well be considered as essential. This is because a GEM with transcriptomics data integrated identifies genes that are essential to maintain the defined (optimal) transcriptomic state. Hence, knocking out such a gene might render a transition to a different (suboptimal) transcriptomic or physiological state.

Optional: one can also add a classification for genes without data, blocked, or with slow phenotypes (Figure 5). The genes classified as blocked and without data are not considered for the computation of the accuracy. The treatment of the slow phenotypes may vary depending on the context of the GEM.

Note: Blocked genes are genes linked to reactions that cannot carry any flux in the reference conditions, also called blocked reactions. This occurs when any of the metabolites participating in the linked reactions cannot be mass balanced. Blocked reactions are identified with a flux variability analysis (Mahadevan and Schilling, 2003). We recommend performing a flux variability analysis in the generic GEM (in a rich medium and without any data integrated) and without any growth requirement. That way, there are no conditional or context-specific constraints leading to the non-function of the gene.

11.
Generate a contingency matrix by defining the number of TPs, TNs, FPs, and FNs.
12.
Compute the selected metric to assess the accuracy using the numbers defined in the contingency matrix.

Note: there might be situations in which the number of available in vivo phenotypes is very low compared to the number of genes in the GEM or with in silico phenotypes. Such cases decrease the confidence on the accuracy of the model. Although there is little that a user can do to improve such a situation regarding the availability of phenotypes, the user can choose how to evaluate the FPs and FNs to better assess the accuracy of the GEM’s predictions. FNs arise when the model misses biochemical information or gene annotations or has incorrectly defined constraints. FPs arise when the model misses the definition of a context or constraints. As previously mentioned in this protocol, one can argue that (if the data are fully trustable) having more FNs than FPs in a GEM is worse than having more FPs than FNs. This is because one does not want to include false constraints into the GEM. In many situations a single constraint in the GEM (as identified with PhenoMapping) can be responsible for the essentiality of a FN and a TN. In such a case, we recommend not blindly integrating such constraint to increase the TNs (since that will also increase the FNs). We recommend first adding missing information into the GEM like new biochemistry or gene annotations such that later the named constraint becomes responsible uniquely for the essentiality of the TN. This means we recommend focusing first on correcting or reducing FNs (increasing TPs) and then on reducing FPs (increasing TNs) using PhenoMapping. See the follow-up discussion in the section Quantification and statistical analysis. The fact that PhenoMapping maps phenotypes to constraints may also increase the confidence on the prediction of genes without phenotype. If a constraint is responsible for one or more TNs and a gene for which no in vivo phenotype is available, we might feel more confident on the essentiality of the gene – primarily if the TNs and the gene without phenotype share metabolic pathways or tasks.

Contingency matrix for the blood-stage-specific *P. berghei* GEM (iPbe-blood) compared to the blood-stage-specific PlasmoGEM phenotypes

The accuracy values for this contingency matrix are: MCC = 0.63, overall accuracy = 0.79, NPR = 0.96, PPR = 0.67, sensitivity = 0.96, and specificity = 0.67.
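Steps 9 to 12 can be sketched in Python as shown below. The sketch assumes the convention implied by the note above, in which negatives are genes predicted as essential in silico (a TN is essential both in silico and in vivo, and an FN is essential in silico only). Only the overall accuracy and the Matthews correlation coefficient (MCC) are computed, using their standard definitions. Function names, labels, and gene names are hypothetical.

```python
import math


def contingency(in_silico, in_vivo, blocked=(), slow_as="dispensable"):
    """Count TP, TN, FP, FN; genes that are blocked or lack data are skipped."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for gene, predicted in in_silico.items():
        if gene in blocked or gene not in in_vivo:
            continue
        observed = in_vivo[gene]
        if observed == "slow":
            observed = slow_as  # user decision, see the note on slow phenotypes
        if predicted not in ("essential", "dispensable") or \
                observed not in ("essential", "dispensable"):
            raise ValueError(f"unexpected label for gene {gene}")
        if predicted == "dispensable":
            counts["TP" if observed == "dispensable" else "FP"] += 1
        else:
            counts["TN" if observed == "essential" else "FN"] += 1
    return counts


def overall_accuracy(c):
    total = c["TP"] + c["TN"] + c["FP"] + c["FN"]
    return (c["TP"] + c["TN"]) / total if total else float("nan")


def mcc(c):
    denom = math.sqrt((c["TP"] + c["FP"]) * (c["TP"] + c["FN"])
                      * (c["TN"] + c["FP"]) * (c["TN"] + c["FN"]))
    # Common convention: MCC is reported as 0 when the denominator vanishes
    return (c["TP"] * c["TN"] - c["FP"] * c["FN"]) / denom if denom else 0.0


in_silico = {"g1": "essential", "g2": "essential", "g3": "dispensable",
             "g4": "dispensable", "g5": "dispensable", "g6": "essential",
             "g7": "dispensable"}
in_vivo = {"g1": "essential", "g2": "dispensable", "g3": "dispensable",
           "g4": "essential", "g5": "slow", "g6": "essential"}
counts = contingency(in_silico, in_vivo, blocked={"g6"})
```

Changing `slow_as` to `"essential"` reclassifies the slow-phenotype gene, which is a quick way to see how that decision affects the accuracy at a given layer.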

Bottleneck identification

Timing: 5–60 min

The fifth and last step of the PhenoMapping workflow is an analysis step and involves the mapping of bottlenecks to phenotypes. After new genes are identified as essential in the contextualized GEM, PhenoMapping will identify the bottlenecks or underlying cellular processes responsible for that essentiality. This is done by performing one-by-one a knockout of the essential genes in the contextualized GEM and identifying the conditions that rescue growth (Figure 6). Here, we define the steps to identify bottlenecks as followed within the systematic bottleneck analysis of each layer of information.

13.
Knock out the essential gene in the contextualized GEM.
14.
Identify all alternative bottlenecks or the minimum set of information (e.g., substrates, metabolite concentrations, reaction levels) that should be relaxed to rescue the gene knockout (Figure 6).

Note: bottleneck substrates are those that can rescue essentiality of the gene when added to the in silico minimal medium. Bottleneck metabolites are those whose concentration ranges should be relaxed (with respect to the experimentally measured concentration ranges) to rescue essentiality of the gene. Bottleneck reactions are those whose levels (considered to be high or low within the feasible flux range) should be relaxed to rescue essentiality of the gene.
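The search for bottleneck substrates in steps 13 and 14 can be illustrated with a small Python sketch. It uses a brute-force enumeration over candidate substrates, which is a stand-in chosen for readability and is not the mixed-integer formulation used by PhenoMapping. The `grows` callback is hypothetical and represents an optimization of the GEM with the essential gene knocked out; here it is replaced by a toy rule.

```python
from itertools import combinations


def bottleneck_substrates(minimal_medium, candidates, grows, max_size=None):
    """Return all smallest sets of substrates that rescue growth when added."""
    base = set(minimal_medium)
    pool = sorted(set(candidates) - base)
    limit = len(pool) if max_size is None else min(max_size, len(pool))
    for size in range(1, limit + 1):
        rescues = [set(extra) for extra in combinations(pool, size)
                   if grows(base | set(extra))]
        if rescues:
            return rescues  # stop at the minimum size, keep all alternatives
    return []


def grows_without_gene_x(medium):
    # Toy rule standing in for a GEM simulation with hypothetical gene X knocked out:
    # growth needs glucose plus one purine source.
    return "glucose" in medium and (
        "hypoxanthine" in medium or "adenosine" in medium)


rescues = bottleneck_substrates(
    minimal_medium={"glucose"},
    candidates={"hypoxanthine", "adenosine", "choline"},
    grows=grows_without_gene_x,
)
```

In this toy case there are two alternative bottleneck substrates of size one, adenosine and hypoxanthine. The enumeration grows combinatorially with the number of candidates, so it is only practical for small candidate sets.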

Representation of bottlenecks studies in PhenoMapping

(A–C) (A) Bottleneck substrates, (B) bottleneck metabolites, and (C) bottleneck reaction levels. PhenoMapping first simulates with the GEM some conditions (here: (A) *in silico* minimal media, (B) metabolomics data integrated, and (C) transcriptomics data integrated) and identifies *in silico* a phenotype (here single gene essentiality for growth). Next, PhenoMapping looks for the bottlenecks responsible for the predicted phenotype (here: (A) missing substrates in the media, (B) sets of metabolite concentration ranges, and (C) sets of levels of reaction fluxes and their corresponding RNA levels). The color code is consistent with the related step in the main PhenoMapping workflow (Figure 2).

Metabolic model curation

Timing: 1–60 days

Note: the timing to curate a GEM varies radically depending on the available GEM and data and the experience and endurance of the user. Many aspects of the GEM can require curation. The problems in a GEM can range from being badly elementally balanced to missing a considerable amount of gene annotations and associated reactions; see Troubleshooting 2 to identify all elements that an ideal GEM may include. This section aims to define a pipeline to spot and solve those problems faster.

This is a step that combines both setup and analysis steps. We curate a metabolic model when we change the biological and biochemical information it contains. Such information involves genes, gene functions, protein associations (protein complexes or isoenzymes), biochemical reactions, transporters, and biomass building blocks. Since GEMs are normally constructed following a bottom-up approach, it is more likely that a metabolic model curation involves adding missing information. However, curation of a GEM might also involve removing incorrectly defined ad hoc constraints.

The curation of the GEM is an optional step in the iterative PhenoMapping workflow (Figure 2). The information achieved by mapping in silico bottlenecks to phenotypes facilitates and accelerates the identification of missing biological and biochemical information in the GEM, as well as incorrectly defined ad hoc constraints.

A suggested workflow to curate the GEM using PhenoMapping is defined in Figure 7. This workflow is primarily manual. Here, we suggest conceptually how to perform a GEM curation in combination with the main PhenoMapping workflow (Figure 2). We propose analyzing one-by-one the incorrect gene predictions: first the FNs and then the FPs. One may follow the steps below in the order defined and select the steps depending on the type of inconsistency (FN or FP) for the gene of study. If the GEM is modified, one may perform a new essentiality analysis and accuracy assessment to evaluate the impact of the curation on the GEM performance.

Note: Adding information around true predictions may prevent mismatches in a more constrained scenario. For instance, one needs to identify isoenzymes linked to a reaction (even if the reaction is true positive) to prevent it from being false negative in a constrained scenario. When one supposedly has all information mapped to the GEM, one may wait until a false prediction arises to introduce corrective measures. False predictions arise normally in the PhenoMapping analysis with layers of information that are hierarchically higher. Alternatively, one may perform an unbiased integration of alternative information around true predictions and screen the performance of the model in a more constrained scenario for selection of the best corrective measure. This latter option is not discussed here.

Suggested workflow to curate a metabolic model in combination with the PhenoMapping workflow

This workflow identifies missing biological and biochemical information in a GEM.

(A) The workflow to curate false negatives (FNs) involves four steps.

(B) The workflow to curate false positives (FPs) includes three steps. The curation of a GEM requires collecting and using different types of datasets. The color code is consistent with the related step in the main PhenoMapping workflow (Figure 2).

In this section, we do not consider the integration of context-specific information (like definition of uptake rates and integration of metabolomics and transcriptomics data) as part of the metabolic model curation. We consider that the integration of context-specific information is part of the metabolic model contextualization. The section Quantification and statistical analysis explains how to contextualize a GEM based on the bottleneck information from a context-specific PhenoMapping analysis.

Note: curating a metabolic model can be a daunting and time-consuming task. Be patient, do sports, eat healthy, and talk with friends and family to remain mentally ok.

15.
(FN) Perform an essentiality analysis per metabolic task.
- a.
  If there is a metabolic task uniquely responsible for a set of FNs and no true prediction, remove the biomass building block from the biomass reaction.
- b.
  If a biomass building block was removed, update the stoichiometric coefficients of the remaining biomass building blocks accordingly (Chan et al., 2017).
16.
(FN) Perform a reannotation of the genome defining more relaxed parameters, e.g., E-values, or look in databases for genes with the same function that are not part of the GEM.
- a.
  If there exists a potential gene with the same function, add the gene to the GEM with an OR relation in the gene-protein-reaction association.

17.
(FN) Perform a gap-filling with the gene knocked out to identify missing alternative biochemistry in the GEM.
- a.
  Select a proper database to look for alternative biochemistry. We distinguish three classes of databases: (1) GEMs of closely related organisms, which can be found in databases for GEMs like BiGG (King et al., 2015), modelSEED (Devoid et al., 2013), KBase (US Department of Energy Systems Biology Knowledgebase, http://kbase.us), publications, etc.; (2) databases of biological reactions like KEGG (Kyoto University, 1995), MetaCyc (Caspi et al., 2018), BRENDA (Jeske et al., 2019), etc.; (3) the upper bound of biochemistry with hypothetical biochemical reactions between known compounds based on known enzyme reaction rules, i.e., the ATLAS of Biochemistry (Hadadi et al., 2016; Hafner et al., 2020).
  CRITICAL: the compatibility of metabolite identifiers between the GEM and the database plays a critical role in the selection of the database. Metabolite identifiers need to match to ensure the proper connectivity of the metabolic networks of the GEM and the database. It is also important to consider which version of the database to use. We recommend working with the latest version, but this might create conflicts with metabolite identifiers or with other identifiers such as genes. In that case, the user might consider working with an earlier version.
- b.
  Identify a gap-filler that suits the GEM, database, computational power available, and desired gap-filling strategy.
  - ▪
    Multiple gap-fillers exist, as summarized previously (Pan and Reed, 2018). Recent examples include gapseq (Zimmermann et al., 2020) and OptFill (Schroeder and Saha, 2020).
- c.
  If there exists an alternative biochemistry that rescues the knockout, integrate it into the GEM.

18.
(FN) Investigate the possibility of a metabolite in the GEM being scavenged to rescue the KO.
- a.
  If there is evidence that the metabolite selected might be available at the cellular state studied, and the transport of such metabolite is possible (by any transport mechanism), define the transport in the GEM.
19.
(FP) Search for a missing metabolic task downstream of the FP gene.
- a.
  If there is a downstream product that could be a biomass precursor and its definition as a metabolic task does not create inconsistencies, add it to the biomass reaction.
- b.
  If a biomass building block was added, update the stoichiometric coefficients of the remaining biomass building blocks accordingly (Chan et al., 2017).
20.
(FP) Perform a flux variability analysis with a non-zero lower bound for the objective function.
- a.
  If the reactions linked to the gene cannot carry flux, perform a gap-filling (next step).
21.
(FP) Perform a gap-filling with the objective function redefined to require flux through the FP gene.
- a.
  Select a proper database to search for missing biochemistry.
- b.
  Identify a gap-filler that suits the GEM, database, computational power available, and desired gap-filling strategy.
  - ▪
    Multiple gap-fillers exist, as summarized previously (Pan and Reed, 2018). Recent examples include gapseq (Zimmermann et al., 2020) and OptFill (Schroeder and Saha, 2020).
- c.
  If there exist biochemical steps that can connect the metabolic network defined in the GEM with the FP gene, define them in the GEM.
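Steps 15b and 19b both require rescaling the biomass reaction after a building block is removed or added. The following minimal Python sketch shows one possible way to do this. It assumes that the biomass reaction should again account for 1 g of dry weight and that the remaining coefficients are scaled proportionally. The metabolite names, coefficients, and molecular weights are hypothetical.

```python
# Minimal sketch, assuming proportional rescaling to 1 g of biomass per gDW.
# All names and numbers are hypothetical illustration values.

def rescale_biomass(coefficients, molecular_weights, target_mass=1.0):
    """Scale coefficients (mmol/gDW) so that the building blocks sum to
    target_mass (g/gDW). Molecular weights are given in g/mol."""
    total_mass = sum(
        coefficients[met] * molecular_weights[met] / 1000.0  # mmol * g/mol -> g
        for met in coefficients
    )
    if total_mass <= 0:
        raise ValueError("biomass has no mass; nothing to rescale")
    factor = target_mass / total_mass
    return {met: coef * factor for met, coef in coefficients.items()}


coefficients = {"protein_pool": 5.0, "lipid_pool": 0.2, "cofactor_x": 0.01}
molecular_weights = {"protein_pool": 110.0, "lipid_pool": 750.0, "cofactor_x": 400.0}

# Step 15a: remove the building block linked only to false negatives.
remaining = {met: c for met, c in coefficients.items() if met != "cofactor_x"}

# Step 15b: update the remaining coefficients.
rescaled = rescale_biomass(remaining, molecular_weights)
print(rescaled)
```

Proportional scaling preserves the relative composition of the remaining building blocks. Other rescaling schemes are possible when composition data are available.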

Note: The order in which these steps are applied affects the definition of the GEM and, later, the identification of bottlenecks. We recommend following the order of steps defined in this section. The steps to curate FNs and FPs are ordered so that the issues in which the user has more confidence are addressed first. For example, to curate FNs we first check if there is an error in the objective function (a fact). Then we evaluate whether a gene is missing from the GEM (an E-value defines the confidence of the annotation). If no gene is found, we perform a gap-filling (a hypothetical non-annotated biochemical function). If no gap-filling reaction is found, we allow uptake of a metabolite (GEMs show the highest uncertainty in the definition of metabolite transports).
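The order of FN curation described in this note can be written as a cascade that stops at the first corrective measure that applies. The Python sketch below is a hypothetical outline. The four check functions are placeholders that the user would implement with their own GEM tooling, and none of the names come from the PhenoMapping code.

```python
# Minimal sketch of the ordered FN curation (steps 15 to 18).
# The checks are placeholders with hypothetical logic.

def curate_false_negative(gene, checks):
    """Try corrective measures in decreasing order of confidence.

    `checks` is an ordered list of (label, function) pairs. Each function
    takes the gene and returns a proposed correction, or None if the
    measure does not apply.
    """
    for label, check in checks:
        correction = check(gene)
        if correction is not None:
            return label, correction
    return "unresolved", None


fn_checks = [
    ("objective function", lambda gene: None),
    ("missing gene", lambda gene: "add isoenzyme with OR" if gene == "geneA" else None),
    ("gap-filling", lambda gene: "add alternative reaction" if gene == "geneB" else None),
    ("metabolite uptake", lambda gene: None),
]

for gene in ["geneA", "geneB", "geneC"]:
    print(gene, curate_false_negative(gene, fn_checks))
```

Keeping the checks in an ordered list makes the priority explicit and records which level of evidence supported each change to the GEM.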

The PhenoMapping workflow is often iterative

Timing: variable

The user performs as many passes through the PhenoMapping workflow as there are layers of information to analyze (feedback in Figure 2). The layers of information can be analyzed independently or in a cumulative fashion. An independent analysis is recommended for the first PhenoMapping iterations and in a generic GEM. A cumulative analysis is recommended after an independent analysis and in a context-specific GEM. The order of the analysis is hierarchical, as suggested in Figure 3 and argued in the section PhenoMapping study design. In addition, more iterations through the PhenoMapping workflow might be required when a curation of the GEM is selected.

22.
Perform an independent analysis of each layer of context-specific information to identify the individual bottlenecks responsible for phenotypes. In an independent analysis of constraint types, a generic GEM is used to integrate each dataset separately, identify new essential genes, and map gene essentiality to bottlenecks.
23.
Perform also a cumulative and hierarchical integration of constraint types (Figure 3). In such a cumulative analysis, new essentialities might be identified at each integration step and can be mapped to sets of constraints within the last layer of information considered. The independent analysis of constraint types may limit the set of predicted phenotypes. A cellular phenotype is the product of a cumulative rather than individual effect of physico-chemical constraints. Hence, a cumulative analysis is recommended to analyze a context-specific GEM.
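The difference between steps 22 and 23 can be sketched in Python as follows. The function passed as `essential_genes` stands in for an essentiality analysis of the GEM under a given set of constraints. Here it is a toy lookup table, and the layer and gene names are hypothetical.

```python
# Minimal sketch: independent versus cumulative integration of layers.
# The essentiality results below are toy values, not model predictions.

def independent_analysis(layers, essential_genes):
    """Integrate each layer alone into the generic GEM."""
    baseline = essential_genes(frozenset())
    return {
        layer: essential_genes(frozenset([layer])) - baseline for layer in layers
    }


def cumulative_analysis(layers, essential_genes):
    """Integrate layers hierarchically and report what is new at each step."""
    applied = frozenset()
    known = essential_genes(applied)
    new_per_layer = {}
    for layer in layers:
        applied = applied | {layer}
        current = essential_genes(applied)
        new_per_layer[layer] = current - known
        known = known | current
    return new_per_layer


TOY_RESULTS = {
    frozenset(): {"g1"},
    frozenset({"medium"}): {"g1", "g2"},
    frozenset({"metabolomics"}): {"g1"},
    frozenset({"medium", "metabolomics"}): {"g1", "g2", "g3"},
}


def toy_essential_genes(constraints):
    return set(TOY_RESULTS[constraints])


layers = ["medium", "metabolomics"]
independent = independent_analysis(layers, toy_essential_genes)
cumulative = cumulative_analysis(layers, toy_essential_genes)
print(independent)
print(cumulative)
```

In this toy case, gene g3 becomes essential only when both layers are combined, so it is missed by the independent analysis and attributed to the last layer in the cumulative one.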

Expected outcomes

Here, we present an example to aid in the definition of a PhenoMapping analysis and the understanding of its output. We use the example of blood-stage and liver-stage P. berghei to illustrate the iterative workflow of PhenoMapping (Figure 8). The same analysis was performed for tachyzoite T. gondii. The metabolic models of P. berghei (iPbe) and T. gondii (iTgo), the P. berghei blood- and liver-stage relative growth phenotypes, the tachyzoite T. gondii genome-wide screen, and all integrated omics data are available at www.github.com/EPFL-LCSB/phenomapping. The scripts with all input values used for the analyses are also available in the GitHub repository. These files are sufficient for the user to follow along and repeat all computational steps of these examples with PhenoMapping. We present the results obtained for both examples. Additional analyses of the outputs were done as explained in the following section.

Schema of preparatory steps as applied to study blood and liver-stage phenotypes with iPbe and PhenoMapping

The preparatory steps are shown in Figure 1. Color code is consistent with related steps in the main PhenoMapping workflow (Figure 2).

Preparatory steps for a PhenoMapping analysis of genome-wide blood and liver-stage phenotypes in P. berghei

All preparatory steps for the PhenoMapping analysis of iPbe were performed as follows (Figure 8), and the results (models and data) are available at www.github.com/EPFL-LCSB/phenomapping.