Multi-omic integration by machine learning (MIMaL)

Quinn Dickinson; Andreas Aufschnaiter; Martin Ott; Jesse G Meyer

doi:10.1093/bioinformatics/btac631

Bioinformatics. 2022 Sep 15;38(21):4908–4918. doi: 10.1093/bioinformatics/btac631

Editor: Pier Luigi Martelli

PMCID: PMC9801967 PMID: 36106996

Abstract

Motivation

Cells respond to environments by regulating gene expression to exploit resources optimally. Recent advances in technologies allow for measuring the abundances of RNA, proteins, lipids and metabolites. These highly complex datasets reflect the states of the different layers in a biological system. Multi-omics is the integration of these disparate methods and data to gain a clearer picture of the biological state. Multi-omic studies of the proteome and metabolome are becoming more common as mass spectrometry technology continues to be democratized. However, knowledge extraction through the integration of these data remains challenging.

Results

Connections between molecules in different omic layers were discovered through a combination of machine learning and model interpretation. Discovered connections reflected protein control (ProC) over metabolites. Proteins discovered to control citrate were mapped onto known genetic and metabolic networks, revealing that these protein regulators are novel. Further, clustering the magnitudes of ProC over all metabolites enabled the prediction of five gene functions, each of which was validated experimentally. Two uncharacterized genes, YJR120W and YDL157C, were accurately predicted to modulate mitochondrial translation. Functions for three incompletely characterized genes were also predicted and validated, including SDH9, ISC1 and FMP52. A website enables results exploration and also MIMaL analysis of user-supplied multi-omic data.

Availability and implementation

The website for MIMaL is at https://mimal.app. Code for the website is at https://github.com/qdickinson/mimal-website. Code to implement MIMaL is at https://github.com/jessegmeyerlab/MIMaL.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

There are various methods to integrate multi-omic datasets, reviewed in the context of single-cell data in Miao et al. (2021). Multi-omic integration strategies currently are employed within three general disciplines: (i) disease subtyping, especially in the context of cancer heterogeneity, (ii) biomarker discovery and (iii) discovery of biological insights (Subramanian et al., 2020). In the context of biological insights, multi-omics integration has been accomplished using several statistical approaches, such as Bayesian, exemplified by PARADIGM (Vaske et al., 2010) and iClusterPlus (Shen et al., 2009), or correlation-based approaches such as CNAmet (Louhimo and Hautaniemi, 2011). These approaches have uncovered pathways involved in cancer prognosis (Vaske et al., 2010), drug selectivity of cancer lines (Mo et al., 2013) and novel candidate oncogenes (Louhimo and Hautaniemi, 2011). However, most existing multi-omic data integration methods are not able to infer new biological interactions between layers of multi-omic data, and the methods that do look for connections between layers often look at 1:1 connections based on simple linear correlation. Due to complex biological regulation balancing many processes, many interesting connections between omic layers are unlikely to have 1:1 relationships. There is a need for new strategies that leverage the interactions between omics layers to discover non-linear relationships and produce more knowledge than the sum of the two datasets.

Machine learning is a promising approach for discovering relationships between datasets. Machine learning techniques have enabled successful integration of multi-omic datasets (Krassowski et al., 2020). Some examples of this include supervised methods predicting cancer prognosis (Chai, 2021), cellular state in Escherichia coli (Kim et al., 2016), patient survival outcomes for cancer types (Wilson et al., 2019) or patient drug response (Sharifi-Noghabi et al., 2019). Unsupervised methods have also been developed for the discovery of biomarkers (Singh et al., 2019) and the subtyping of cancers (Ronen et al., 2019). Each of these approaches relies on an early, intermediate or late integration strategy, as described in Picard et al. (2021). The integration of multi-omic data through hierarchical prediction between omic layers is relatively unexplored, though at least one previous paper had described the prediction of metabolomic changes from proteomic changes (Zelezniak et al., 2018).

Here, we establish multi-omic integration using a tree-based regression model trained to predict metabolite changes from proteomic changes (Fig. 1A). This allowed us to reveal new connections between proteins and metabolites using SHAP (Lundberg and Lee, 2017), a machine learning model interpretation method. SHAP uses game theory to interpret any model and discover complex relationships between inputs and outputs. A key feature of SHAP is that it interprets each example input separately in comparison to other methods that usually compute feature importance for the whole dataset. New connections between proteins and metabolites inferred from SHAP were experimentally verified to represent the amount of control a protein’s quantity exerts over a given metabolite. Many of these protein–metabolite connections are distant, based on known genetic and metabolic interactions. Finally, summarizing the strength of these protein control (ProC) values across all metabolites reveals new connections between experimental conditions. In the case where conditions are single gene knockouts, this clustering reveals new functions of both characterized and uncharacterized mitochondrial proteins.

Fig. 1. MIMaL workflow, model interpretation and demonstration of biological applicability. (A) MIMaL is a multi-omic integration method utilizing machine learning model interpretation with cluster analysis to uncover unknown relationships between samples. (B) Comparison of the model performance with average mean squared error across five folds from 5-fold cross-validation. ExtraTrees was selected for further analysis due to performance and specialized interpretation algorithms for decision tree-based methods. (C) Performance of ExtraTrees models in predicting fold change in each metabolite from proteomic data, measured by R² between predicted and experimental metabolite values for each held-out test set. (D) Example of true versus predicted quantity of one metabolite, citric acid, with each point representing one sample, i.e. knockout strain under fermentation or respiration conditions. (E) SHAP force plot for mef1Δ under respiration conditions where red and blue bars represent protein quantities that increase or decrease the prediction value of citric acid relative to the baseline, respectively. (F) Quantification of citric acid in strains selected from SHAP analysis. Strains were grown under respiration conditions, metabolites were extracted in methanol, and citric acid quantities were measured using targeted MS/MS. Citrate quantities reflect predictions made from SHAP.

2 Materials and methods

2.1 Yeast protein–metabolite data imputation

A total of 873 proteins were measured in all samples. Missing protein values were imputed using the sklearn (Pedregosa et al., 2011) (v1.1.1) function KNNImputer with setting n_neighbors = 2, resulting in all 3690 protein quantities being used as input for the modeling task. Metabolite data were imputed using the same setting, producing 273 complete metabolite columns.

2.2 Machine learning and model optimization

The data were split into 313 random examples for training and 35 examples for testing. This 90/10 split was chosen arbitrarily to provide over 300 training examples while still holding out 35 test examples. Multiple types of models were first tested by 5-fold cross-validation with the default parameters, and the average mean squared error (MSE) across the five folds was compared (see jupyter notebook on github ‘compare-models.ipynb’). Tested models were implemented in sklearn and included a DummyRegressor baseline, LinearRegression, Lasso, ElasticNet, Ridge, support vector regression wrapped in MultiOutputRegressor, AdaBoost (Schapire, 2013) with 500 estimators wrapped in MultiOutputRegressor, GradientBoostingRegressor with 500 estimators wrapped in MultiOutputRegressor, ExtraTreesRegressor with 500 estimators and RandomForestRegressor with 500 estimators (Pedregosa et al., 2011). All of these models except the dummy, ElasticNet and Lasso performed similarly according to MSE; we selected ExtraTreesRegressor because we wanted the interpretability of a tree model and the training speed of extra trees.
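The cross-validated baseline comparison can be illustrated with a small standard-library sketch. It is not the notebook code: the function names, the seed and the toy values are hypothetical, and the mean predictor stands in for the role a dummy regressor plays as a reference point.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cv_mse_mean_baseline(y, k=5, seed=0):
    """Average MSE over k folds for a model that always predicts the training mean."""
    fold_mse = []
    for fold in k_fold_indices(len(y), k, seed):
        held_out = set(fold)
        # Fit the "model" on the training folds only: it is just their mean.
        train_mean = mean(v for i, v in enumerate(y) if i not in held_out)
        # Score it on the held-out fold.
        fold_mse.append(mean((y[i] - train_mean) ** 2 for i in fold))
    return mean(fold_mse)


# Hypothetical fold changes of one metabolite across ten samples.
y = [0.1, -0.4, 1.2, 0.8, -1.0, 0.3, 0.0, 0.5, -0.2, 0.9]
baseline_mse = cv_mse_mean_baseline(y)
```

Any candidate model whose average fold MSE is not clearly below this baseline has learned little from the proteomic inputs.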

One multi-output regression extra trees model was optimized using 5-fold cross-validation with the 313 training examples by grid search (see jupyter notebook on github ‘extratrees-gridsearch.ipynb’) with the following parameters: ‘max_depth’: [10, 30, 50, 70, None], ‘min_samples_leaf’: [1, 2, 5], ‘min_samples_split’: [2, 5, 10], ‘max_features’: [‘log2’, ‘auto’, ‘sqrt’], ‘n_estimators’: [500, 1000, 1500]. The best model parameters for the polar metabolomics model used all the default parameters except max_depth = 50 and n_estimators = 500. Those parameters were then used to train a single-output extra trees model for each of the 273 polar metabolites. Each trained model was used to make predictions on the 35 examples in the test set, and those true and predicted values were used to compute regression metrics. The r2_score and mean_squared_error functions in sklearn summarized performance across all the metabolites. Note that the sklearn implementation of the coefficient of determination can be negative, which indicates that the model performs worse than a model that always predicts the average quantity of y; the score has no lower bound. That is, the value is not the square of R. See more information on the method’s information page here: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html.
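To make the remark about negative scores concrete, the coefficient of determination can be computed by hand as 1 − SS_res/SS_tot. The sketch below uses only the standard library and hypothetical toy values; it reproduces the definition rather than calling sklearn.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2(y_true, y_true)                    # predictions equal the truth
mean_model = r2(y_true, [2.5, 2.5, 2.5, 2.5])   # always predicts the mean of y
reversed_model = r2(y_true, [4.0, 3.0, 2.0, 1.0])  # perfectly anti-correlated
```

The reversed predictions have a Pearson correlation of −1 with the truth, so the squared correlation would be 1, yet the score defined above is negative because the predictions are further from the truth than the mean is.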

2.3 Yeast protein–metabolite SHAP analysis

SHAP values were calculated for each knockout for each metabolite model using the TreeExplainer method in the python package SHAP (Lundberg and Lee, 2017) (v0.39.0). Only identified metabolites that had a positive R² score comparing the true versus predicted quantity were included in subsequent analysis. This excludes roughly 200 additional unidentified metabolites.

Correlations between each protein quantity across all single knockout samples were calculated using Spearman’s rho and significance was adjusted using Bonferroni correction. For citric acid, the 20 proteins with the highest mean SHAP magnitude (SHAP contributor proteins) were chosen for further analysis. A network was created with citric acid as the central node, linked to each SHAP contributor protein. Each SHAP contributor protein was then linked to each correlated protein, where correlated proteins were defined by a Bonferroni-adjusted P-value < 0.05 and |ρ| > 0.7 from Spearman rank correlation analysis. Enrichment analysis was performed for biological function GO terms (version 2021-07-02) using ClueGO (Bindea et al., 2009) (v2.5.8) on each group of SHAP contributor proteins sharing positive correlations and their positively correlated proteins compared against the set of proteins quantified. Significance for terms was determined by Fisher’s exact test with Benjamini–Hochberg correction for multiple hypothesis testing. The groupings in the figure represent our consolidated interpretation of all terms that were assigned to a group.
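The two selection steps in this paragraph, ranking proteins by mean SHAP magnitude and filtering correlated partners, can be sketched as follows. This is an illustration under assumptions: the function names, the SHAP values, the partner names P1 to P4 and the P-values are hypothetical, and Spearman's rho and its raw P-value are taken as already computed.

```python
def top_k_by_mean_abs(shap_by_protein, k):
    """Return the k proteins with the largest mean absolute SHAP value across samples."""
    scores = {
        protein: sum(abs(v) for v in values) / len(values)
        for protein, values in shap_by_protein.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def significant_pairs(correlations, n_tests, alpha=0.05, min_abs_rho=0.7):
    """Keep pairs with Bonferroni-adjusted P-value < alpha and |rho| > min_abs_rho."""
    kept = []
    for pair, rho, p_value in correlations:
        p_adjusted = min(1.0, p_value * n_tests)  # Bonferroni: multiply by number of tests
        if p_adjusted < alpha and abs(rho) > min_abs_rho:
            kept.append(pair)
    return kept


# Hypothetical SHAP values for three proteins across three samples.
shap_by_protein = {
    "Aat2": [0.5, -0.7, 0.6],
    "Ald5": [0.1, -0.2, 0.1],
    "Idh2": [0.0, 0.05, -0.05],
}
contributors = top_k_by_mean_abs(shap_by_protein, k=2)

# Hypothetical (pair, rho, raw P-value) tuples from a Spearman analysis.
correlations = [
    (("Aat2", "P1"), 0.85, 1e-6),
    (("Aat2", "P2"), 0.75, 0.01),
    (("Ald5", "P3"), -0.90, 1e-5),
    (("Ald5", "P4"), 0.40, 1e-8),
]
network_edges = significant_pairs(correlations, n_tests=1000)
```

In the toy data, the second pair fails the adjusted significance threshold and the fourth fails the |ρ| threshold, so only two edges would be added to the network.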

2.4 Citrate quantification by direct infusion MS/MS

Yeast strains were grown overnight in YPD at 30°C. After growth, OD595 was measured and cells were washed with PBS. YPDG was inoculated to an initial OD595 of 0.01 and grown at 30°C for 24 h. After growth, OD595 was measured and the equivalent of 0.37 OD595 at 1 ml was harvested from each. These cells were pelleted, washed with PBS, pelleted, frozen with LN2 and stored at −80°C. To extract metabolites, each pellet was resuspended in 185 µl 75% methanol, placed at 100°C for 5 min, vortexed for 30 s and cooled on ice. Cell debris was pelleted, and the supernatant was used for citrate quantification.

Mass spectrometry was performed on a Thermo Scientific Exploris 240, using a Thermo Scientific Nanospray Ion Source. One microliter of each extract was directly infused into the mass spectrometer. To quantify citrate, targeted MS/MS was performed, targeting the ion at 191.0192 m/z. The measured intensity of the fragment at 111.008 m/z was integrated across 811 scans to determine the total citrate present in each sample. Data analysis was performed using pyteomics (Goloborodko et al., 2013; Levitsky et al., 2019) (v4.41).

2.5 Clustering metabolite control of knockout yeast to predict gene function

SHAP values of the knockouts were clustered using a combination of Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2020) and Ordering Points To Identify Cluster Structure (OPTICS) (Ankerst et al., 1999) to determine clustering and likely function of unknown mitochondrial genes. For UMAP, the dimensionality of data (n_components) was set at 10, neighbors (n_neighbors) was set to 3, minimum distance (min_dist) was set to 0, and the distance metric (metric) was manhattan. For OPTICS, the minimum samples (min_samples) were set to 2. All other parameters were set to their defaults.

To generate the final clusters and account for the stochasticity of UMAP, UMAP and OPTICS clustering was repeated 1000 times for each metabolite. The clusters generated from each repetition were compared by creating a network with each node representing one of the knockouts and each weighted edge representing twice the number of times the two knockouts clustered together out of the 1000 repetitions.
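The co-clustering network can be sketched as a pair counter over repeated clusterings. The block below is a minimal illustration, not the published code: the knockout names and labels are hypothetical, label −1 is assumed to mark unclustered (noise) samples, and the function returns raw counts, leaving any scaling of the edge weights to a later step.

```python
from collections import Counter
from itertools import combinations


def co_clustering_counts(repetitions, noise_label=-1):
    """Count, for every pair of knockouts, how often they share a cluster.

    `repetitions` is a list of dicts mapping knockout name -> cluster label,
    one dict per clustering run. Samples labeled `noise_label` are skipped.
    """
    counts = Counter()
    for labels in repetitions:
        clusters = {}
        for knockout, label in labels.items():
            if label != noise_label:
                clusters.setdefault(label, []).append(knockout)
        for members in clusters.values():
            # Sort so each pair has one canonical (a, b) key.
            for a, b in combinations(sorted(members), 2):
                counts[(a, b)] += 1
    return counts


# Three hypothetical clustering runs over four knockouts.
repetitions = [
    {"ko_a": 0, "ko_b": 0, "ko_c": 1, "ko_d": -1},
    {"ko_a": 0, "ko_b": 0, "ko_c": 0, "ko_d": -1},
    {"ko_a": 1, "ko_b": 0, "ko_c": 0, "ko_d": -1},
]
edge_counts = co_clustering_counts(repetitions)
```

Because cluster labels are arbitrary from run to run, only shared membership is counted, never the label itself.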

The weighted edges, representing the membership of clusters, were combined across known, non-repeated metabolites with a model performance of R² > 0. To determine a subset of the most relevant connections, a linear regression was calculated between the edge weight and the rank of the edge when sorted in descending order. All edges with a weight that lay above the linear regression (a weight of 8210) were included as the relevant connections. Nodes were clustered in Cytoscape (Shannon et al., 2003) (v3.8.2) using the Markov Cluster Algorithm (https://ir.cwi.nl/pub/4463) as implemented by MCL Cluster in clusterMaker (Morris et al., 2011). Layout of the network was calculated using the Prefuse Force Directed Layout.
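One way to read the regression-based cutoff is sketched below. It is an interpretation, not the published procedure: weights are sorted in descending order, a least-squares line of weight against rank is fitted, and edges are kept from the top until the first edge that falls on or below the line. The function name and the toy weights are hypothetical.

```python
def edges_above_trend(weights):
    """Keep the top-ranked edge weights that lie above a least-squares line of weight vs rank."""
    ordered = sorted(weights, reverse=True)
    n = len(ordered)
    ranks = list(range(1, n + 1))

    # Ordinary least squares for: weight = slope * rank + intercept.
    mean_rank = sum(ranks) / n
    mean_weight = sum(ordered) / n
    s_xy = sum((r - mean_rank) * (w - mean_weight) for r, w in zip(ranks, ordered))
    s_xx = sum((r - mean_rank) ** 2 for r in ranks)
    slope = s_xy / s_xx
    intercept = mean_weight - slope * mean_rank

    kept = []
    for rank, weight in zip(ranks, ordered):
        if weight <= slope * rank + intercept:
            break  # first edge at or below the trend line ends the selection
        kept.append(weight)
    return kept


# Hypothetical long-tailed edge weights.
weights = [4, 100, 30, 1, 65, 8, 15, 2]
relevant = edges_above_trend(weights)
cutoff_weight = relevant[-1] if relevant else None
```

Stopping at the first crossing matters for long-tailed weight distributions, where the weakest edges can also sit above a fitted line and would otherwise be retained.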

2.6 Known connections network

To create the yeast metabolic network, a list of reactions, enzymes, compounds and enzymatic reactions was downloaded from Reactome (Gillespie et al., 2022) (v2.5.5). These datasets were combined to create a metabolic network consisting of all known pathways and their associated enzymes. The following nodes and associated edges were removed from the network due to their ambiguity and relative abundance across reactions: ‘PROTON’, ‘WATER’, ‘ATP’, ‘ADP’, ‘PPI’, ‘Pi’, ‘Protein-L-serine-or-L-threonine’, ‘Protein-Ser-or-Thr-phosphate’, ‘AMP’, ‘NAD’, ‘NADH’, ‘CO-A’, ‘NADP’, ‘NADPH’, ‘CARBON-DIOXIDE’, ‘GLT’, ‘S-ADENOSYLMETHIONINE’, ‘OXYGEN-MOLECULE’, ‘ACETYL-COA’, ‘AMMONIUM’, ‘ADENOSYL-HOMO-CYS’, ‘Nucleoside-Triphosphates’, ‘Peptides-holder’, ‘RNA-Holder’, ‘Cytochromes-C-Oxidized’, ‘Cytochromes-C-Reduced’, ‘GDP’, ‘Ubiquitin-C-Terminal-Glycine’ and ‘General-Protein-Substrates’. Edges between enzymes and compounds were assigned a weight of 3.

A list of all known Saccharomyces cerevisiae positive genetic interactions was downloaded from the Saccharomyces Genome Database (Cherry et al., 2012) (SGD). Every ORF absent from the network, i.e. each ORF whose protein does not catalyze a metabolic reaction, was added as a node, and edges with a weight of 10 were created to link ORF nodes with known positive interactions. Weighted closest distance to citrate was calculated for every node using Dijkstra’s algorithm (Dijkstra, 1959). The closest distance can be summarized as 3 + 6*(metabolic distance) + 10*(positive interaction distance).
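The weighted distance to citrate can be illustrated with a compact Dijkstra implementation on a toy graph. The edge weights of 3 and 10 follow the text; the node names (CIT, ENZ_A, CPD_X, ENZ_B, ORF_C) and the graph itself are hypothetical.

```python
import heapq


def add_edge(graph, a, b, weight):
    """Add an undirected weighted edge to an adjacency-dict graph."""
    graph.setdefault(a, {})[b] = weight
    graph.setdefault(b, {})[a] = weight


def dijkstra(graph, source):
    """Return the shortest weighted distance from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


graph = {}
add_edge(graph, "CIT", "ENZ_A", 3)     # enzyme acting directly on citrate
add_edge(graph, "ENZ_A", "CPD_X", 3)   # that enzyme's other compound
add_edge(graph, "CPD_X", "ENZ_B", 3)   # enzyme one metabolic step away
add_edge(graph, "ENZ_B", "ORF_C", 10)  # positive genetic interaction

distances = dijkstra(graph, "CIT")
```

In this toy graph ENZ_B sits at 3 + 6*1 = 9 and ORF_C at 3 + 6*1 + 10*1 = 19, matching the summary formula: each additional metabolic step costs two enzyme-compound edges.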

2.7 Comparison of clustering to proteome profile correlation

A list of all possible pairwise combinations of the 174 proteins represented by the knockout strains was generated. A set of all known genetic and physical interactions for the 174 genes was downloaded from the SGD. For each pairwise combination, it was determined whether the pair was correlated through proteomic data, connected through clustering analysis, and whether it had known genetic or physical interactions. The overlap of correlations and clustering connections with known interactions was determined and plotted using matplotlib-venn (https://github.com/konstantint/matplotlib-venn).
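The pairwise comparison reduces to set operations on unordered gene pairs. The sketch below uses hypothetical gene names and memberships and only computes the overlap counts that a Venn diagram would display; it does not draw the figure.

```python
from itertools import combinations
from math import comb

# Number of unordered pairs among the 174 knockout genes.
n_pairs_174 = comb(174, 2)

# Toy example with four hypothetical genes.
genes = ["g1", "g2", "g3", "g4"]
all_pairs = {frozenset(pair) for pair in combinations(genes, 2)}

# Hypothetical pair sets; frozenset makes (a, b) and (b, a) the same pair.
correlated = {frozenset(("g1", "g2")), frozenset(("g2", "g3"))}
clustered = {frozenset(("g1", "g2")), frozenset(("g3", "g4"))}
known = {frozenset(("g1", "g2")), frozenset(("g3", "g4")), frozenset(("g1", "g4"))}

overlap = {
    "correlated_and_known": len(correlated & known),
    "clustered_and_known": len(clustered & known),
    "all_three": len(correlated & clustered & known),
    "known_but_missed": len(known - correlated - clustered),
}
```

For 174 genes there are 15 051 candidate pairs, so each of the three evidence types is evaluated over the same fixed universe of pairs.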

2.8 Yeast strains and genetics

All strains used for translation assays were isogenic to S. cerevisiae W303 MAT a {leu2-3,112 trp1-1 can1-100 ura3-1 ade2-1 his3-11,15} obtained from Euroscarf and are listed in Supplementary Table S4. Chromosomal modifications were made by PCR-based amplification of cassettes followed by integration via homologous recombination, according to Janke et al. (2004) and applying lithium acetate transformation according to Gietz and Woods (2002). All plasmids and oligonucleotides used for this approach are listed in Supplementary Table S5. Transformants were validated via growth on selection media and PCR-based confirmation of locus-specific integration (Supplementary Fig. S1).

Strains for the other assays were in BY4743 background for the citrate quantification or BY4741 for the canavanine and hydrogen peroxide assays. All strains were obtained from Horizon Discovery and are listed in Supplementary Table S4.

2.9 Translation assay—media and culturing conditions

Strains were cultivated at 30°C and 170 rpm shaking. Full media (YEP) contained 1% yeast extract (Bacto, BD Biosciences), 2% peptone (Bacto, BD Biosciences) and 2% glucose, 2% galactose or 2% glycerol as carbon source. Synthetic complete (SC) media consisted of 0.17% yeast nitrogen base (Difco, BD Biosciences), 0.5% (NH₄)₂SO₄, 20 mg/l adenine, 20 mg/l uracil, 20 mg/l arginine, 15 mg/l histidine, 30 mg/l leucine, 30 mg/l lysine, 15 mg/l tryptophan, 30 mg/l isoleucine, 20 mg/l methionine, 50 mg/l phenylalanine, 20 mg/l threonine, 20 mg/l tyrosine, 150 mg/l valine and carbon sources as indicated above. All components were separately prepared in distilled water, autoclaved (25 min, 121°C, 210 kPa, except histidine and tryptophan, which were sterile filtered using 0.2 µm filters) and mixed before use. For solid media, 2% agar was admixed.

2.10 In vivo labeling of mitochondrial translation products

[35S]-methionine-based in vivo labeling of mitochondrial translation products was performed according to Carlström et al. (2021) with slight modifications. Cells were grown in SC medium containing galactose as the carbon source (SC-Gal) to mid-logarithmic phase (approximately OD600 = 1.5–2) and washed three times in 5 ml H₂O. Strains were subsequently washed once in 5 ml SC-Gal media without amino acids and a volume corresponding to OD600 = 4 was harvested and resuspended in 1.5 ml SC-Gal media without amino acids. Amino acids were admixed (18 µg of each amino acid, without methionine) and incubated for 10 min at 30°C, 600 rpm shaking. To stop cytosolic translation, cycloheximide was added to a final concentration of 150 µg/ml and incubated for 2.5 min at 30°C, 600 rpm shaking. Three microliters of [35S]-methionine (10 mCi/ml) were added to start the labeling reaction. For pulse-labeling, 200 µl aliquots were harvested after 5, 10 and 15 min, mixed with 50 µl of stop solution (1.85 M NaOH; 1 M β-mercaptoethanol; 20 mM PMSF) and 10 µl of 200 mM cold methionine, and placed on ice. To follow the stability of newly synthesized mitochondrial proteins, 40 µl of 200 mM cold methionine was added to the remaining cell suspensions and incubated at 37°C, 600 rpm (chase). During the chase, 200 µl samples were harvested 30, 60 and 90 min after the addition of cold methionine, mixed with stop solution as described above and placed on ice.

2.11 SDS-PAGE and immunoblotting

Trichloroacetic acid was added to [35S]-methionine-labeled samples to a final concentration of 14%, incubated for 30 min on ice and subsequently centrifuged for 30 min, 20 000 g at 4°C. Supernatants were carefully removed, and pellets were rinsed once in 1 ml 100% acetone. After further centrifugation for 30 min at 20 000 g at 4°C, supernatants were removed and pellets resuspended in 75 µl sample buffer (50 mM Tris-HCl, 2% SDS, 10% glycerol, 0.1% bromophenol blue, 100 mM DTT; adjusted to pH 6.8). Subsequently, samples were incubated for 10 min at 65°C, 1400 rpm shaking. Thirty microliters of the sample were loaded on 16%/0.2% SDS polyacrylamide/bis-acrylamide gels. After separation, proteins were transferred to a nitrocellulose membrane, which was stained with Ponceau S. Protein standard bands (PageRuler™ Plus Prestained Protein Ladder, ThermoFisher) on the nitrocellulose membrane were marked with diluted [35S]-methionine solution and the membranes were used for autoradiography. Detection was performed with a Fujifilm FLA-9000 phosphorimager.

Membranes were subsequently applied for immunoblotting, using Mrp1 (Singh et al., 2020), Mrpl36 (Prestele et al., 2009) and Tom70 (kind gift from Prof. Rapaport, University of Tübingen) specific antibodies, as well as anti-rabbit secondary antibody (Sigma, A0545).

2.12 Drop dilution assay

To monitor cellular growth, yeast strains were cultivated in YEP media containing either glucose or glycerol to mid-logarithmic phase (approximately OD600 1.5–2). Cultures were washed three times in YEP media without carbon source and a volume corresponding to OD600 = 1 was harvested. Samples were resuspended in 1 ml YEP media without carbon source and three serial 1:10 dilutions thereof were created. Three microliters of cell suspensions were spotted on YEP agar plates either containing glucose or glycerol as carbon source. Plates were incubated for 2 days at 30°C and photographed with a VWR GenoPlex system.

2.13 Canavanine drop dilution

Cultures were grown for 18 h in 1 ml YPD for BY4741 or YPD + G418 for the knockout strains. Cultures were centrifuged at 3000 rcf for 3 min and pellets were resuspended in 3 ml YPG. After 24 h, the cultures were pelleted, washed with SC -Arg +glycerol and adjusted with SC -Arg +glycerol to an OD660 of 0.1. They were then plated onto SC -Arg +glycerol plates or SC -Arg +glycerol plates containing canavanine at 0.25 μg/ml, at dilutions of 1, 1:10, 1:100, 1:1000, 1:2000, 1:4000, 1:8000 and 1:16 000. Plates were incubated at 30°C and pictures were taken after 1 week and again at 18 days. Images of colony formation were captured using ImageLab software with a Bio-Rad GelDoc.

2.14 Canavanine viability

Cultures were grown for 18 h in 1 ml YPD or YPD + G418. Cultures were centrifuged at 3000 rcf for 3 min and pellets were resuspended in 3 ml YPG. After 24 h, the cultures were pelleted, washed with SC -Arg +glycerol and adjusted with SC -Arg +glycerol to an OD660 of 0.2. One hundred microliters were adjusted with SC -Arg +glycerol to an OD660 of 0.1 and plated onto YPD plates with dilutions of 1, 1:10 and 1:100 and refrigerated at 3°C for 72 h. The remaining culture was adjusted to an OD660 of 0.1 with SC -Arg +glycerol +1200 μg/ml canavanine (final concentration 600 μg/ml) and incubated with shaking at 30°C for 72 h. Cultures were centrifuged, washed and adjusted to an OD660 of 0.1 with SC -Arg +glycerol. Cultures were then plated onto the previously refrigerated YPD plates at dilutions of 1, 1:10 and 1:100. Plates were incubated at 30°C for 18 h. Images of colony formation were captured using ImageLab software with a Bio-Rad GelDoc.

2.15 Hydrogen peroxide viability

Cultures were grown for 18 h in 2 ml YPD or YPD + G418. One milliliter of each culture was centrifuged at 3000 rcf for 3 min and pellets were resuspended in 3 ml YPG and incubated for 24 h at 30°C. To the remaining preculture, 2 ml YPD was added and incubated at 30°C for 5 h. For each set of cultures after incubation, the cultures were pelleted, washed with YPD or YPG and adjusted with YPD or YPG to an OD660 of 0.2. For fermentation, 100 μl of each culture was added to 100 µl YPD or YPD + 128 mM hydrogen peroxide. For respiration, 100 µl of each culture was added to 100 µl YPG or YPG + 1024 mM hydrogen peroxide. Cultures were exposed to hydrogen peroxide for 30 min. After treatment, cells were plated onto YPD plates at dilutions of 1, 1:10, 1:100 and 1:1000. Plates were incubated for 18 h at 30°C. Images of colony formation were captured using ImageLab software with a Bio-Rad GelDoc.

2.16 Quantification of images

To quantify the growth of the drop dilution assays, images were exported in the TIF format at a DPI of 600. ImageJ (Schneider et al., 2012) (v1.53k) was used to measure the brightness (R+G+B)/3 of circles with an area of 0.015 in². Six circles were used as a background and circles measuring the drops were centered. Circles were drawn after each measurement to mark each location (Supplementary Data). To calculate growth ratios, the average background measurement was subtracted from each brightness measurement. Then, the experimental brightness was divided by the control brightness for each strain to calculate a ratio of growth. Average ratios were plotted using seaborn (Waskom, 2021) and differences between strains were compared using ANOVA and Tukey’s post hoc test (Tukey, 1949).
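The growth-ratio arithmetic described above can be written out in a few lines. The brightness readings below are hypothetical and the function name is illustrative; the calculation follows the stated order of background subtraction followed by division of treated by control brightness.

```python
def growth_ratio(treated_brightness, control_brightness, background_readings):
    """Background-subtract both measurements, then divide treated by control."""
    background = sum(background_readings) / len(background_readings)
    return (treated_brightness - background) / (control_brightness - background)


# Six hypothetical background circles and one treated/control drop pair.
background_readings = [10, 12, 11, 9, 10, 8]
ratio = growth_ratio(
    treated_brightness=40,
    control_brightness=70,
    background_readings=background_readings,
)
```

A ratio near 1 means the treatment had little effect on growth for that strain, while values near 0 indicate strong inhibition.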

2.17 Hydrogen peroxide zone of inhibition

Cultures were grown for 18 h in 1 ml YPD or YPD + G418. One milliliter of each culture was centrifuged at 3000 rcf for 3 min and pellets were resuspended in 3 ml YPG and incubated for 24 h at 30°C. The OD660 was adjusted to 1 for each culture. One milliliter of culture was plated onto 25 ml YPG plates and allowed to dry. To create the hydrogen peroxide gradient, a central section of each plate was removed using a 1 ml pipette tip. One hundred microliters of 3% hydrogen peroxide were added to the central hole and allowed to diffuse. Plates were incubated for 1 week at 30°C. Images of lawn formation were captured using ImageLab software with a Bio-Rad GelDoc.

2.18 Seahorse assay

To prepare the seahorse plate, 50 µl of poly-L-lysine (0.1 mg/ml) was added to each well and allowed to sit for 2 h. The solution was aspirated and washed with 100 µl sterile water. The coated plate was stored at 3°C until ready for the assay. On the day of the assay, the plate was brought to room temperature and 80 µl of seahorse media was added to each well. An additional 100 µl of seahorse media were added to wells acting as baselines. Injections were prepared to have a final concentration of 5 mM ethanol or succinate, 1 µM FCCP, 1 µM rotenone and 1 µM antimycin A.

To prepare cells for the seahorse assay, cells were grown overnight in 1 ml YPD. After growth to the stationary phase, cells were pelleted and resuspended in 4 ml YPG. Cells were grown for 25 h. Cells were pelleted and resuspended in seahorse media (6.6 g/l YNB + (NH₄)₂SO₄) to a final OD660 of 0.38. Each sample was diluted an additional 1:5 in seahorse media and 100 µl of culture were placed into each well of the prepared seahorse plate. The plate was centrifuged at 250 rcf for 3 min and incubated at 30°C for 30 min.

Plates were measured on a Seahorse XF-96. A total of 18 measurements of the oxygen consumption rate (OCR) and extracellular acidification rate (ECAR) were taken over 96 minutes, with 10 technical replicates for each strain. Six initial measurements were taken as a baseline, six measurements were taken after the injection of succinate, three measurements after the injection of FCCP and three final measurements after the injection of rotenone/antimycin A. Data collected were analyzed using Agilent Wave (v2.6.3.5) and pandas (Reback et al., 2022) (v1.4.3) and plotted using seaborn (Waskom, 2021) (v0.11.2) and matplotlib (Hunter, 2007) (v3.5.1).

2.19 Web resource

A website was created using the python package Dash (1.19.0). The site is hosted at https://mimal.app.

The first page ‘Correlation’ allows for the plotting of correlations between arbitrary combinations of proteins, metabolites, or SHAP values of a protein’s control over a specific metabolite.

The second page ‘SHAP Summary’ allows viewing the SHAP summary plot for a metabolite showing the most important proteins that predict that metabolite. This page also shows the true versus predicted metabolite values from the test set to assess that model’s performance. This should be viewed to assess whether the model for that metabolite is worth interpreting; generally, an R² score over 0.5 is preferred but this cutoff is arbitrary. This page also shows the plot of the mean absolute SHAP versus the Spearman’s rho for each of the proteins, which we call a ‘horseshoe plot’ because usually high mean absolute SHAP correlates with a high Spearman correlation, producing the shape of a horseshoe. However, in some cases, there are high values of SHAP that are not also high values of correlation with the metabolite. These instances point to the benefit of SHAP interpretation over simple correlation. This page also shows a SHAP force plot at the bottom for the selected condition. This shows how the various proteins contributed to this individual condition.

Third, the page ‘Network’ enables viewing the complete network of connections between the knockouts shown in Figure 3. Clicking on a node in the main network will show the immediate connections in the sub-network panel.

Fig. 3. MIMaL clustering, interpretation and validation. (A) Overview of the method to find connections between conditions using dimensionality reduction, clustering and network analysis. SHAP values were calculated for all proteins across all knockouts. UMAP was used to reduce dimensionality to 10 dimensions; the first two are displayed graphically. UMAP dimensions were clustered with OPTICS. UMAP and OPTICS were repeated 1000 times for each metabolite. (B) A graph was constructed where each edge is linearly proportional to the count of co-clustering across the 69 000 clustering repetitions and with a minimum cutoff for including edges. (C) Autoradiographic image of a gel assessing mitochondrial translation in wild-type and *ydl157c*Δ cells. Cells were treated with cycloheximide and labeled with 35S-methionine for 15 min (pulse) at 30°C. The labeling was stopped by adding excess cold methionine and the temperature was increased to 37°C to induce protein destabilization (chase for a total of 90 min). (D) Resistance to canavanine stress. Strains were grown on SC media minus arg +2.5 µg/ml canavanine for 18 days. *ISC1* and *SDH9* knockout strains were connected to *pil1*Δ, which was previously shown to resist canavanine stress like *can1*Δ (positive control). Both *isc1*Δ and *sdh9*Δ showed resistance to canavanine compared to wild-type. (E) Oxygen consumption in response to succinate as the sole carbon source measured by seahorse respirometry. Responses to succinate were significantly different (P-value = 0.001, Tukey’s HSD) between *SDH1* and *SDH9* knockouts. (F, G) The strains *fmp40*Δ and *fmp52*Δ were tested for resistance to hydrogen peroxide stress under respiration (F) and fermentation (G) conditions and compared using image analysis of drop dilution assays (Supplementary Fig. S4C). Differences between all strains were significant (P-value = 0.001, Tukey’s HSD) under fermentation conditions, and significant (P-value = 0.001, Tukey’s HSD) between wild-type and the others under respiration conditions. Drop dilution image colors are inverted to enhance visibility.

Finally, the website includes a page that allows users to upload their own data from one omic layer and train a model to predict one output from another omic layer. This page shows the prediction performance on the test set as a scatterplot, and the relations between the conditions as a clustered UMAP network.

3 Results

Data were obtained from a previous multi-omic study in yeast (Stefely et al., 2016) consisting of the proteome and metabolome of wild-type or one of 174 single gene knockout yeast strains grown under fermentation and respiration conditions, for a total of 348 multi-omic profiles after computing change relative to wild-type controls. In total, the overall dataset consisted of 3690 proteins and 273 metabolites. After imputation, data were split into training (n = 313) and test (n = 35) datasets (see Section 2). Multiple different models for each metabolite were explored (Fig. 1B), and their performance was determined by mean squared error and R² between test data model predictions and true values. The Extra Trees model was chosen as it had among the best average performance across metabolites (Fig. 1B) and decision tree-based models have specialized model interpretation methods. Positive R² scores between true and predicted quantities of metabolites in the test set were observed for nearly all identified metabolites (Fig. 1C). We include all the metabolite predictions here to show that the method does not always work for every metabolite. Model prediction performance should be checked on the test set before proceeding because interpreting a model that has not learned anything is unlikely to yield biological insight.
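The check on test-set performance described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used in the study; the function names and the numeric values are hypothetical, and in practice a library such as scikit-learn would normally supply these metrics.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical test-set values for one metabolite (illustrative numbers only).
y_true = [0.5, -1.2, 0.3, 2.0, -0.7]
y_pred = [0.4, -0.9, 0.1, 1.6, -0.5]

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MSE = {mse:.3f}, R2 = {r2:.3f}")
```

A model that always predicts the mean of the test values gives R² = 0, so a positive R² indicates that the model has learned at least some signal for that metabolite.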

3.1 Model interpretation values as ProC

To determine the learned relationships between the proteome and metabolites, TreeSHAP was used to calculate the contribution of each protein input to the predicted level of each of the metabolites across the entire dataset. One well-predicted metabolite, citric acid (R² = 0.695), was chosen as an example (Fig. 1D and E). We chose citrate because it is extensively studied as part of the TCA cycle, and as an abundant and ionizable metabolite, we knew we could easily measure perturbations to citrate’s quantity by mass spectrometry. The proteins with the greatest SHAP value magnitude for mef1Δ under respiration were Aat2 (25.46% of total magnitude), Ald5 (4.19%), and Idh2 (3.96%) (Fig. 1F). This suggests that in the MEF1 knockout, the resulting changes in these three proteins are driving the difference in citrate, not the absence of Mef1 protein. Unlike previous works that directly measure metabolite–protein interactions (Hicks et al., 2021), we cannot infer the nature of the interaction between citrate and these proteins. We asked whether these connections reflect metabolic control by proteins by quantifying citrate in single gene knockout strains. We chose Aat2 and Ald5 proteins for this follow-up experiment because they were the most important for explaining this gene knockout, and they were among the most important for explaining citrate across all knockouts in the test set. Citrate production in AAT2 and ALD5 homozygous deletion mutants was compared to the BY4743 wild-type and a MEF1 deletion mutant (Fig. 1F), and significantly different citrate abundance was seen between wild-type and aat2Δ (Student’s t-test P-value = 7.22E−4) and between wild-type and ald5Δ (Student’s t-test P-value = 1.53E−3), matching the relationships predicted by the SHAP values. This result suggests that SHAP values from model interpretation may reveal ProC over a metabolite to a greater degree than correlations (Supplementary Fig. S2).
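The "percent of total magnitude" figures quoted above can be read as each protein's share of the summed absolute SHAP values for one condition. The sketch below shows that arithmetic under this assumption; the protein names and SHAP values are hypothetical placeholders, and the function is illustrative rather than taken from the MIMaL code.

```python
def magnitude_shares(shap_row):
    """Return each feature's share (%) of the summed absolute SHAP values."""
    total = sum(abs(value) for value in shap_row.values())
    if total == 0:
        # No feature contributes; avoid dividing by zero.
        return {name: 0.0 for name in shap_row}
    return {name: 100.0 * abs(value) / total for name, value in shap_row.items()}


# Hypothetical SHAP values for a single knockout condition.
shap_row = {"ProteinA": -0.50, "ProteinB": 0.30, "ProteinC": 0.15, "ProteinD": -0.05}

shares = magnitude_shares(shap_row)
ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
for name, share in ranked:
    print(f"{name}: {share:.2f}% of total magnitude")
```

The sign of a SHAP value is dropped here on purpose: a protein that pushes the prediction strongly downward counts as much as one that pushes it strongly upward.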

3.2 ProC values reveal new inter-omic connections

To further explore the relationship between proteins with the highest average ProC over citrate, GO term enrichment was performed (Supplementary Fig. S3). This analysis revealed several functional pathways that predict citrate (Supplementary Table S1A–D) related to the TCA cycle, stress responses, and respiration, providing further validation of these new protein connections to citrate discovered by MIMaL. This may also reflect the logic of the machine learning algorithm and SHAP, choosing proteins most reflective of these functional pathways and their correlated proteins.

Given that our approach discovers hundreds of new connections between proteins and metabolites, we asked whether these connections are largely new or known. To determine this, we used the top 10 proteins with the greatest overall average magnitude of SHAP values. These top discovered connections for citrate (Fig. 2A) were mapped onto known positive genetic and metabolic interaction networks (Fig. 2B). AAT2, IDH1, IDH2 and ALD5 were close to citrate, being either one metabolic step, or one positive genetic interaction distance from an enzyme that acts directly on citrate. The remaining connections were more distant, representing new protein connections to citric acid. Notably, OAC1, BAT1, YPK1 and PHO81 all lay at the median or above in calculated distance across all proteins and metabolites (Fig. 2C).
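Selecting the top proteins by overall average SHAP magnitude amounts to averaging the absolute SHAP value of each protein over all conditions and sorting. A minimal sketch follows, assuming the SHAP values are held as a list of per-condition rows; the data layout, names and numbers are assumptions made for illustration.

```python
def top_features_by_mean_abs(shap_matrix, feature_names, k=10):
    """Rank features by mean absolute SHAP value across all rows (conditions)."""
    n_rows = len(shap_matrix)
    ranked = []
    for column, name in enumerate(feature_names):
        mean_abs = sum(abs(row[column]) for row in shap_matrix) / n_rows
        ranked.append((name, mean_abs))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical SHAP matrix: 3 conditions (rows) x 4 proteins (columns).
feature_names = ["P1", "P2", "P3", "P4"]
shap_matrix = [
    [0.2, -0.1, 0.00, 0.05],
    [-0.4, 0.1, 0.02, 0.00],
    [0.3, -0.3, 0.01, 0.10],
]

top_two = top_features_by_mean_abs(shap_matrix, feature_names, k=2)
print(top_two)
```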

Fig. 2. — MIMaL reveals new connections between proteins and metabolites. (A) Top 10 SHAP values for citric acid across all conditions, sorted by mean magnitude SHAP by each protein. (B) A network consisting of the metabolic pathways present in *S. cerevisiae* was constructed from data obtained from BioCyc. Connections between proteins through metabolites have a weight of six. Positive genetic interactions among all ORFs in yeast were downloaded from SGD and added to the network with a weight of 10. Distance to citrate was calculated using the Dijkstra algorithm, and can be represented by 3+(# genetic interactions)*10 + (# metabolic reactions)*6. (C) The overall distribution of distances was plotted as a histogram. The network was organized by distance to citrate and the proteins representing the top 10 SHAP values for citric acid prediction were highlighted, along with their paths to citric acid. (D) A representation of the total number of nodes and connections between each category.
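The weighted-distance calculation described in the caption can be illustrated with a small Dijkstra implementation. This is a sketch on a toy network with hypothetical node names, using only the two edge weights stated in the caption (6 for metabolic steps, 10 for positive genetic interactions); the constant offset in the caption's formula is not modeled here.

```python
import heapq

METABOLIC_WEIGHT = 6   # edge weight for a metabolic connection (from the caption)
GENETIC_WEIGHT = 10    # edge weight for a positive genetic interaction (from the caption)


def build_graph(edges):
    """Build an undirected adjacency list from (node_a, node_b, kind) triples."""
    graph = {}
    for a, b, kind in edges:
        weight = METABOLIC_WEIGHT if kind == "metabolic" else GENETIC_WEIGHT
        graph.setdefault(a, []).append((b, weight))
        graph.setdefault(b, []).append((a, weight))
    return graph


def dijkstra(graph, source):
    """Shortest weighted distance from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Toy network; all node names other than "citrate" are hypothetical.
edges = [
    ("citrate", "EnzymeA", "metabolic"),
    ("EnzymeA", "GeneB", "genetic"),
    ("GeneB", "GeneC", "genetic"),
    ("citrate", "EnzymeD", "metabolic"),
    ("EnzymeD", "GeneC", "metabolic"),
]
distances = dijkstra(build_graph(edges), "citrate")
print(distances)
```

In this toy example GeneC is reached through two metabolic steps (distance 12) rather than through the longer path that uses two genetic interactions.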

3.3 Summaries of ProC uncover gene function

Because the data used here are from single gene knockouts including uncharacterized genes, we wondered if we could use the similarity of ProC profiles to predict gene function. We used dimension reduction and clustering of ProC profiles for each metabolite to discover relationships between conditions (see Fig. 3 legend and Section 2 for details).
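The network-building step, in which repeated clustering runs are summarized as co-clustering counts with a minimum cutoff, might look roughly like the sketch below. It assumes each run is available as a mapping from condition to cluster label and that a label of -1 marks unclustered (noise) points, as scikit-learn's OPTICS does; the condition names, the cutoff value and the function itself are illustrative assumptions, not the published implementation.

```python
from collections import Counter
from itertools import combinations


def coclustering_edges(label_runs, min_count=2, noise_label=-1):
    """Count how often each pair of conditions shares a cluster across runs.

    label_runs: list of dicts mapping condition name -> cluster label.
    Returns {(cond_a, cond_b): count} for pairs with count >= min_count.
    """
    counts = Counter()
    for labels in label_runs:
        clusters = {}
        for condition, label in labels.items():
            if label == noise_label:
                continue  # noise points do not co-cluster with anything
            clusters.setdefault(label, []).append(condition)
        for members in clusters.values():
            for a, b in combinations(sorted(members), 2):
                counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Three hypothetical clustering repetitions over four knockout conditions.
label_runs = [
    {"ko1": 0, "ko2": 0, "ko3": 1, "ko4": -1},
    {"ko1": 0, "ko2": 0, "ko3": 0, "ko4": 1},
    {"ko1": 1, "ko2": 1, "ko3": -1, "ko4": -1},
]
edges = coclustering_edges(label_runs, min_count=2)
print(edges)
```

Cluster labels are only compared within a run, never across runs, because label numbers from independent repetitions are arbitrary.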

3.3.1 YDL157C and YJR120W regulate mitochondrial translation

YDL157C and YJR120W are two genes of unknown function associated with the mitochondria. Clustering of knockouts across metabolites (Fig. 3A and B) revealed that these two knockouts frequently cluster with gene knockout strains related to mitochondrial translation (Supplementary Table S2). In vivo pulse-chase radiolabeling of mitochondrial translation in wild-type, ydl157cΔ and yjr120wΔ cells revealed changes in mitochondrial translation (Fig. 3C, Supplementary Fig. S4A and B).

ydl157cΔ had a global reduction of mitochondrial translation, and the absence of YJR120W resulted in a dysregulation of translation. In yjr120wΔ, Var1, Cox2, Cox3 and Atp6 are downregulated, with more pronounced downregulation seen for Cox3 and Atp6. Cytb, however, was upregulated. This alteration in translation might reflect previously suggested interactions of YJR120W. YJR120W is upstream of ATP2 on the yeast chromosome, and the deletion of YJR120W was previously noted to alter ATP2’s expression (Stefely et al., 2016). Atp2 is a part of the F1 sector of the F1Fo ATP synthase, which regulates the mitochondrial translation of ATP6 and ATP8 (Rak and Tzagoloff, 2009). In line with these observations, the deletion of YDL157C severely impaired respiratory growth, while the effect of the deletion of YJR120W was less pronounced (Supplementary Fig. S4C). We propose naming YJR120W and YDL157C as ‘Determines Mitochondrial prOteome’ or DMO1 and DMO2, respectively.

Although the connections between translation and YDL157C and YJR120W were not discovered in the original paper that reported these data, closer inspection of the correlation between proteome profiles resulting from gene knockouts may have revealed this relationship. We wondered if our summary strategy of ProC values could reveal new gene connections that would not be apparent from an omic profile similarity alone.

3.3.2 YJL045W and ISC1 are involved with eisosomal function

To further test the relationships predicted by the clustering network, three additional clusters were analyzed for their connections to incompletely characterized genes. The first of these clusters included YJL045W, now annotated as SDH9 as it is a paralog of SDH1 (Byrne and Wolfe, 2005; Singh et al., 2020). Unexpectedly, sdh9Δ was found to lack direct connections to sdh1Δ under respiration conditions in the final trimmed network but rather had the greatest connection to Pil1, a key protein in eisosomal structure (Moreira et al., 2009). The eisosome is a membrane structure involved in membrane transport. One transporter associated with the eisosome is Can1, an arginine transporter whose deletion confers resistance to the toxic, non-proteinogenic amino acid canavanine (Larimer et al., 1978). Disruption of the eisosome through deletion of PIL1 has also been shown to provide resistance to canavanine (Spira et al., 2012). To test the connection between SDH9 and the eisosome, the growth of deletion strains of SDH9, SDH1, CAN1, PIL1, and another connection to PIL1, ISC1, was tested on SC media without arginine + canavanine. All tested strains, other than sdh1Δ, which had a growth defect on SC -arg (Supplementary Fig. S5A), were shown to grow in the presence of canavanine better than wild-type (Fig. 3D). Additionally, all strains but pil1Δ showed significantly higher viability when exposed to very high concentrations of canavanine over 72 h (Supplementary Fig. S5B). However, as sdh1Δ showed a growth defect on SC -arg, the link between SDH1 and eisosomal function remains ambiguous.

We wondered why SDH9, a gene annotated to function in complex II, would confer resistance to canavanine. To test the link between SDH1 and SDH9, respiratory responses were quantified; we used succinate as a source of electrons to complex II and sdh9Δ showed a response more similar to wild-type than sdh1Δ. OCR spiked in sdh9Δ when exposed to succinate, while this was not observed in sdh1Δ (Fig. 3E). The different responses to succinate demonstrate the distinctiveness of the two succinate dehydrogenases and suggest unique functions for each.

Also of note is the resistance of isc1Δ to canavanine. Isc1 is an enzyme involved in sphingolipid hydrolysis to ceramides (Sawai et al., 2000) and is activated by cardiolipin (Vaena de Avalos et al., 2005). Proteins involved in cardiolipin biosynthesis are significantly enriched in the cluster containing isc1Δ and pil1Δ (Supplementary Table S1E). This supports an interplay between cardiolipin, ceramides and the eisosome as suggested in the literature (Vaena de Avalos et al., 2005; Walther et al., 2007) that could be explored further in future studies.

3.3.3 FMP52 is linked to the oxidative stress response

The final two clusters analyzed include another uncharacterized gene in both respiration and fermentation conditions, FMP52. fmp52Δ was found to have the greatest connection weight to fmp40Δ. Fmp40 is an AMPylator involved in the oxidative stress response (Sreelatha et al., 2018). In addition, Fmp52 had the second greatest connection weight to Aim25, a protein of unknown function involved in the oxidative stress response (Jose et al., 2016). Based on these connections, it seemed likely that fmp52Δ would have an altered response to oxidative stress and therefore show a difference in resistance to oxidative stressors, such as hydrogen peroxide. To test this hypothesis, cells under respiration and fermentation conditions were exposed to hydrogen peroxide and their viability was determined after 30 min (Fig. 3F and G, Supplementary Fig. S6A). The resistance to hydrogen peroxide was significantly higher in both FMP40 and FMP52 deletion strains compared to WT controls. Under fermentation conditions, there was a significant difference between the resistance of fmp40Δ and fmp52Δ, while under respiration conditions there was no significant difference. This coincides with the weight of the connections between FMP40 and FMP52 in the network; the weight of the edge connecting them is substantially larger in the respiration cluster. As a separate test, fmp40Δ and fmp52Δ were grown under respiration conditions in a zone-of-inhibition assay with hydrogen peroxide. A similar result was found, with both the fmp40Δ and fmp52Δ lawns growing closer to the source of hydrogen peroxide (Supplementary Fig. S6B).

3.3.4 Comparison of known interactions

To compare the performance of this clustering method with proteomic correlations, we looked at the representation of known genetic and physical interactions among the top selected connections (Supplementary Table S3) from the clustering analysis and the correlations between proteomes of knockout strains. As an example, of the 873 known genetic and physical interactions between the genes represented by the knockout strains under fermentation conditions, 45 were uniquely represented across all proteomic correlations, 31 shared by correlations and clustering, and 85 uniquely represented by clustering analysis (Supplementary Fig. S7).
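The comparison above is essentially a set-overlap calculation: known interactions recovered only by proteome correlations, only by clustering, or by both. A minimal sketch follows, assuming interactions are treated as unordered gene pairs; the gene names and pair lists are hypothetical and do not reproduce the counts reported in the text.

```python
def undirected(pairs):
    """Represent each interaction as an unordered pair so (A, B) equals (B, A)."""
    return {frozenset(pair) for pair in pairs}


def compare_recovered(known, from_correlation, from_clustering):
    """Split recovered known interactions into method-specific and shared sets."""
    corr_hits = known & from_correlation
    clus_hits = known & from_clustering
    return {
        "correlation_only": corr_hits - clus_hits,
        "shared": corr_hits & clus_hits,
        "clustering_only": clus_hits - corr_hits,
    }


# Hypothetical gene pairs for illustration.
known = undirected([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")])
from_correlation = undirected([("B", "A"), ("B", "C"), ("X", "Y")])
from_clustering = undirected([("C", "B"), ("D", "C"), ("A", "E")])

result = compare_recovered(known, from_correlation, from_clustering)
for category, pairs in result.items():
    print(category, len(pairs))
```

Pairs proposed by a method but absent from the known set (such as X–Y here) are ignored in this tally, since only recovery of known interactions is being compared.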

3.4 MIMaL website

To enable exploration of the results and data presented here, a website is available at https://mimal.app/. This site has four pages in addition to the landing page (see Section 2 for a detailed description of the pages). The model performance for any metabolite can easily be checked with a scatterplot on the ‘SHAP Summary’ page; in this case, we show that biotin is well predicted, with an R² score of 0.862 between true and predicted quantities (Supplementary Fig. S8A). The 20 most important proteins for predicting biotin are shown in a SHAP summary plot, which showed that RKI1 is the most important regulator of biotin’s quantity (Supplementary Fig. S8B). Comparing each protein’s correlation with biotin against the average SHAP for that protein showed that some proteins are important for model interpretation (high y value) but have a low correlation with biotin (x value near 0, Supplementary Fig. S8C). One such example is the Dur12 protein. The ‘correlation’ tab allows inspection of the correlation between Dur12 and biotin, which is poor. The SHAP value of Dur12’s control over biotin correlates better with the quantity of biotin, although some interesting patterns in the data are apparent (Supplementary Fig. S8D).
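The comparison of correlation against SHAP importance can be sketched as a simple filter that flags proteins whose abundance correlates poorly with the metabolite but whose SHAP values are large. The code below is an illustration only: it assumes importance is summarized as the mean absolute SHAP value, and the thresholds, protein names and data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def important_but_uncorrelated(abundance, shap_values, metabolite,
                               max_abs_r=0.3, min_mean_abs_shap=0.1):
    """Flag proteins with low |correlation| to the metabolite but high mean |SHAP|."""
    flagged = []
    for protein, values in abundance.items():
        r = pearson(values, metabolite)
        mean_abs = sum(abs(v) for v in shap_values[protein]) / len(shap_values[protein])
        if abs(r) <= max_abs_r and mean_abs >= min_mean_abs_shap:
            flagged.append((protein, r, mean_abs))
    return flagged


# Hypothetical measurements across five conditions.
metabolite = [1, 2, 3, 4, 5]
abundance = {
    "ProteinA": [1, 2, 3, 4, 5],    # strongly correlated with the metabolite
    "ProteinB": [2, -1, 0, -1, 2],  # uncorrelated (non-linear pattern)
    "ProteinC": [0, 1, 0, 1, 0],    # uncorrelated and unimportant
}
shap_values = {
    "ProteinA": [0.3, 0.3, 0.3, 0.3, 0.3],
    "ProteinB": [0.5, -0.4, 0.1, -0.4, 0.5],
    "ProteinC": [0.01, 0.01, 0.01, 0.01, 0.01],
}

flagged = important_but_uncorrelated(abundance, shap_values, metabolite)
print([name for name, _, _ in flagged])
```

Proteins flagged this way are candidates for non-linear or context-dependent relationships that a linear correlation would miss.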

The ‘Network’ tab enables exploration of the network relationships between conditions shown in Figure 3 to generate additional hypotheses that readers may want to explore. For example, if interested in the endoplasmic reticulum membrane complex, we can zoom in on those points and see that they are connected to several uncharacterized proteins (Fmp10, Fmp16 and Fmp27). They are also connected to some characterized proteins, including Dic1 and Mpc2, which are both transporters of metabolites containing carboxylic acids. Testable hypotheses that investigators may derive from these data include that ER transmembrane complex proteins may also be involved in mitochondrial protein import, or that these carboxylic acid transporters are important for protein folding in the ER.

The website also enables MIMaL analysis of arbitrary multi-omic datasets uploaded by the user. The input should be multiple molecule measurements from one omic layer and the output should be a single molecule measurement from a different omic layer in the same samples. The site will train a model and report the performance in the form of the true versus predicted quantities for the output molecule. It will also show the clustered UMAP of the similarity between input conditions based on the SHAP values. The most important consideration when using this feature is that the number of samples should be large, probably at least 100. Second, after training the model, the performance on the test set (true versus predicted quantities) should be inspected before using the model interpretation results.

4 Discussion

Data presented here demonstrate that machine learning can effectively predict one layer of omic data from another layer of omic data. The predictions work well for most metabolites in this dataset, but when models cannot learn to predict a metabolite’s quantity from the protein quantities, this could be due to a lack of biological signal in the proteins that relates to the metabolite or due to poor quality measurements of either the metabolite or the proteins that should control that metabolite.

However, the focus of this work is not on predicting metabolites from proteins, but rather how SHAP model interpretation values can reflect true biological relationships that represent ProC over a metabolite. As one example case, we validated two proteins predicted to control citrate that are not directly involved in producing or consuming citrate based on known metabolism pathways. More generally, we found that pathway enrichment analysis of proteins that control citrate reveals expected and new pathways that regulate citrate. Network analysis for all discovered proteins that interact with citrate revealed that most discovered connections are distant based on known genetic and metabolic interactions.

Although a previous paper also predicted metabolite quantities from protein quantities (Zelezniak et al., 2018), the focus of their paper was not the interpretation of machine learning models to find new connections between proteins and metabolites. Rather, this important prior work was largely about exploring the cellular metabolic states induced by kinase knockouts, the nature of yeast metabolic networks and the assertion that the proteome can, in fact, predict the metabolome. They did perform model interpretation, but this was presented as a method to find global metabolic regulator proteins. The key distinction of our work is that we focus on how machine learning model interpretation with SHAP can reveal how proteins control metabolites globally. This allowed us to explore the utility of the SHAP-derived ProC values, and we demonstrated here how the ProC values derived from model interpretation can reveal functions for characterized and uncharacterized genes.

As one additional example demonstrating the utility of these predicted ProC values, we used dimension reduction and clustering of ProC values to discover similarity between experimental conditions. When each study condition is a single gene knockout, as was the case in this study, this analysis reveals new connections between those genes in the form of a similarity network. The similarities revealed by this method are different from those obtained from simply clustering the proteomics profiles from each condition. The utility of this method was demonstrated by predicting and validating functions for several uncharacterized and characterized yeast genes.

Evaluation of the true and false positive rate for discovery of connections suggested by SHAP would require a ground truth dataset of known interactions, and unfortunately our knowledge of metabolic connections is incomplete. We attempted to understand how SHAP values for a protein differ from simple correlation between proteins and metabolites in Supplementary Figure S2 (readers can perform the same analysis for any metabolite of interest at mimal.app). This question of which discovered connections may be real is what prompted us to do the network analysis in Figure 2 to check how these connections compare to what is known. We were encouraged to see that some of the connections for citrate reflect known relevant upstream or downstream metabolic pathways, for example, Idh1 and Idh2. We were also encouraged to validate some of these new connections (Ald5 and Aat2 connecting to citrate in Fig. 1E and F). We also validated that summaries of the discovered connections reveal relationships between experimental conditions, in this case suggesting and validating functions for five genes.

Although in this study we focused on proteomic and metabolomic data integration, these methods can be used to discover connections between any two omic layers. Additionally, the data we have generated cataloging new connections between yeast proteins and metabolites will serve as a reference to better understand orphan mitochondrial proteins and basic yeast metabolism. Overall, we foresee the methods developed here will be useful as new multi-omic integration techniques that provide a unique view into the relationships between multi-omic levels and the similarities between biological perturbations.

Supplementary Material

btac631_Supplementary_Data


Acknowledgements

This research was completed in part with computational resources and technical support provided by the Research Computing Center at the Medical College of Wisconsin. We thank Monika Zielonka and the Redox & Bioenergetics Shared Resource at the Medical College of Wisconsin Cancer Center for help with seahorse data collection. We thank H. Adam Steinberg for graphic design help. We thank Jong-In Park for helpful discussions. We thank Yuming Jiang for help with experiments.

Funding

This work was supported by the United States National Institutes of Health (NIH) NIGMS [R35 GM142502], the Swedish Research Council and the Knut and Alice Wallenberg Foundation.

Conflict of Interest: Q.D. and J.G.M. are named inventors on a provisional patent related to this technology.

Contributor Information

Quinn Dickinson, Department of Biochemistry, Medical College of Wisconsin, Milwaukee, WI 53226, USA; Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA.

Andreas Aufschnaiter, Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden.

Martin Ott, Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden; Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, University of Gothenburg, Gothenburg, Sweden.

Jesse G Meyer, Department of Biochemistry, Medical College of Wisconsin, Milwaukee, WI 53226, USA; Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA.

Data Availability

Supporting data are available at https://doi.org/10.5281/zenodo.6537297. MS data are available under the identifier MSV000090100 at https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=ba70b1440b2b4c488323fa6644b332cb.

References

Ankerst M. et al. (1999) OPTICS: ordering points to identify the clustering structure. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: Association for Computing Machinery (SIGMOD ’99), pp. 49–60.
Bindea G. et al. (2009) ClueGO: a cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics, 25, 1091–1093.
Byrne K.P., Wolfe K.H. (2005) The yeast gene order browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Res., 15, 1456–1461. doi: 10.1101/gr.3672305.
Carlström A. et al. (2021) The analysis of yeast mitochondrial translation. Methods Mol. Biol. (Clifton, N.J.), 2192, 227–242.
Chai H. et al. (2021) Integrating multi-omics data through deep learning for accurate cancer prognosis prediction. Computers in Biology and Medicine, 134, 104481. doi: 10.1016/j.compbiomed.2021.
Cherry J.M. et al. (2012) Saccharomyces genome database: the genomics resource of budding yeast. Nucleic Acids Res., 40, D700–705.
Daniel Gietz R., Woods R.A. (2002) Transformation of yeast by lithium acetate/single-stranded carrier DNA/polyethylene glycol method. Methods Enzymol., 350, 87–96. doi: 10.1016/s0076-6879(02)50957-5.
Dijkstra E.W. (1959) A note on two problems in connexion with graphs. Numer. Math., 1, 269–271.
Gillespie M. et al. (2022) The reactome pathway knowledgebase 2022. Nucleic Acids Res., 50, D687–D692.
Goloborodko A.A. et al. (2013) Pyteomics—a python framework for exploratory data analysis and rapid software prototyping in proteomics. J. Am. Soc. Mass Spectrom., 24, 301–304.
Hicks K.G. et al. (2021) Protein-metabolite interactomics reveals novel regulation of carbohydrate metabolism. bioRxiv, p. 2021.08.28.458030. doi: 10.1101/2021.08.28.458030.
Hunter J.D. (2007) Matplotlib: a 2D graphics environment. Comput. Sci. Eng., 9, 90–95.
Janke C. et al. (2004) A versatile toolbox for PCR-based tagging of yeast genes: new fluorescent proteins, more markers and promoter substitution cassettes. Yeast (Chichester, England), 21, 947–962.
Jose L.A.-L. et al. (2016) Slm35 links mitochondrial stress response and longevity through TOR signaling pathway. Aging (Albany NY), 8, 3255–3268.
Kim M. et al. (2016) Multi-omics integration accurately predicts cellular state in unexplored conditions for Escherichia coli. Nat. Commun., 7, 13090.
Krassowski M. et al. (2020) State of the field in Multi-Omics research: from computational needs to data mining and sharing. Front. Genet., 11, 610798.
Larimer F.W. et al. (1978) Mutagenicity of methylated N-nitrosopiperidines in Saccharomyces cerevisiae. Mutat. Res., 57, 155–161.
Levitsky L.I. et al. (2019) Pyteomics 4.0: five years of development of a python proteomics framework. J. Proteome Res., 18, 709–714.
Louhimo R., Hautaniemi S. (2011) CNAmet: an R package for integrating copy number, methylation and expression data. Bioinformatics (Oxf., Engl.), 27, 887–888.
Lundberg S.M., Lee S.-I. (2017) A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Curran Associates Inc. (NIPS’17), Red Hook, NY, USA, pp. 4768–4777.
McInnes L. et al. (2020) UMAP: uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 [cs, stat] [Preprint].
Miao Z. et al. (2021) Multi-omics integration in the age of million single-cell data. Nat. Rev. Nephrol., 17, 710–724.
Mo Q. et al. (2013) Pattern discovery and cancer gene identification in integrated cancer genomic data. Proc. Natl. Acad. Sci. USA, 110, 4245–4250.
Moreira K.E. et al. (2009) Pil1 controls eisosome biogenesis. Mol. Biol. Cell, 20, 809–818.
Morris J.H. et al. (2011) clusterMaker: a multi-algorithm clustering plugin for cytoscape. BMC Bioinformatics, 12, 436.
Pedregosa F. et al. (2011) Scikit-learn: machine learning in python. J. Mach. Learn. Res., 12, 2825–2830.
Picard M. et al. (2021) Integration strategies of multi-omics data for machine learning analysis. Comput. Struct. Biotechnol. J., 19, 3735–3746.
Prestele M. et al. (2009) Mrpl36 is important for generation of assembly competent proteins during mitochondrial translation. Mol. Biol. Cell, 20, 2615–2625.
Rak M., Tzagoloff A. (2009) F1-dependent translation of mitochondrially encoded Atp6p and Atp8p subunits of yeast ATP synthase. Proc. Natl. Acad. Sci. USA, 106, 18509–18514.
Reback J. et al. (2022) pandas-dev/pandas: Pandas 1.4.3. Zenodo. doi: 10.5281/zenodo.6702671.
Ronen J. et al. (2019) Evaluation of colorectal cancer subtypes and cell lines using deep learning. Life Sci. Alliance, 2, e201900517.
Sawai H. et al. (2000) Identification of ISC1 (YER019w) as inositol phosphosphingolipid phospholipase C in Saccharomyces cerevisiae. J. Biol. Chem., 275, 39793–39798.
Schapire R.E. (2013) Explaining AdaBoost. In: Schölkopf B. et al. (eds) Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik. Springer, Berlin, Heidelberg, pp. 37–52.
Schneider C.A. et al. (2012) NIH image to ImageJ: 25 years of image analysis. Nat. Methods, 9, 671–675.
Shannon P. et al. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res., 13, 2498–2504.
Sharifi-Noghabi H. et al. (2019) MOLI: multi-omics late integration with deep neural networks for drug response prediction. Bioinformatics, 35, i501–i509.
Shen R. et al. (2009) Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics, 25, 2906–2912.
Singh A. et al. (2019) DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics (Oxf., Engl.), 35, 3055–3062.
Singh A.P. et al. (2020) Molecular connectivity of mitochondrial gene expression and OXPHOS biogenesis. Mol. Cell, 79, 1051–1065.e10.
Spira F. et al. (2012) Patchwork organization of the yeast plasma membrane into numerous coexisting domains. Nat. Cell Biol., 14, 640–648.
Sreelatha A. et al. (2018) Protein AMPylation by an evolutionarily conserved pseudokinase. Cell, 175, 809–821.e19.
Stefely J.A. et al. (2016) Mitochondrial protein functions elucidated by multi-omic mass spectrometry profiling. Nat. Biotechnol., 34, 1191–1197.
Subramanian I. et al. (2020) Multi-omics data integration, interpretation, and its application. Bioinform. Biol. Insights, 14, 1177932219899051.
Tukey J.W. (1949) Comparing individual means in the analysis of variance. Biometrics, 5, 99–114.
Vaena de Avalos S. et al. (2005) The phosphatidylglycerol/cardiolipin biosynthetic pathway is required for the activation of inositol phosphosphingolipid phospholipase C, Isc1p, during growth of Saccharomyces cerevisiae. J. Biol. Chem., 280, 7170–7177.
Vaske C.J. et al. (2010) Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26, i237–i245.
Walther T.C. et al. (2007) Pkh-kinases control eisosome assembly and organization. EMBO J., 26, 4946–4955.
Waskom M.L. (2021) Seaborn: statistical data visualization. J. Open Source Softw., 6, 3021.
Wilson C.M. et al. (2019) Multiple-kernel learning for genomic data mining and Prediction. BMC Bioinformatics, 20, 426.
Zelezniak A. et al. (2018) Machine learning predicts the yeast metabolome from the quantitative proteome of kinase knockouts. Cell Syst., 7, 269–283.e6.
