Interpretation of network-based integration from multi-omics longitudinal data

Antoine Bodein; Marie-Pier Scott-Boyer; Olivier Perin; Kim-Anh Lê Cao; Arnaud Droit

2021 Dec 9;50(5):e27. doi: 10.1093/nar/gkab1200

PMCID: PMC8934642 PMID: 34883510

Abstract

Multi-omics integration is key to fully understanding complex biological processes in a holistic manner. Furthermore, multi-omics combined with new longitudinal experimental designs can reveal dynamic relationships between omics layers and identify key players or interactions in system development or complex phenotypes. However, integration methods have to address various experimental designs and do not guarantee interpretable biological results. The new challenge of multi-omics integration is to solve interpretation and unlock the hidden knowledge within the multi-omics data. In this paper, we go beyond integration and propose a generic approach to face the interpretation problem. From multi-omics longitudinal data, this approach builds and explores hybrid multi-omics networks composed of both inferred and known relationships within and between omics layers. With smart node labelling and propagation analysis, this approach predicts regulation mechanisms and multi-omics functional modules. We applied the method to three case studies with various multi-omics designs and identified new multi-layer interactions involved in key biological functions that could not be revealed with single omics analysis. Moreover, we highlighted interplay in the kinetics that could help identify novel biological mechanisms. This method is available as an R package netOmics to readily suit any application.

INTRODUCTION

Cost reductions of DNA sequencing in addition to other high-throughput multi-omics technologies have revolutionized many research fields ranging from personalized medicine (1,2) to systems biology (3,4). These innovations have led to new biological insights and a better understanding of living organisms (5–7). By enabling the assessment of most biological layers, this democratisation of high-throughput technologies has created large datasets representing different biomolecules that necessitate specific processing and statistical methods (8,9). Multi-omics trials typically collect different types of biomolecules (mRNA, proteins, metabolites, etc.) from the same biological samples with the ultimate goal of highlighting the interactions between biological layers that could be responsible for complex phenotypes or diseases (2). Moreover, most biological phenomena involve complex interactions between layers that vary through time (10). Multi-omics time-course methods that integrate the data and accurately capture interactions within and between omic layers are thus now required.

To describe complex interactions and regulatory mechanisms behind biological systems, mathematical models, such as networks, are built to interpret and reverse-engineer cellular functions. Networks are used to represent all relevant interactions taking place in a biological system (11). In networks, molecules (genes, proteins, metabolites) are reduced to a series of nodes that are connected to each other by edges. Edges represent pairwise relationships, or interactions, between two molecules within the same network. Molecular networks have become extremely popular and have been used in every area of biology to model, for example, transcriptional regulation mechanisms, physical protein–protein interactions (12,13), or metabolic reactions (14). Networks come with valuable properties and useful topological features, such as the degree distribution, which identifies highly connected nodes, or shortest paths, which determine the proximity between two nodes. On a different scale, network modularity defines sub-network units whose nodes are highly connected with respect to the rest of the network. These sub-networks, also known as modules, often share a similar function. Thus, the ‘guilt by association’ property assumes that known or unknown highly connected molecules should be functionally related (15).
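As an illustration of the topological features mentioned above, the following minimal sketch computes node degrees and a shortest-path length on a toy undirected network. The node names and edges are hypothetical and are not taken from the case studies; the functions are illustrative helpers, not part of netOmics.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map (node -> set of neighbours)."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degrees(graph):
    """Degree of every node, i.e. its number of direct neighbours."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def shortest_path_length(graph, source, target):
    """Number of edges on a shortest path (breadth-first search).

    Returns None when the target cannot be reached from the source.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical toy network mixing a TF, two genes, a protein and a metabolite.
toy_edges = [("TF1", "geneA"), ("TF1", "geneB"),
             ("geneB", "protB"), ("protB", "metX")]
toy_graph = build_graph(toy_edges)
toy_degrees = degrees(toy_graph)
toy_distance = shortest_path_length(toy_graph, "geneA", "metX")
```

In this toy example the TF is the most connected node, and the gene and the metabolite are four edges apart.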

Inference methods for network construction are often applied to a single omic layer to identify interactions between molecules. However, this does not directly elucidate interactions across multiple omic layers (16). To connect these layers, a first approach requires prior knowledge of cross-omic molecular networks, such as that found in publicly available databases (17). This approach relies on legacy knowledge that is limited to model organisms and may not reflect the biological condition under study. A second approach uses multivariate data-driven methods that statistically infer correlations between molecules based on multi-omics data. However, this approach may have many possible solutions. A combination of the two could improve multi-omics network construction.

The ultimate goal of multi-omics networks is to connect phenotypes to biological mechanisms and their regulators. Analysis of the interactome identifies direct neighbors and modules linked to a phenotype. However, relying on direct neighbors alone can lead to false discoveries, since phenotypes and molecules may be linked by irrelevant interactions. Furthermore, our knowledge of the interactome is incomplete, so true interactions, or interactions with more distant molecules, can be missed (18). Based on the work of Page et al. (19), propagation algorithms have recently become the state of the art for investigating gene–disease associations and for gene function prediction (18–20). Starting from the known associations, the signal is iteratively propagated through the network. When a steady state is reached, new nodes can be added to the initial association according to their propagation score, which reflects their proximity to the starting nodes. This highlights potential new phenotype-related targets. Recent advances in random walk algorithms allow the signal to be propagated in heterogeneous multi-layered networks, which improves association prediction (21).

In this paper, we propose to build hybrid multi-omics networks from longitudinal multi-omics data in order to facilitate the interpretation of multi-layer systems (Figure 1). This methodology is based in the first place on the modelling and clustering of expression profiles with similar behaviours over time. It relies on both accurate network reconstruction methods and knowledge-based reviewed interactions between molecules of the same or different types. Finally, a random walk algorithm is used to identify links between omics molecules and key biological functions or mechanisms, and to generate new hypotheses about them. The main objectives of this method are to provide a versatile framework for multi-omics network-based integration and to provide interpretation guidelines for exploring these networks in order to further highlight key intra-omics and inter-omics mechanisms and interactions. We illustrate this approach through three case studies. These studies have different experimental designs with different omics data types, timepoints and organisms to demonstrate that the proposed approach is able to deal with a wide range of situations.

Figure 1. — Overview of the proposed approach. (A) Description of the experimental design: the same biological material is sampled at several time points across several omic layers indicated in different colors. Each omic data is normalised using both platform-specific and time-specific normalisation steps. (B) Multi-Omics Network is built using both inference-based and knowledge-based methods to connect intra- and cross-layered biological features or molecules (mRNA, proteins, metabolites). Measured molecules are clustered into groups of similar expression profiles over time and corresponding nodes formed kinetic sub-networks. Over-representation analysis is performed to add an extra layer of functional annotation. Propagation analysis is performed on specific nodes of interest, called *seeds* (biological function, gene, protein, metabolite, etc.) to identify closely related molecules.

MATERIALS AND METHODS

This approach proposes pre-processing, modelling and clustering steps for multi-omics longitudinal data. It mainly emphasizes network-based integration and multi-layered network exploration (Figure 2) using network propagation algorithms in order to provide new biological insights. We developed the R package netOmics, which wraps the method proposed below. It was developed to simplify the integration and interpretation steps and make them reproducible. It provides documentation and practical guidelines to build and explore (longitudinal) multi-omics networks.

Figure 2. — Workflow diagram illustrating the main steps of network based integration using longitudinal multi-omics data.

Multi-Omics longitudinal design

We define longitudinal multi-omics designs as follows. From the same biological sample, omics data are produced (RNA, proteins, etc.) at different timepoints. Raw data are processed to get (N × P) abundance tables by omics data type, with samples in rows and molecules or biological features (RNA, proteins, metabolites, etc.) in columns. We call these tables blocks. In this framework, there is no need to have matching timepoints between blocks because we use a modelling step to interpolate missing timepoints and even out uneven designs.

Pre-processing of longitudinal multi-omics data

We assume each omics dataset is a raw count table resulting from bioinformatics quantification pipelines (22,23). Low counts are filtered and data are normalized according to the type of data in each table. We also applied a filter on time profiles and kept only molecules with the highest expression fold change between the lowest and highest point over the entire time course, as described in (24). For each case study, we adapted these filters to take into account the platform-specific dynamic range of values.
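The time-profile filter described above can be sketched as follows. This is a minimal illustration, assuming strictly positive abundances on a linear scale and a "greater than or equal to" comparison; the function names, the example profiles and the exact comparison rule are assumptions and do not reproduce the filter implemented in (24).

```python
def fold_change(profile):
    """Ratio between the highest and lowest value of a time profile."""
    lowest, highest = min(profile), max(profile)
    if lowest <= 0:
        # Assumption: abundances are strictly positive on a linear scale.
        raise ValueError("fold change is undefined for non-positive values")
    return highest / lowest


def filter_by_fold_change(profiles, threshold):
    """Keep molecules whose fold change over time reaches the threshold."""
    return {name: values for name, values in profiles.items()
            if fold_change(values) >= threshold}


# Hypothetical mean abundances at three timepoints.
example_profiles = {
    "mol_varying": [10.0, 25.0, 40.0],   # 4-fold change, kept at 2-FC
    "mol_flat": [10.0, 11.0, 12.0],      # 1.2-fold change, discarded
}
kept = filter_by_fold_change(example_profiles, threshold=2.0)
```

A platform-specific threshold, such as the 2-fold and 3-fold values used in the first case study, would simply be passed as `threshold`.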

timeOmics: modelling and clustering of longitudinal multi-omics data

The timeOmics approach (24) was used to cluster multi-omics molecules with similar expression profiles over time. The framework is based on two main steps:

(i) The first step uses a Linear Mixed Model Spline framework (25) to model every molecule over the time-course by taking into account the inter-individual variation. This framework tests different models and assigns the best model to each molecule according to a goodness of fit test. One of its benefits is to allow interpolation of missing timepoints and thus accommodate non-regular experimental designs with missing data.
(ii) The second step clusters the modelled expression profiles into groups with similar expression over time. This is performed using various multivariate projection-based methods implemented in mixOmics (26). With a three-block omic design, we used multi-block Projection on Latent Structures (block PLS) to cluster time profiles from multi-omics datasets. The optimal number of clusters is determined by maximising the average silhouette coefficient.

The objective of these preliminary steps is to summarise each molecule with an expression pattern. This tag will then be used in the multi-omics network reconstruction to build cluster-specific sub-networks.
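The selection of the number of clusters by maximising the average silhouette coefficient can be sketched as follows. This is a minimal illustration that assumes Euclidean distance between modelled profiles and takes candidate cluster assignments as given; it does not reproduce the block PLS clustering itself, and the profiles and labels are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def average_silhouette(points, labels):
    """Mean silhouette coefficient over all points (singletons score 0)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of another cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_clustering(points, candidates):
    """Return the key of the candidate labelling with the highest silhouette."""
    return max(candidates, key=lambda k: average_silhouette(points, candidates[k]))


# Hypothetical modelled profiles: two increasing and two decreasing molecules.
profiles = [(0.0, 1.0, 2.0), (0.0, 1.1, 2.1), (2.0, 1.0, 0.0), (2.1, 1.0, 0.0)]
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
chosen_k = best_clustering(profiles, candidates)
```

Here the two-cluster labelling separates increasing from decreasing profiles and obtains the higher average silhouette.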

Network reconstruction

In order to build a multi-omics network, we started by building a map for each layer (genes, proteins, metabolites, etc.) using a combination of both data-driven and knowledge-driven building methods. The methods used are specific to the type of data and are described below. As mentioned in the timeOmics section above, we kept the clustering information by building a sub-network for each kinetic cluster. We also built an entire network without the cluster labels.

Data-driven network reconstruction

For gene expression data, we relied on Gene Regulatory Network inference. This class of methods tries to reverse-engineer complex regulatory mechanisms in organisms and infer relationships between genes. ARACNe (27) is a co-expression-based inference algorithm which identifies the most likely TF–target gene interactions by estimating mutual information, a measure of similarity between pairs of transcript expression profiles. We applied the ARACNe algorithm to the gene expression profiles to infer potential TF–target gene interactions.
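The idea behind mutual-information-based inference can be sketched as follows. This is a simplified illustration, not the ARACNe implementation: it swaps in a naive equal-width binning estimator of mutual information and a strict, tolerance-free version of the data processing inequality (within every fully connected triplet, an edge is dropped when its mutual information is lower than that of both other edges). The function names, the binning choice and the toy profiles are assumptions.

```python
import math
from itertools import combinations


def discretize(values, bins):
    """Equal-width binning of a profile into integer bin indices."""
    low, high = min(values), max(values)
    if high == low:
        return [0] * len(values)
    width = (high - low) / bins
    return [min(int((v - low) / width), bins - 1) for v in values]


def mutual_information(x, y, bins=3):
    """Plug-in mutual information (in nats) between two binned profiles."""
    dx, dy = discretize(x, bins), discretize(y, bins)
    n = len(dx)
    joint, px, py = {}, {}, {}
    for a, b in zip(dx, dy):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in joint.items()
    )


def aracne_like(profiles, mi_threshold=0.0, bins=3):
    """Pairwise MI network pruned with the data processing inequality."""
    names = sorted(profiles)
    mi = {frozenset(pair): mutual_information(profiles[pair[0]],
                                              profiles[pair[1]], bins)
          for pair in combinations(names, 2)}
    kept = {edge for edge, value in mi.items() if value > mi_threshold}
    removed = set()
    for trio in combinations(names, 3):
        edges = [frozenset(pair) for pair in combinations(trio, 2)]
        if not all(edge in kept for edge in edges):
            continue
        for edge in edges:
            others = [mi[o] for o in edges if o != edge]
            if mi[edge] < min(others):  # likely an indirect interaction
                removed.add(edge)
    return {tuple(sorted(edge)): mi[edge] for edge in kept - removed}


# Hypothetical profiles: geneA and geneC are related only through geneB.
toy_profiles = {
    "geneA": [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0],
    "geneB": [1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0, 5.0],
    "geneC": [1.0, 1.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
}
toy_network = aracne_like(toy_profiles, bins=2)
```

In this toy example the weakest edge of the triplet, between geneA and geneC, is treated as indirect and removed.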

Knowledge-driven network

Some kinds of interactions cannot be revealed by inference methods and have to be experimentally determined (e.g. protein–protein interactions, ChIP). We therefore relied on reviewed interactions found in specialized databases to connect molecules of the same type and also to obtain cross-layer interactions (binding, enzyme, regulation). From the measured molecules in the dataset, we collected all possible interactions through targeted databases. To maximize cross-layer connectivity, we also included non-measured proteins or metabolites which were directly connected to measured molecules.
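The extraction of known interactions around measured molecules can be sketched as follows. This is a minimal illustration assuming the database is available as a plain list of (molecule, molecule) pairs; the function name, the identifiers and the toy database content are hypothetical.

```python
def knowledge_subnetwork(database_edges, measured):
    """Keep database interactions that involve at least one measured molecule.

    Returns the kept edges and the set of non-measured molecules that enter
    the network as direct (first-degree) neighbours of measured ones.
    """
    measured = set(measured)
    kept = [(a, b) for a, b in database_edges if a in measured or b in measured]
    nodes = {node for edge in kept for node in edge}
    return kept, nodes - measured


# Hypothetical interaction database and measured molecules.
toy_database = [
    ("protA", "protB"),   # both measured
    ("protB", "protX"),   # protX is not measured but is a direct neighbour
    ("protX", "protY"),   # neither is measured: discarded
    ("protA", "metM"),    # cross-layer link to a non-measured metabolite
]
edges, added_neighbours = knowledge_subnetwork(toy_database, {"protA", "protB"})
```

Interactions between two non-measured molecules are discarded, so the network does not grow beyond the first-degree neighbours.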

Protein interactions

Physical or functional protein–protein interaction (PPI) is one kind of interaction that is difficult to predict, and PPI network inference algorithms for MS data are still in their infancy (28). For human proteomic data, we relied on the BioGRID database (29), which records >1.8 million protein and genetic interactions from major model organisms. This database collects experimentally determined physical protein–protein interactions and also connects the transcriptomics and proteomics layers through regulatory relationships (TF–gene interactions). For other model organisms, we can rely on more specialised or custom databases (30).

Metabolite interactions

We used the KEGG Pathway database (31), which records a collection of manually drawn metabolic pathways representing molecular interaction, reaction and relation networks for human and other model organisms. We used KEGG to link metabolite compounds involved in the same reactions. We also connected metabolites to genes and/or proteins if they are involved in the same enzymatic reaction, using the KEGG Orthology database, which links genes to high-level functions.

The objective of this building step is to provide an entire multi-omics network composed of three main layers and several sub-networks specific to kinetic clusters. The next steps will focus on the analysis of these multi-omics networks.

Enrichment analysis

Over-representation analysis (ORA) helps to extract meaningful biological insights from interacting biomolecules. This task was performed using gProfiler2 (32), first on each kinetic cluster and then on all molecules from the entire network. We focused on the three Gene Ontology (GO) ontologies: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC). P-values were corrected with the gProfiler2 custom multiple testing correction algorithm (g:SCS) (33) and only significant terms were considered (g:SCS < 0.05). The size and significance of the P-value distributions were compared between the cluster and entire-network approaches. We also used Fisher’s combined probability test (34) for multi-omics P-value comparison.
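The two statistical ingredients mentioned above can be sketched as follows. This is a minimal illustration assuming a plain hypergeometric test for over-representation (it does not reproduce the g:SCS correction used by gProfiler2) and the classical form of Fisher's method, whose statistic -2 Σ ln(p_i) follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The function names and the example numbers are hypothetical.

```python
import math


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n molecules are drawn from N, K of which are annotated.

    k: annotated molecules observed in the query list of size n.
    """
    upper = min(n, K)
    total = math.comb(N, n)
    favourable = sum(math.comb(K, i) * math.comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / total


def fisher_combined(pvalues):
    """Combine independent P-values with Fisher's method.

    For an even number of degrees of freedom (2k), the chi-square survival
    function has the closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals X / 2
    return math.exp(-half_stat) * sum(half_stat ** i / math.factorial(i)
                                      for i in range(k))


# Hypothetical example: 8 of 20 query molecules carry a term that annotates
# 50 of the 1000 background molecules.
ora_p = hypergeom_pvalue(k=8, n=20, K=50, N=1000)

# Hypothetical P-values of the same term obtained in two omics layers.
combined_p = fisher_combined([0.05, 0.05])
```

Combining two P-values of 0.05 gives a combined P-value of about 0.0175, illustrating how consistent moderate evidence across layers is reinforced.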

Random walk

As described in (21), in an undirected graph G = (V, E), the random walk (RW) starts from a node (v₀), called the seed, and simulates a particle that randomly moves from one node v_t to another v_{t + 1} following the probability distribution:

$$P(v_{t+1} = y \mid v_t = x) = \begin{cases} \dfrac{1}{d(x)} & \text{if } (x, y) \in E \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where d(x) is the degree of the node x in the graph G. Valdeolivas et al. (21) also added the possibility to restart at the initial node to avoid dead ends in multi-layered networks. When a steady state is reached, the algorithm gives a probability score to each node of the network which represents the proximity of that node to the seed.
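The random walk with restart described above can be sketched as follows for a single-layer undirected graph. This is a minimal illustration, not the RandomWalkRestartMH implementation: the restart probability of 0.7, the convergence tolerance and the toy graph are assumptions, and the multi-layer jumps of (21) are not modelled.

```python
def random_walk_with_restart(graph, seed, restart=0.7, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker restarting at the seed.

    graph: dict mapping each node to the set of its neighbours (undirected).
    At each step the walker returns to the seed with probability `restart`,
    otherwise it moves to a uniformly chosen neighbour, i.e. with
    probability 1/d(x) from node x.
    """
    if seed not in graph:
        raise ValueError("the seed must be a node of the graph")
    nodes = sorted(graph)
    start = {node: 1.0 if node == seed else 0.0 for node in nodes}
    scores = dict(start)
    for _ in range(max_iter):
        updated = {node: restart * start[node] for node in nodes}
        for x in nodes:
            if not graph[x]:
                continue  # assumption: isolated nodes do not pass on any signal
            share = (1.0 - restart) * scores[x] / len(graph[x])
            for y in graph[x]:
                updated[y] += share
        change = sum(abs(updated[node] - scores[node]) for node in nodes)
        scores = updated
        if change < tol:
            break
    return scores


# Hypothetical chain network: seed -- geneA -- protB -- metX.
toy_graph = {
    "seed": {"geneA"},
    "geneA": {"seed", "protB"},
    "protB": {"geneA", "metX"},
    "metX": {"protB"},
}
toy_scores = random_walk_with_restart(toy_graph, "seed")
```

The scores sum to one on a connected graph and decrease with the distance from the seed, which is what allows nodes to be ranked by proximity.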

We then used the R package RandomWalkRestartMH (21) to apply the random walk with restart algorithm to the multi-omics network, with three main purposes to guide interpretation.

(i) RW can be used to identify multi-omics nodes and their interactions linked to mechanisms of interest (e.g. GO:BP). A GO term node can be turned into a seed and RW performed from that starting point. A sub-network with the top 25 closest nodes to that seed can then be built. Naively, all significant GO term nodes were iteratively turned into seeds. We then screened sub-networks containing different types of molecules to highlight the multi-omics aspects of the integration. We applied this analysis to both the kinetic cluster sub-networks and the entire network.

(ii) RW can be used for node function prediction. As above, unlabelled nodes can be turned into seeds. We relied on gProfiler2 annotations to identify nodes without any known function. For an unlabelled seed, a list of ranked nodes was produced and the closest GO term node was assigned to that seed. We repeated this for the three GO ontologies (BP, MF, CC) on the entire network.

(iii) Combined with kinetic clusters, RW can locate regulatory mechanisms and find interacting nodes with different expression profiles in the entire network. Once again, each node can be turned into a seed and sub-networks were built using the 10 closest nodes. We then screened for sub-networks containing cluster labels different from that of the seed, which might reveal underlying regulatory mechanisms.
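The three screening strategies above operate on the ranked propagation scores and can be sketched as follows. This is a minimal illustration in which the scores, node types and cluster labels are hypothetical inputs; the helper names are assumptions and are not functions of netOmics or RandomWalkRestartMH.

```python
def top_closest(scores, seed, k):
    """The k nodes with the highest propagation score, excluding the seed."""
    ranked = sorted((node for node in scores if node != seed),
                    key=lambda node: (-scores[node], node))
    return ranked[:k]


def is_multi_omics(nodes, node_type, min_types=2):
    """(i) True when the sub-network mixes at least `min_types` molecule types."""
    return len({node_type[node] for node in nodes}) >= min_types


def predict_function(scores, seed, go_nodes):
    """(ii) Closest GO term node to an unlabelled seed, or None if unreachable."""
    reachable = [node for node in go_nodes
                 if node != seed and scores.get(node, 0.0) > 0.0]
    return max(reachable, key=lambda node: scores[node]) if reachable else None


def has_other_cluster(nodes, seed, cluster_of):
    """(iii) True when a neighbour belongs to a different kinetic cluster."""
    return any(cluster_of[node] != cluster_of[seed]
               for node in nodes if node in cluster_of)


# Hypothetical scores obtained with the GO term "GO:0001" as seed.
go_scores = {"GO:0001": 0.70, "geneA": 0.10, "protB": 0.08,
             "metX": 0.05, "geneC": 0.01}
types = {"geneA": "gene", "protB": "protein", "metX": "metabolite",
         "geneC": "gene"}
closest = top_closest(go_scores, "GO:0001", k=3)
mixed = is_multi_omics(closest, types)

# Hypothetical scores obtained with the unlabelled node "geneC" as seed.
gene_scores = {"geneC": 0.72, "geneA": 0.12, "GO:0001": 0.09, "GO:0002": 0.02}
predicted = predict_function(gene_scores, "geneC", {"GO:0001", "GO:0002"})
clusters = {"geneC": 1, "geneA": 2}
regulatory_candidate = has_other_cluster(["geneA"], "geneC", clusters)
```

In practice `k` would be set to 25 for GO term seeds and to 10 for the cluster-label screening, as described above.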

Data

In the following section, three published multi-omics case studies are presented. These applications have longitudinal multi-omics designs, but each block was originally analysed separately. We modelled these datasets with multi-layer networks to highlight the multi-omics interactions. Specific analysis steps for each example are described, along with the specific databases used for each case study.

Case study 1: HeLa cell cycling study

Understanding the complex relationship between gene expression, translation product and protein abundance is key to deciphering biological mechanisms. Genes undergo several steps of regulation before turning into proteins, including transcription, translation, folding, post-translational modification and eventually degradation. Aviner et al. (35) studied the poor correlation between mRNA and related protein levels during cell cycle regulation using triplicate measurements of mRNA expression (microarray), translation products (PUNCH-P) and protein quantification (MS) from synchronized HeLa S3 cells. The authors sampled cells during phases G1, S and G2 (Figure 3).

Figure 3. — HeLa cell cycling study: overview of the analysis: (A) Experimental design: three samples are collected for each step G1, S and G2. For each sample, RNA, translation products and proteins are quantified. (B) Multi-Omics longitudinal clustering: Clustered expression profiles of mRNA, translation products and proteins. Each line represents the modelled abundance of a molecule during the cycle. 4 clusters were obtained using the timeOmics clustering approach. Cluster compositions are detailed in Table 2. (C) Multi-omics network layout: represents the connection between entities and their different types of interactions. The network was composed of four layers: a gene layer built from mRNA expression, a PPI layer built from measured proteins and BioGRID known interactions, a metabolite layer from KEGG pathways and a GO term layer from enrichment analysis.

The authors were able to find clusters of patterns in the expression of genes and their products related to key functions of the cell cycle. These functions were up- or down-regulated at different stages. For example, Cell division, Cytokinesis, Spindle, Chromosome segregation and Microtubule based movement biological processes shared similar patterns and were down-regulated in G1/S and up-regulated in S/G2 transitions. Interestingly, multi-omics showed that a gene and its derivatives can have different expression patterns during a given time course, which highlights underlying molecular mechanisms, such as mRNA or protein degradation.

In this first case study, we intended to highlight the multi-omics interactions involved in the control of HeLa cell cycle during G1, S and G2 phases.

To do so, we filtered the RMA-normalized mRNA with a 2-Fold-Change filter. iBAQ normalized translatome products and proteins were filtered with a different 3-Fold-Change threshold because of the differences in platform-specific dynamic ranges.

We modelled the expression of every molecule with LMMS, including the variation of the three replicates for each of the three timepoints. We built multi-omics clusters of expression profiles based on the direction of variation between each step. mRNA and protein networks were built with ARACNe and the BioGRID interaction database, respectively. Protein-coding and TF-regulation information was used to connect these layers, with UniProt IDs converted into Gene Symbols (https://www.uniprot.org/uploadlists/) to ensure proper matching. We also included metabolite reactions from KEGG to connect protein enzymes to metabolites. We performed ORA with mRNA, translation products and proteins against GO:BP, GO:MF and GO:CC terms for both the clusters and the entire set of molecules. Finally, we performed RW on this multi-omics network.

Case study 2: dynamic maize responses to aphid feeding

Maize (Zea mays) is one of the most productive cereal crops in the world. However, the plant is subject to numerous biotic attacks caused by herbivorous insects and it is therefore critical to understand the maize defense mechanisms in order to improve its productivity. Aviner et al. (35) studied the dynamics of the maize response to aphid feeding and found that mutants in benzoxazinoid biosynthesis and terpene synthase genes affect aphid proliferation.

To measure gene expression changes over time, the authors exposed five two-week-old maize plants (B73) to corn leaf aphid (Rhopalosiphum maidis) for four days (Figure 7). They also included five control maize plants with the same growing conditions minus the exposure to aphids. During this time course, they sampled the five exposed and five control plants at six timepoints (i.e. 2, 4, 8, 24, 48, 96 h) and conducted gene expression profiling with RNA-seq, LC-TOF-MS nontargeted metabolite quantification as well as targeted quantification of amino acids, phospholipids and terpenes.

Figure 7. — Dynamic response to maize aphid feeding study: (A) Experimental design: five samples are collected for each condition and for each timepoint (2,4,8,24,48,96 h). For each sample, RNA and targeted metabolites are quantified. (B) Multi-Omics longitudinal clustering: Clustered expression profiles of mRNA, TF–coding genes and metabolites. Each line represents the modelled abundance of a molecule during the time-course. (C) Multi-omics network layout: represents the connection between entities and their different types of interactions.

In the paper, the authors focused on the genes and metabolites involved in the maize response to aphids. We intended to go a little further by addressing the complex regulatory relationships that may exist between these multi-omic actors over time.

For this example, we focused on the first five timepoints (i.e. 2, 4, 8, 24, 48 h). We discarded genes which were not differentially expressed between exposed and control groups and those having an expression difference <2-fold over the entire time course. We split TF-coding genes from other transcripts to get a 3-omics-like experimental design. Finally, we discarded nontargeted metabolites as they were not annotated and performed longitudinal clustering with multi-block PLS. For network reconstruction, we used ARACNe on genes and TFs. We used the Protein–Protein Interaction database for Maize (PPIM) (36) to identify protein-coding genes and extracted direct neighbor interactions. PPIM contains more than 2.7 million interactions between proteins, TFs and genes. These interactions are either predicted, experimentally determined and collected from public databases such as UniProt (37), BioGRID (29), DIP (38), IntAct (39) and MINT (40), or obtained by text mining. We mainly focused on validated interactions and on predicted ones qualified as high-confidence interactions, which represent 155 845 interactions with the highest decision scores (36). Then we connected the measured metabolites and those involved in the same reaction to the genes/proteins using KEGG. We performed ORA with genes, TFs and proteins on GO:BP, GO:MF and GO:CC terms. Finally, we performed RW on this multi-omics network.

Case study 3: diabetes seasonal study

Diabetes is the seventh leading cause of death in the world according to the WHO (41) and its prevalence is constantly increasing (42). The role of the integrative Human Microbiome Project (iHMP) is to study the impact of host-microbiome relationships on diabetes mechanisms of appearance and progression (43).

In the iHMP study (44), 105 subjects with diabetes (Insulin Sensitive and Insulin Resistant) were followed over a period of more than 4 years. Each patient was sampled every 3 months and every 3–4 days during stress periods. This resulted in an average of 27 samples per patient. At each visit, 51 clinical tests were performed. From the blood samples, transcriptomics, proteomics, metabolomics, cytological and microbiome data (oral and gut) were produced.

Thanks to this complex design, Sailani et al. (45) identified several molecules linked to diabetes and important biological changes related to annual seasonality. They found different shifts in expression in all omics molecules and grouped them into two main expression profile clusters. We expect that network integration will provide a deeper understanding of the diabetes process and the role of host-microbiome interaction in that disease.

Since microbiome data are highly variable, we decided to perform the modelling and integration on only one individual. We also reduced the time period to one year (7 timepoints) for the same reason. We performed the integration of transcriptomics, proteomics, metabolomics, cytokines, gut microbiota data and clinical variables. We applied a Fold-Change filter with a block-specific threshold to each omics block, except the clinical variables. We modelled the data with LMMS and used multi-block PLS to cluster the data. We applied sparse multi-block PLS to identify a key signature per cluster. For network reconstruction, we applied the ARACNe algorithm (27) separately to transcriptomics, unlabelled metabolomics and clinical variables. We used the BioGRID interaction database (29) to build a PPI network from proteins and cytokines and used the information from TRRUST (46) and TF2DNA (47) to connect these molecules to the transcriptomics layer. We applied the SparCC algorithm (|ρ| ≥ 0.3) to build a microbiome network as recommended in (48). Given the limited knowledge about host-microbiota interactions, the microbiota network was connected to each of the layers listed above by computing the Spearman rank correlation on the expression data and keeping only the strongest correlations between microbiota and the other layers (|ρ| ≥ 0.99). We followed the same procedure to connect the unlabelled metabolomics data and clinical variables to the other layers. Lastly, we performed an enrichment analysis with the transcriptomics, proteomics and cytokine molecules against Gene Ontology and added the significant GO terms as a GO layer by connecting them to the corresponding RNA, proteins or cytokines.

In addition, we performed a gene-related disease enrichment analysis with MedlineRanker (49), a data mining tool which searches for publication abstracts in which genes were linked to diseases (MedlineRanker parameters: min number of citations = 5; min number of genes significantly associated with a disease = 2; FDR: 0.05). Like GO terms, disease terms were added to the network and connected to the related genes. Finally, we performed RW on this multi-omics network and the cluster-specific sub-networks.
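The correlation-based connection of the microbiota layer to the other layers can be sketched as follows. This is a minimal illustration that computes the Spearman rank correlation as the Pearson correlation of average ranks and keeps the pairs with |ρ| ≥ 0.99; the function names and the seven-timepoint toy profiles are hypothetical, and the SparCC step is not reproduced.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for position in range(i, j + 1):
            result[order[position]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when one profile is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation between two profiles."""
    return pearson(ranks(x), ranks(y))


def cross_layer_edges(layer_a, layer_b, threshold=0.99):
    """Pairs of molecules from two layers with |rho| >= threshold."""
    edges = []
    for name_a in sorted(layer_a):
        for name_b in sorted(layer_b):
            rho = spearman(layer_a[name_a], layer_b[name_b])
            if abs(rho) >= threshold:
                edges.append((name_a, name_b))
    return edges


# Hypothetical abundances over seven timepoints.
microbiota = {"taxon1": [1, 2, 3, 4, 5, 6, 7]}
host = {"geneA": [2, 4, 5, 9, 10, 20, 21],   # monotonic with taxon1
        "geneB": [5, 1, 7, 2, 6, 3, 4]}      # unrelated to taxon1
toy_edges = cross_layer_edges(microbiota, host)
```

With only seven timepoints, such a strict threshold essentially retains pairs whose profiles are perfectly monotonically related.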

RESULTS

Case study 1: HeLa cell cycling study

In this example, we studied the HeLa cell cycling from Aviner et al. (35). This dataset was composed of three omic layers and three timepoints.

Pre-processing, modelling and clustering of time profile

The HeLa cell dataset was assembled into a single table focused on proteins. After missing value removal, this dataset was composed of 6785 mRNA, 4102 translation products and 5023 proteins (Table 1). 448 mRNA, 2672 translation products and 4295 proteins remained after the Fold-Change filtering step. Finally, once all molecules were modelled over time and noisy profiles were removed, 446 mRNA, 2318 translation products and 4237 proteins remained.

Table 1.

HeLa cell cycling study: Initial number of mRNA, translation products and proteins, and remaining molecules after Fold-Change and noisy modelled profile filtering. A threshold of 2-FC was applied to mRNA and 3-FC to both proteins and translation products

	Raw counts	Fold change	LMMS filter
mRNA	6785	448	446
Trans. products	4102	2672	2318
Proteins	5023	4295	4237

Time profile clustering

Once all the molecules were modelled over time and noisy profiles were removed, the remaining expression profiles were clustered according to their differential expression value between two timepoints. This clustering resulted in 4 clusters (Table 2) with a silhouette coefficient of s_sil = 0.75. We compared this clustering with the timeOmics approach with four clusters and similar parameters. timeOmics resulted in a lower silhouette coefficient (s_sil = 0.63). In the first approach, Cluster 1 included the largest number of molecules (n = 4443). It is characterized by a decrease between the G1 and S phases and then an increase between S and G2/M. This seemed to be the main kinetic pattern for proteins since it contained the majority of them (4023 of 4237; Table 2). Cluster 2 (n = 1444) included the largest number of translation products (1244 of 2318; Table 2) and was characterized by an increase in expression from the first to the last step. Cluster 3 (n = 958) showed the opposite pattern to Cluster 2, with a decrease across the overall time course. Finally, Cluster 4 (n = 156) included the smallest number of molecules, and no protein appeared to follow its increase-then-decrease pattern.
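The clustering by direction of variation between consecutive phases can be sketched as follows. This is a minimal illustration assuming one modelled value per phase (G1, S, G2) and treating an unchanged value as a decrease; the label names, the tie rule and the example profiles are assumptions.

```python
from collections import Counter


def direction_label(g1, s, g2):
    """Label a profile by the sign of its change at each transition."""
    first = "up" if s > g1 else "down"     # G1 -> S
    second = "up" if g2 > s else "down"    # S -> G2
    return first + "-" + second


def cluster_by_direction(profiles):
    """Map every molecule to its direction label and count cluster sizes."""
    labels = {name: direction_label(*values) for name, values in profiles.items()}
    return labels, Counter(labels.values())


# Hypothetical modelled abundances at G1, S and G2.
toy_profiles = {
    "prot1": (3.0, 1.0, 2.5),   # decrease then increase (pattern of Cluster 1)
    "tp1": (1.0, 2.0, 3.0),     # continuous increase (pattern of Cluster 2)
    "tp2": (3.0, 2.0, 1.0),     # continuous decrease (pattern of Cluster 3)
    "mrna1": (1.0, 3.0, 2.0),   # increase then decrease (pattern of Cluster 4)
    "prot2": (4.0, 2.0, 3.0),   # decrease then increase
}
toy_labels, toy_sizes = cluster_by_direction(toy_profiles)
```

With three timepoints there are exactly four possible direction patterns, which matches the four clusters reported in Table 2.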

Table 2.

HeLa cell cycling study: clusters composition

	mRNA	Translation products	Proteins
Cluster 1	233	187	4023
Cluster 2	81	1244	119
Cluster 3	35	828	95
Cluster 4	97	59	0

Multi-layered network reconstruction

The first layer to be reconstructed was the gene inference network from the mRNA. We used the ARACNe algorithm to build a network for each kinetic cluster and also to build an entire network composed of all the mRNA. Supplementary Table S1 shows statistics about the sub-networks, such as the number of connected and disconnected nodes and the number of edges.

The second layer was the PPI network. Proteins were connected to each other using the BioGRID interaction database. As for the genes, PPI sub-networks were built for each kinetic cluster, and we also built an entire PPI network composed of all the proteins. In addition, we included BioGRID proteins which were directly connected to the measured ones (first-degree neighbours). The numbers of nodes and edges are detailed in Supplementary Table S1.

These first two layers were combined using two types of links. First, protein-coding information linked 126 genes to their corresponding proteins. Second, TF-regulation information from the TF2DNA (47) and TRRUST databases linked 57 proteins to 403 genes (16 846 interactions). In addition, the KEGG pathway database was used to link protein enzymes to metabolite reactions. This new layer was composed of 1595 metabolites connected to 2213 proteins (12 694 interactions).