Computational and Structural Biotechnology Journal. 2024 Apr 30;23:1945–1950. doi: 10.1016/j.csbj.2024.04.053

Knowledge-guided learning methods for integrative analysis of multi-omics data

Wenrui Li, Jenna Ballard, Yize Zhao, Qi Long
PMCID: PMC11087912  PMID: 38736693

Abstract

Integrative analysis of multi-omics data has the potential to yield valuable and comprehensive insights into the molecular mechanisms underlying complex diseases such as cancer and Alzheimer's disease. However, a number of analytical challenges complicate multi-omics data integration. For instance, -omics data are usually high-dimensional, and sample sizes in multi-omics studies tend to be modest. Furthermore, when genes in an important pathway have relatively weak signal, it can be difficult to detect them individually. There is a growing body of literature on knowledge-guided learning methods that can address these challenges by incorporating biological knowledge such as functional genomics and functional proteomics into multi-omics data analysis. These methods have been shown to outperform their counterparts that do not utilize biological knowledge in tasks including prediction, feature selection, clustering, and dimension reduction. In this review, we survey recently developed methods and applications of knowledge-guided multi-omics data integration methods and discuss future research directions.

Keywords: Knowledge-guided learning, Multi-omics, Integration, Prediction, Feature selection, Clustering, Dimension reduction

1. Introduction

Rapid advances in technologies have led to collection of various types of -omics data, such as genomics and proteomics data, in many biomedical studies. One notable example is the Alzheimer's disease neuroimaging initiative (ADNI) study [35], which collected multi-omics data from Alzheimer's disease (AD) patients, mild cognitive impairment subjects, and elderly controls. See Table 1 for several representative multi-omics databases. Integrative analyses of these datasets have the potential to deliver more comprehensive insights into the biological systems under study than is possible with individual modalities. For example, an integrative multi-omics approach revealed novel molecular and pathway alterations in Alzheimer's disease and led to better prediction of cognitive decline [9]. At the same time, there are a number of analytical challenges in integrative analysis of multi-omics data. For instance, -omics data are usually high-dimensional, leading to the classical small n large p problem. In addition, when individual genes in important pathways have relatively weak signal, it can be difficult to detect them on their own. To address these challenges, there is a growing body of literature on knowledge-guided learning methods for integrative analysis of multi-omics data that can incorporate biological knowledge such as functional genomics and functional proteomics via graph representations.

Table 1. List of multi-omics databases.

Database Name | Modalities | Biological Domain | Source
TCGA (The Cancer Genome Atlas) [42] | Genomics, epigenomics, transcriptomics, proteomics | Cancer | https://www.cancer.gov/ccg/research/genome-sequencing/tcga
CCLE (Cancer Cell Line Encyclopedia) [6] | Genomics, transcriptomics, epigenomics, proteomics, metabolomics | Cancer | https://sites.broadinstitute.org/ccle/
CPTAC (Clinical Proteomic Tumor Analysis Consortium) [12] | Genomics, proteomics (proteogenomics), imaging | Cancer | https://proteomics.cancer.gov/programs/cptac
METABRIC (Molecular Taxonomy of Breast cancer International Consortium) [11] | Genomics, transcriptomics, clinical | Breast cancer | https://www.mercuriolab.umassmed.edu/metabric
TARGET (Therapeutically Applicable Research to Generate Effective Treatments) [1] | Genomics, transcriptomics, epigenomics | Pediatric cancers | https://www.cancer.gov/ccg/research/genome-sequencing/target/using-target-data
ADNI (Alzheimer's Disease Neuroimaging Initiative) [35] | Imaging, genetics, clinical, biospecimen | Alzheimer's disease | https://adni.loni.usc.edu/
The Aging Atlas Database [2] | Genomics, transcriptomics, epigenomics, proteomics, pharmacogenomics, metabolomics | Age-related changes | https://ngdc.cncb.ac.cn/aging/index
MVIP (Multi-omics Portal of Virus Infection) [39] | Genomics, transcriptomics, epigenomics | Virology | https://mvip.whu.edu.cn/
iMETHYL [20] | Genomics, transcriptomics, epigenomics (DNA methylation) | Genetics, biology, molecular biology | http://imethyl.iwate-megabank.org/

Extensive research has yielded much information on the association structure among features (e.g., genes and proteins) and underlying networks that can be represented by graphs. For instance, the Kyoto Encyclopedia of Genes and Genomes (KEGG) project [17] links genomic information with higher order functional information by representing higher order functions as a network of interacting molecules. Another example is the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) [29], which systematically collects and integrates protein–protein physical interactions and functional associations. See Table 2 for several notable databases containing various types of biological knowledge.
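To make the phrase "represented by graphs" concrete, the following minimal sketch turns an edge list, such as one exported from a pathway or interaction database, into an adjacency matrix and an unnormalized graph Laplacian over the measured features. The feature names, the edge list, and the function name `graph_laplacian` are hypothetical and chosen only for illustration; real databases also require identifier mapping and, for resources with confidence scores, a thresholding step that is not shown here.

```python
import numpy as np

def graph_laplacian(features, edges):
    """Build the adjacency matrix A and unnormalized Laplacian L = D - A
    of an undirected, unweighted graph over the given features."""
    index = {name: i for i, name in enumerate(features)}
    p = len(features)
    A = np.zeros((p, p))
    for u, v in edges:
        # Ignore self-loops and edges whose endpoints were not measured.
        if u in index and v in index and u != v:
            i, j = index[u], index[v]
            A[i, j] = A[j, i] = 1.0
    D = np.diag(A.sum(axis=1))  # degree matrix
    return A, D - A

# Hypothetical features and edges (not taken from any real database).
features = ["geneA", "geneB", "geneC", "geneD"]
edges = [("geneA", "geneB"), ("geneB", "geneC"), ("geneC", "geneX")]
A, L = graph_laplacian(features, edges)
```

Matrices of this kind are the usual entry point for the graph-based priors, penalties, and network layers discussed in the rest of this review.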

Table 2. Representative databases for biological knowledge.

Database Name | Biological Knowledge | Source
KEGG (Kyoto Encyclopedia of Genes and Genomes) [17] | Molecular interaction, reaction and relation networks | https://www.genome.jp/kegg/pathway.html
Reactome (Reactome Pathway Database) [10] | Signaling and metabolic pathways | https://reactome.org/
GTEx (Genotype-Tissue Expression) [27] | Tissue-specific gene expression | https://gtexportal.org/home/
STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) [29] | Protein-protein interaction networks | https://string-db.org/
PathBank [43] | Metabolic, signaling, disease, drug, and physiological pathways | https://www.pathbank.org/
Pathway Commons [8] | Biological pathway and interactions: biochemical reactions; gene regulatory networks; protein, nucleic acid, small molecule interactions | https://www.pathwaycommons.org/
BioCyc [18] | Metabolic pathways, regulatory networks | https://biocyc.org/
WikiPathways [36] | Signaling pathways | https://www.wikipathways.org/
GRNdb [13] | Gene regulatory networks among transcription factors and genes | http://www.grndb.com/
miRTarBase [16] | miRNA-target interactions | https://mirtarbase.cuhk.edu.cn/~miRTarBase/miRTarBase_2022/php/index.php
GRAND [7] | Gene regulatory networks among transcription factors, miRNAs and genes across biological states | https://grand.networkmedicine.org/
RegNetwork [26] | Gene regulatory networks among transcription factors, miRNAs and genes | https://regnetworkweb.org/
BioGRID (Biological General Repository for Interaction Datasets) [33] | Protein and genetic interactions | https://thebiogrid.org/

Recent methodological research has provided strong evidence that knowledge-guided learning methods for integrative analysis of multi-omics data outperform their counterparts that do not use graph information on both supervised learning and unsupervised learning tasks. The results from knowledge-guided learning methods are more biologically meaningful and interpretable, and provide insights into the molecular mechanisms underpinning complex diseases such as cancer and AD. Knowledge-guided supervised learning methods can be used to construct prediction models for disease risk and progression, and identify important features that are highly associated with clinical end points. Knowledge-guided unsupervised learning methods can be used to identify disease subtypes and perform dimensionality reduction. The approaches utilized by the methods in this review to incorporate prior biological knowledge can be classified into three categories: Bayesian, frequentist, and deep learning. The Bayesian approach permits the use of biological knowledge by specifying prior distributions and producing posterior distributions for parameters in the model. The frequentist approach incorporates biological knowledge through penalty functions and yields point estimates for parameters. Deep learning approaches utilize prior biological knowledge by including graphs constructed based on the biological information (e.g., graph convolutional neural networks) or imposing network constraints (e.g., autoencoders). Fig. 1 summarizes the key components in the three approaches. Additionally, the applications of these methods can be grouped into three categories: prediction and feature selection, clustering, and dimension reduction. Fig. 2 provides a schematic representation of the knowledge-guided multi-omics data integration methods grouped according to their applications.

Fig. 1. Knowledge-guided multi-omics data integration methodology.

Fig. 2. Overview of knowledge-guided multi-omics data integration methods. The methods are grouped based on their applications and are color coded as per their approaches.

In this review, we survey the recent advances in knowledge-guided learning methods for integrative analysis of multi-omics data. We organize our presentation by application: prediction and feature selection, clustering, and dimension reduction. Note that we review knowledge-guided statistical and deep learning methods for multi-omics data integration published since 2018, whereas Zhao et al. [47] focus on knowledge-guided statistical methods for the analysis of single-omics data published up to 2019.

2. Prediction and feature selection

Several knowledge-guided learning methods have been proposed to build prediction rules for clinical outcomes and identify important features that are highly associated with clinical end points. The outcomes include, but are not limited to, disease status, progression-free interval, survival time, and prognosis.

Wang et al. [41] developed a Bayesian framework that integrates multi-omics data and biological knowledge to infer risk genes driving genome-wide association studies (GWAS) signals. Their method probabilistically ranked genes at each GWAS locus based on supporting evidence from multi-omics data and correlation among the genes. The correlation of genes was implicitly derived from the gene network, with the rationale that disease genes are more densely connected and are therefore more highly correlated. Their method was applied to schizophrenia GWAS data. A generic gene–gene network was constructed from Gene Ontology [4]. They identified a set of high-confidence risk genes that are significantly enriched for heritability and are also enriched in targets of approved drugs. Later, Kim et al. [19] proposed an integrative directed random walk-based method utilizing biological knowledge for more effective feature selection and prediction. They designed a directed gene-gene graph for gene expression and copy number data using the biological information from the KEGG database [17]. Then, the integrative directed random walk was applied to the gene-gene graph. When applied to multiple genomic profiles for breast cancer and neuroblastoma, their method identified biologically significant pathways and genes that are highly correlated with cancer and had superior survival prediction performance compared to several state-of-the-art methods.
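The random-walk idea can be illustrated with a generic random walk with restart on a small directed graph. This is a sketch under stated assumptions and not the integrative directed random walk of Kim et al. [19]: the toy adjacency matrix, the initial weights `w0`, the restart probability, and the function name are all hypothetical. The walk spreads each gene's initial weight along directed edges, so genes that receive flow from highly weighted neighbors end up with higher scores.

```python
import numpy as np

def random_walk_with_restart(A, w0, restart=0.3, tol=1e-10, max_iter=1000):
    """Generic random walk with restart on a directed graph.

    A[i, j] = 1 encodes a directed edge i -> j; w0 holds non-negative
    initial node weights (for example, per-gene association statistics).
    """
    A = np.asarray(A, dtype=float)
    out_deg = A.sum(axis=1, keepdims=True)
    # Row-normalize; nodes without outgoing edges keep an all-zero row,
    # in which case some probability mass is lost at each step.
    P = np.divide(A, out_deg, out=np.zeros_like(A), where=out_deg > 0)
    w0 = np.asarray(w0, dtype=float)
    w0 = w0 / w0.sum()
    w = w0.copy()
    for _ in range(max_iter):
        w_next = (1 - restart) * (P.T @ w) + restart * w0
        if np.abs(w_next - w).sum() < tol:
            return w_next
        w = w_next
    return w

# Toy directed graph on four hypothetical genes; every node has an out-edge.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]])
w0 = [0.4, 0.3, 0.2, 0.1]
scores = random_walk_with_restart(A, w0, restart=0.3)
```

In a pathway-based pipeline, scores of this kind could then be used to weight genes when summarizing pathway activity for a downstream survival model, although the exact weighting scheme depends on the method.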

More recently, Ma and Zhang [28] proposed a multi-view autoencoder model with network constraints that can simultaneously integrate multi-omics data and biological knowledge. The biological knowledge was incorporated into the model as inductive biases to increase model generalizability. They employed a graph Laplacian regularizer to encode the biological network into the model architecture. The regularizer reduces the inconsistency between the learned feature representation and the biological network. They applied the model to bladder urothelial carcinoma and brain lower grade glioma datasets as well as The Cancer Genome Atlas (TCGA) pan-cancer dataset. These datasets contain gene expression, miRNA expression, protein expression, DNA methylation, and clinical data. The biological network information was obtained from STRING [29]. Their method outperformed traditional methods and conventional deep learning models on predicting clinical outcomes (e.g., progression-free interval and overall survival) from multi-omics data.

Later, Althubaiti et al. [3] developed a computational model using graph convolutional neural networks built upon a graph constructed from prior knowledge of the functional interactions between genes and their products. In the graph, nodes represent genes, transcripts, and proteins, and edges between nodes represent functional interactions between them. They designed a set of mapping functions to map the information from the multi-omics data to nodes in this graph. A graph convolutional neural network combined with Cox regression was then used to predict patient survival time. They applied the model to multi-omics cancer data from TCGA [42]. The biological information was obtained from STRING [29]. Their method predicted survival time for individual patient samples and outperformed most existing survival prediction methods. They also identified genes that have been demonstrated to be closely related to cancer survival.

Zhao et al. [46] proposed a scalable and interpretable multi-omics feed-forward neural network framework that enables the non-linear combination of variables from different omics datasets and incorporates prior biological information. They used gene-level multi-omics data as the input to the gene layer. The gene layer nodes were connected to a functional module layer according to the prior biological information. Each node in the functional module layer was a non-linear function of the values at the different molecular levels of the genes it contained. By incorporating prior biological knowledge, their model could extract significant modules to help understand the mechanisms underlying diseases. Their method was applied to multi-omics and survival time data from TCGA [42]. The biological information was obtained from KEGG [17] and Reactome [10]. Benchmark experiments showed that their framework outperformed other cutting-edge methods for integrating multi-omics data and predicting survival time. In the case study of lower grade glioma, they identified functional pathways associated with prognosis groups that have been confirmed by previous studies. Thus, knowledge-guided deep learning methods can identify complex non-linear patterns in data while also being efficient for processing large volumes of multi-omics data.
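The idea of wiring a gene layer to a functional module layer according to prior knowledge can be sketched as a masked dense layer, in which a gene may only feed the modules it belongs to. This is an illustrative simplification and not the architecture of Zhao et al. [46]: the gene names, pathway memberships, the `tanh` activation, and the single input matrix (instead of several molecular levels per gene) are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical genes and pathway memberships (not from KEGG or Reactome).
genes = ["g1", "g2", "g3", "g4", "g5"]
pathways = {"pathway_A": ["g1", "g2"], "pathway_B": ["g3", "g4", "g5"]}

# mask[j, k] = 1 if gene j belongs to pathway k, else 0.
mask = np.array([[1.0 if g in members else 0.0
                  for members in pathways.values()]
                 for g in genes])

def masked_layer(X, W, b, mask):
    """Dense layer whose weights are zeroed wherever the prior knowledge
    says a gene does not belong to a functional module."""
    return np.tanh(X @ (W * mask) + b)

X = rng.normal(size=(3, len(genes)))            # 3 samples, gene-level input
W = rng.normal(size=(len(genes), len(pathways)))
b = np.zeros(len(pathways))
H = masked_layer(X, W, b, mask)                 # one activation per module
```

Because each module node depends only on its member genes, the learned activations can be read as pathway-level summaries, which is one reason such architectures are described as interpretable.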

3. Clustering

The goal of clustering is to group patients based on their similarities. It has been widely used to uncover disease subtypes, which is important for developing tailored medicine and providing more precise treatment for individual patients. There are a few knowledge-guided learning methods that incorporate biological knowledge in clustering to improve subtyping accuracy and yield more biologically interpretable results. Li et al. [25] proposed a generalized Bayesian biclustering approach to jointly analyze multi-omics data while incorporating biological information. Their method can handle multiple data types, for example, binomial data such as single nucleotide polymorphism (SNP) data or negative binomial data such as RNA sequencing data. To incorporate biological knowledge, they employed a Bayesian adaptive structured shrinkage prior on the factor loading matrix. The prior encourages one variable to load on a factor if another connected variable has a non-zero loading on the same factor. For example, if two genes are connected in a pathway, they are encouraged to be selected (or not) simultaneously within a bicluster. Therefore, the selected feature set in each bicluster tends to include gene pathways rather than individual genes, resulting in biologically more meaningful results. They conducted biclustering analysis using microarray gene expression data, DNA methylation data, and DNA copy number data from a TCGA study in glioblastoma multiforme. The biological information was obtained from the KEGG database [17]. The higher correlation between subgroups identified by their method and patient survival time compared to other biclustering methods suggested that the clusters detected by their method are more clinically meaningful. Later, Zhang et al. [44] proposed a novel constrained Wishart prior to incorporate biological knowledge in biclustering analysis. The prior encourages the simultaneous selection or non-selection of connected features within a bicluster. 
In addition, their method effectively addresses the diagonal-dominant issue of the graph-incorporated prior in Li et al. [25], and can handle scenarios involving larger network sizes or constructed networks containing noise. Simulation studies showed that their method outperformed existing biclustering methods including Li et al. [25]. When they conducted biclustering analysis using SNPs and gene expression data from ADNI, their method yielded the best clustering performance. Furthermore, all enriched pathways identified by their method had previously been demonstrated to be closely related to AD.

In addition, Lemsara et al. [21] proposed a multi-modal sparse denoising autoencoder framework that allows for the incorporation of biological knowledge to cluster patients. They mapped the multi-omics features to pathways and estimated a per-patient score for each pathway via multi-modal sparse denoising autoencoders. Then, they combined scores of multiple pathways into a profile for each patient, which could then be used to cluster patients. They applied the framework to cluster patients in several cancer datasets from TCGA [42] using gene expression, miRNA expression, DNA methylation and copy number variation. The biological information was obtained from Nature Pathway Interaction Database [38]. Their method identified biologically plausible disease subtypes and showed competitive clustering performance compared with several competing methods.
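The two-stage pattern of first summarizing features into per-patient pathway scores and then clustering the resulting profiles can be sketched as follows. To keep the example short, two stand-ins are used and should be read as assumptions: the autoencoder-derived pathway score of Lemsara et al. [21] is replaced by a mean z-score over pathway members, and the clustering step is a plain k-means loop. Feature names, pathway memberships, and the toy data are hypothetical.

```python
import numpy as np

def pathway_scores(X, feature_names, pathways):
    """Per-patient pathway score, here the mean z-score of member features
    (a simple stand-in for a learned autoencoder embedding).
    Assumes every pathway has at least one measured member."""
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    col = {f: j for j, f in enumerate(feature_names)}
    return np.column_stack([
        Z[:, [col[f] for f in members if f in col]].mean(axis=1)
        for members in pathways.values()
    ])

def kmeans(S, k, n_iter=50, seed=0):
    """Plain k-means on the rows of S; returns a cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = S[rng.choice(len(S), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((S[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = S[labels == c].mean(axis=0)
    return labels

# Toy data: patients 0-2 are high in pathway_A, patients 3-5 in pathway_B.
rng = np.random.default_rng(1)
features = ["g1", "g2", "g3", "g4"]
pathways = {"pathway_A": ["g1", "g2"], "pathway_B": ["g3", "g4"]}
signal = np.array([[3, 3, 0, 0]] * 3 + [[0, 0, 3, 3]] * 3, dtype=float)
X = signal + rng.normal(scale=0.1, size=signal.shape)

S = pathway_scores(X, features, pathways)   # patients x pathways profile
labels = kmeans(S, k=2)
```

Clustering in pathway space rather than feature space means that each cluster can be described directly in terms of which pathways are elevated or suppressed.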

4. Dimension reduction

Projecting high-dimensional multi-omics features into a low dimensional space allows for better understanding and visualization of the structure of the data and helps to assess relationships among multiple data sets. Several knowledge-guided learning methods have been proposed for dimensionality reduction in multi-omics data.

Factor analysis is a popular tool for modeling individual and shared structures in multi-omics data. It infers a lower number of unobserved variables called latent factors that capture the majority of the variation in the original high-dimensional data. Min et al. [30] developed a generalized Bayesian factor analysis framework that can jointly analyze multi-omics data and incorporate biological information. They employed the spike and slab lasso prior to impose sparsity on the factor loadings and the Markov random field prior to incorporate network information. The priors encourage the connected variables to share common factors. They applied the model to transcript profile data, mRNA expression data and proteomics profiling data from NCI-60 cell lines. The biological information was obtained from the KEGG database [17]. The application results showed that their method could deliver more biologically meaningful outcomes than methods that do not incorporate graph information. Later, Bao et al. [5] proposed a hierarchical structural Bayesian factor analysis model that successfully incorporates prior biological information without suffering the phase transition problem in Min et al. [30]. To incorporate biological knowledge, they employed a Bayesian adaptive structured shrinkage prior on the factor loading matrix. The prior encourages connected variables to share common factors. In addition, their method can handle both continuous data (e.g. gene expression data) and discrete data (e.g. SNPs) simultaneously. They used the latent factors learned through integrative analysis of the genotyping, gene expression, brain regional level amyloid deposition data from the ADNI database to predict cognitive score. The gene–gene interaction network and brain functional network were obtained from Greene et al. [15] and Glasser et al. [14], respectively. 
Their method achieved the best prediction accuracy compared with the other state-of-the-art factor-analysis-based methods including Min et al. [30]. More recently, Zhang et al. [45] proposed a novel constrained Wishart prior to incorporate the biological graph knowledge in factor analysis. Their method effectively addresses the diagonal-dominant issue of the graph-incorporated prior in Bao et al. [5], and it is robust to noisy edges that are inconsistent with the structure of the factor loadings. Their method encourages connected variables to share common factors. They applied the factor model to microarray gene expression data, DNA methylation data and DNA copy number data from a TCGA study in glioblastoma multiforme. The biological information was obtained from the KEGG database [17]. Their method outperformed the existing competitors including Min et al. [30] when the learned factors were used in survival analysis, and it detected genes that play important roles in the different subtypes of glioblastoma multiforme.
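The shared structure of these factor-analysis methods can be summarized in generic notation. The symbols below are illustrative assumptions and are not taken from the cited papers, and the Gaussian likelihood covers only the continuous-data case. Each sample x_i, with p features stacked across -omics layers, is explained by K latent factors with K much smaller than p, and the graph enters through the prior on which features load on each factor. One common form of such a prior is a Markov random field on binary inclusion indicators, where E denotes the edge set of the biological graph:

```latex
\[
  x_i = W z_i + \varepsilon_i, \qquad
  z_i \sim \mathcal{N}(0, I_K), \qquad
  \varepsilon_i \sim \mathcal{N}(0, \Psi),
\]
\[
  p(\gamma_{\cdot k}) \;\propto\;
  \exp\Big( a \sum_{j=1}^{p} \gamma_{jk}
          + b \sum_{(j,l) \in E} \gamma_{jk}\,\gamma_{lk} \Big),
  \qquad b > 0,
\]
```

Here W is the p-by-K loading matrix, gamma_{jk} indicates whether feature j has a non-zero loading on factor k, a controls overall sparsity, and b rewards connected features for loading on the same factor. Structured shrinkage priors and constrained Wishart priors pursue the same goal through different constructions.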

Canonical correlation analysis and co-inertia analysis are multivariate statistical methods frequently used in integrative analysis and have become popular in the analysis of multi-omics data. Safo et al. [37] developed statistical methods for sparse canonical correlation analysis that incorporate biological information. To do this, they extended and investigated two types of network-based penalties: the grouped penalty by Pan et al. [34] and the fused lasso penalty by Tibshirani et al. [40]. Their method utilizes biological information such as gene and metabolomic networks to guide selection of important metabolites and transcripts. They conducted integrative analysis of the transcriptomic and metabolomic data from the Predictive Health Institute study. The gene and metabolomic network information was obtained from KEGG [17] and mummichog [23], respectively. Their method identified a number of gene and metabolic pathways that are known to be associated with cardiovascular diseases.

As with canonical correlation analysis, biological knowledge has also been incorporated into co-inertia analysis. Min et al. [32] proposed a co-inertia analysis method that incorporates biological knowledge to assess dependence between two -omics data sets. To incorporate biological knowledge, they adopted the Laplacian penalty function proposed by Li and Li [22]. The penalty function encourages connected variables to be selected or not selected together. Simulation studies demonstrated that their method achieved the best or close to the best performance compared to the existing co-inertia analysis methods. Their method was applied to the integrative analysis of gene expression and protein abundance data from NCI-60 cancer cell lines. The biological information was obtained from the KEGG database [17]. Their method identified biologically meaningful genes and proteins for cancer. Later, Min and Long [31] extended the framework in Min et al. [32] to multiple co-inertia analysis, which can assess relationships and trends in multiple datasets. They employed the Laplacian penalty to incorporate biological information, so that connected variables were encouraged to be selected or excluded together. Their method was applied to two gene expression datasets and one protein abundance dataset from NCI-60 cell line data. The biological information was obtained from the KEGG database [17]. They projected the high-dimensional -omics features to a lower dimensional space and identified a subset of biomarkers that have been suggested in the literature to be related to cancer.
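The network-constrained penalty of Li and Li [22] combines a lasso term with a quadratic form in the normalized graph Laplacian, which equals a sum over edges of squared differences between degree-scaled coefficients. The sketch below evaluates that penalty for a coefficient or loading vector; the toy path graph, the tuning values, and the function names are assumptions for illustration, and how the penalty is embedded in a particular co-inertia or regression objective is not shown.

```python
import numpy as np

def normalized_laplacian(A):
    """Normalized Laplacian of an undirected graph with zero diagonal:
    L[u, u] = 1 if deg(u) > 0, L[u, v] = -A[u, v] / sqrt(d_u * d_v)."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.where(d > 0, d, 1.0)), 0.0)
    L = -A * inv_sqrt[:, None] * inv_sqrt[None, :]
    L[np.diag_indices_from(L)] = (d > 0).astype(float)
    return L

def network_penalty(beta, L, lam1, lam2):
    """lam1 * ||beta||_1 + lam2 * beta' L beta."""
    beta = np.asarray(beta, dtype=float)
    return lam1 * np.abs(beta).sum() + lam2 * (beta @ L @ beta)

# Toy path graph 0 - 1 - 2 with degrees (1, 2, 1).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
L = normalized_laplacian(A)

smooth = np.array([1.0, np.sqrt(2.0), 1.0])    # equal after degree scaling
rough = np.array([1.0, -np.sqrt(2.0), 1.0])    # opposite signs across edges

smooth_cost = network_penalty(smooth, L, lam1=0.0, lam2=1.0)
rough_cost = network_penalty(rough, L, lam1=0.0, lam2=1.0)
```

With the lasso term switched off, the smooth vector incurs essentially zero cost while the rough vector is penalized heavily, which is the mechanism that pushes connected variables to be selected or excluded together.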

5. Conclusion

The knowledge-guided approach to data integration is a powerful strategy for the analysis of multi-omics data in modern biomedical research. This paper reviews some of the recent developments in this space. Although knowledge-guided methods have been shown to yield more biologically meaningful and interpretable results than those that do not use graph information, there is still ample room for methodological development and improvement. One future direction is to account for the noise in the biological knowledge represented by a graph. The graphs extracted from existing databases or derived from subject matter expertise are known to be incomplete and may contain false edges. To the best of our knowledge, existing knowledge-guided methods have largely ignored the important issue of network misspecification and routinely use the given network directly in their models without accounting for noise. We note that while the recent work by [48] addressed a related problem of handling missing edges in only part of the graph via a multiple-imputation approach, their method is limited in scope and cannot be directly applied to diverse and general settings involving noisy networks with varying degrees of misspecification. Li et al. [24] used a latent scale model to account for network noise in graph-guided Bayesian modeling of structured data, which could potentially be extended to integrative analysis of multi-omics data. Another area for future research is the development of computationally efficient algorithms. Most of the existing knowledge-guided Bayesian methods may not scale to the analysis of ultra-high-dimensional -omics data that may include hundreds of thousands or even millions of features.

CRediT authorship contribution statement

Wenrui Li: Conceptualization, Data curation, Investigation, Methodology, Supervision, Visualization, Writing – original draft. Jenna Ballard: Data curation, Investigation, Validation, Writing – original draft. Yize Zhao: Writing – review & editing. Qi Long: Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by National Institutes of Health grants, RF1 AG063481, R01 AG071174, RF1 AG081413 and RF1 AG068191.

Contributor Information

Wenrui Li, Email: wenrui.li@pennmedicine.upenn.edu.

Qi Long, Email: qlong@upenn.edu.

References

  • 1.2022. Therapeutically applicable research to generate effective treatments (TARGET) - NCI. Archive Location: nciglobal, ncienterprise. [Google Scholar]
  • 2.Aging Atlas Consortium Aging Atlas: a multi-omics database for aging biology. Nucleic Acids Res. 2021;49(D1):D825–D830. doi: 10.1093/nar/gkaa894. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Althubaiti S., Kulmanov M., Liu Y., Gkoutos G.V., Schofield P., Hoehndorf R. Deepmocca: a pan-cancer prognostic model identifies personalized prognostic markers through graph attention and multi-omics data integration. bioRxiv. 2021 [Google Scholar]
  • 4.Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25–29. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Bao J., Chang C., Zhang Q., Saykin A.J., Shen L., Long Q., et al. Integrative analysis of multi-omics and imaging data with incorporation of biological information via structural Bayesian factor analysis. Brief Bioinform. 2023;24(2) doi: 10.1093/bib/bbad073. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Barretina J., Caponigro G., Stransky N., Venkatesan K., Margolin A.A., Kim S., et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483(7391):603–607. doi: 10.1038/nature11003. Publisher: Nature Publishing Group. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Ben Guebila M., Lopes-Ramos C.M., Weighill D., Sonawane A.R., Burkholz R., Shamsaei B., et al. GRAND: a database of gene regulatory network models across human conditions. Nucleic Acids Res. 2022;50(D1):D610–D621. doi: 10.1093/nar/gkab778. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Cerami E.G., Gross B.E., Demir E., Rodchenkov I., Babur Ã., Anwar N., et al. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res. 2011;39(suppl_1):D685–D690. doi: 10.1093/nar/gkq1039. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Clark C., Dayon L., Masoodi M., Bowman G.L., Popp J. An integrative multi-omics approach reveals new central nervous system pathway alterations in Alzheimer's disease. Alzheimer's Res Ther. 2021;13(1):1–19. doi: 10.1186/s13195-021-00814-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Croft D., O'Kelly G., Wu G., Haw R., Gillespie M., Matthews L., et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011;39(suppl_1):D691–D697. doi: 10.1093/nar/gkq1018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Curtis C., Shah S.P., Chin S.-F., Turashvili G., Rueda O.M., Dunning M.J., et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012;486(7403):346–352. doi: 10.1038/nature10983. Publisher: Nature Publishing Group. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Edwards N.J., Oberti M., Thangudu R.R., Cai S., McGarvey P.B., Jacob S., et al. The CPTAC data portal: a resource for cancer proteomics research. J Proteome Res. 2015;14(6):2707–2713. doi: 10.1021/pr501254j. Publisher: American Chemical Society. [DOI] [PubMed] [Google Scholar]
  • 13.Fang L., Li Y., Ma L., Xu Q., Tan F., Chen G. GRNdb: decoding the gene regulatory networks in diverse human and mouse conditions. Nucleic Acids Res. 2021;49(D1):D97–D103. doi: 10.1093/nar/gkaa995. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Glasser M.F., Coalson T.S., Robinson E.C., Hacker C.D., Harwell J., Yacoub E., et al. A multi-modal parcellation of human cerebral cortex. Nature. 2016;536(7615):171–178. doi: 10.1038/nature18933. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Greene C.S., Krishnan A., Wong A.K., Ricciotti E., Zelaya R.A., Himmelstein D.S., et al. Understanding multicellular function and disease with human tissue-specific networks. Nat Genet. 2015;47(6):569–576. doi: 10.1038/ng.3259. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Hsu S.-D., Lin F.-M., Wu W.-Y., Liang C., Huang W.-C., Chan W.-L., et al. miRTarBase: a database curates experimentally validated microRNA–target interactions. Nucleic Acids Res. 2011;39(suppl_1):D163–D169. doi: 10.1093/nar/gkq1107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Kanehisa M., Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30. doi: 10.1093/nar/28.1.27. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Karp P.D., Billington R., Caspi R., Fulcher C.A., Latendresse M., Kothari A., et al. The BioCyc collection of microbial genomes and metabolic pathways. Brief Bioinform. 2019;20(4):1085–1093. doi: 10.1093/bib/bbx085. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Kim S.Y., Jeong H.-H., Kim J., Moon J.-H., Sohn K.-A. Robust pathway-based multi-omics data integration using directed random walks for survival prediction in multiple cancer studies. Biol Direct. 2019;14 doi: 10.1186/s13062-019-0239-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Komaki S., Shiwa Y., Furukawa R., Hachiya T., Ohmomo H., Otomo R., et al. iMETHYL: an integrative database of human DNA methylation, gene expression, and genomic variation. Hum Genome Var. 2018;5 doi: 10.1038/hgv.2018.8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Lemsara A., Ouadfel S., Frohlich H. Pathme: pathway based multi-modal sparse autoencoders for clustering of patient-level multi-omics data. BMC Bioinform. 2020;21 doi: 10.1186/s12859-020-3465-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Li C., Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24(9):1175–1182. doi: 10.1093/bioinformatics/btn081.
  • 23.Li S., Park Y., Duraisingham S., Strobel F.H., Khan N., Soltow Q.A., et al. Predicting network activity from high throughput metabolomics. PLoS Comput Biol. 2013;9(7). doi: 10.1371/journal.pcbi.1003123.
  • 24.Li W., Chang C., Kundu S., Long Q. Accounting for network noise in graph-guided Bayesian modeling of structured high-dimensional data. Biometrics. 2024;80(1). doi: 10.1093/biomtc/ujae012.
  • 25.Li Z., Chang C., Kundu S., Long Q. Bayesian generalized biclustering analysis via adaptive structured shrinkage. Biostatistics. 2020;21(3):610–624. doi: 10.1093/biostatistics/kxy081.
  • 26.Liu Z.-P., Wu C., Miao H., Wu H. RegNetwork: an integrated database of transcriptional and post-transcriptional regulatory networks in human and mouse. Database. 2015;2015. doi: 10.1093/database/bav095.
  • 27.Lonsdale J., Thomas J., Salvatore M., Phillips R., Lo E., Shad S., et al. The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013;45(6):580–585. doi: 10.1038/ng.2653.
  • 28.Ma T., Zhang A. Integrate multi-omics data with biological interaction networks using multi-view factorization autoencoder (MAE). BMC Genomics. 2019;20. doi: 10.1186/s12864-019-6285-x.
  • 29.Mering C.v., Huynen M., Jaeggi D., Schmidt S., Bork P., Snel B. STRING: a database of predicted functional associations between proteins. Nucleic Acids Res. 2003;31(1):258–261. doi: 10.1093/nar/gkg034.
  • 30.Min E.J., Chang C., Long Q. Generalized Bayesian factor analysis for integrative clustering with applications to multi-omics data. In: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). IEEE; 2018. pp. 109–119.
  • 31.Min E.J., Long Q. Sparse multiple co-inertia analysis with application to integrative analysis of multi-omics data. BMC Bioinform. 2020;21:1–12. doi: 10.1186/s12859-020-3455-4.
  • 32.Min E.J., Safo S.E., Long Q. Penalized co-inertia analysis with applications to -omics data. Bioinformatics. 2019;35(6):1018–1025. doi: 10.1093/bioinformatics/bty726.
  • 33.Oughtred R., Rust J., Chang C., Breitkreutz B., Stark C., Willems A., et al. The BioGRID database: a comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci, Publ Protein Soc. 2021;30(1):187–200. doi: 10.1002/pro.3978.
  • 34.Pan W., Xie B., Shen X. Incorporating predictor network in penalized regression with application to microarray data. Biometrics. 2010;66(2):474–484. doi: 10.1111/j.1541-0420.2009.01296.x.
  • 35.Petersen R.C., Aisen P.S., Beckett L.A., Donohue M.C., Gamst A.C., Harvey D.J., et al. Alzheimer's disease neuroimaging initiative (ADNI). Neurology. 2010;74(3):201–209. doi: 10.1212/WNL.0b013e3181cb3e25.
  • 36.Pico A.R., Kelder T., Iersel M.P.v., Hanspers K., Conklin B.R., Evelo C. WikiPathways: pathway editing for the people. PLoS Biol. 2008;6(7):e184. doi: 10.1371/journal.pbio.0060184.
  • 37.Safo S.E., Li S., Long Q. Integrative analysis of transcriptomic and metabolomic data via sparse canonical correlation analysis with incorporation of biological information. Biometrics. 2018;74(1):300–312. doi: 10.1111/biom.12715.
  • 38.Schaefer C.F., Anthony K., Krupa S., Buchoff J., Day M., Hannay T., et al. PID: the pathway interaction database. Nucleic Acids Res. 2009;37(suppl_1):D674–D679. doi: 10.1093/nar/gkn653.
  • 39.Tang Z., Fan W., Li Q., Wang D., Wen M., Wang J., et al. MVIP: multi-omics portal of viral infection. Nucleic Acids Res. 2022;50(D1):D817–D827. doi: 10.1093/nar/gkab958.
  • 40.Tibshirani R., Saunders M., Rosset S., Zhu J., Knight K. Sparsity and smoothness via the fused lasso. J R Stat Soc, Ser B, Stat Methodol. 2005;67(1):91–108.
  • 41.Wang Q., Chen R., Cheng F., Wei Q., Ji Y., Yang H., et al. A Bayesian framework that integrates multi-omics data and gene networks predicts risk genes from schizophrenia GWAS data. Nat Neurosci. 2019;22. doi: 10.1038/s41593-019-0382-7.
  • 42.Weinstein J.N., Collisson E.A., Mills G.B., Shaw K.R.M., Ozenberger B.A., Ellrott K., et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45(10):1113–1120. doi: 10.1038/ng.2764.
  • 43.Wishart D.S., Li C., Marcu A., Badran H., Pon A., Budinski Z., et al. PathBank: a comprehensive pathway database for model organisms. Nucleic Acids Res. 2020;48(D1):D470–D478. doi: 10.1093/nar/gkz861.
  • 44.Zhang Q., Chang C., Long Q. Robust knowledge-guided biclustering for multi-omics data. Brief Bioinform. 2024;25(1). doi: 10.1093/bib/bbad446.
  • 45.Zhang Q., Chang C., Shen L., Long Q. Incorporating graph information in Bayesian factor analysis with robust and adaptive shrinkage priors. Biometrics. 2024;80(1). doi: 10.1093/biomtc/ujad014.
  • 46.Zhao L., Dong Q., Luo C., Wu Y., Bu D., Qi X., et al. DeepOmix: a scalable and interpretable multi-omics deep learning framework and application in cancer survival analysis. Comput Struct Biotechnol J. 2021;19. doi: 10.1016/j.csbj.2021.04.067.
  • 47.Zhao Y., Chang C., Long Q. Knowledge-guided statistical learning methods for analysis of high-dimensional -omics data in precision oncology. JCO Precis Oncol. 2019;3:1–9. doi: 10.1200/PO.19.00018.
  • 48.Zhao Y., Chung M., Johnson B.A., Moreno C.S., Long Q. Hierarchical feature selection incorporating known and novel biological information: identifying genomic features related to prostate cancer recurrence. J Am Stat Assoc. 2016;111(516):1427–1439. doi: 10.1080/01621459.2016.1164051.

Articles from Computational and Structural Biotechnology Journal are provided here courtesy of Research Network of Computational and Structural Biotechnology
