Abstract
The integration of multi-omics data with detailed phenotypic insights from electronic health records (EHRs) marks a paradigm shift in biomedical research, offering unparalleled holistic views into health and disease pathways. This review delineates the current landscape of multi-modal omics data integration, emphasizing its transformative potential in generating a comprehensive understanding of complex biological systems. We explore robust methodologies for data integration, ranging from concatenation-based to transformation-based and network-based strategies, designed to harness the intricate nuances of diverse data types. Our discussion extends from incorporating large-scale population biobanks to dissecting high-dimensional omics layers at the single-cell level. The review underscores the emerging role of large language models in artificial intelligence, anticipating their influence as a near-future pivot in data integration approaches. Highlighting both achievements and hurdles, we advocate for a concerted effort towards sophisticated integration models, fortifying the foundation for groundbreaking discoveries in precision medicine.
Keywords: Machine Learning, Multi-Omics Data Integration, Multi-Modal Data Integration, Precision Medicine, Single-cell Omics, Longitudinal Analysis, Imaging Phenotypes, Biobank, Risk Assessment
1. Introduction
The landscape of biological research has undergone a remarkable transformation, courtesy of breakthroughs in high-throughput techniques that have ushered in the era of ‘omics’ science. Spanning across genomics, epigenomics, transcriptomics, proteomics, and metabolomics, the realm of omics offers a kaleidoscopic view of the constituent elements that make up human biology. These layers, each a unique facet of the broader biological puzzle, have long been the subject of unimodal analyses, yielding critical insights into discrete regulatory mechanisms within our biological systems.
However, the advent of sophisticated artificial intelligence (AI), particularly through deep learning and machine learning algorithms, marks a seminal point in our scientific journey. Traditional machine learning frameworks, such as support vector machines and random forests, laid the groundwork for integrating and interpreting the tapestry of multi-omics data. The rise of advanced paradigms—convolutional neural networks, graph neural networks, and recurrent neural networks—has exponentially amplified our capabilities, enabling the decoding of complex patterns and nonlinear relationships sprawling across multi-omics data.
This revolution in data analytics stands at the cusp of a new frontier: the comprehensive and holistic exploration of biological processes. By harnessing the singular attributes and synergistic interactions across various omics strata, machine learning serves as the linchpin in our pursuit of a deeper understanding of life’s fundamental mechanisms, thereby paving the way for innovations in precision medicine.
Significant strides have been made, transitioning from mono-omics to sophisticated multi-omics analyses, spurred by the maturation of data integration techniques and an ever-expanding repository of biological intelligence. Concurrently, the emergence of diverse phenotypic data from electronic health records (EHRs) has added nuanced dimensions that are intimately tied to patient health narratives—ranging from clinical diagnoses and therapeutic histories to vital parameters and advanced imaging findings.
The convergence of multi-layered omics data with rich phenotypic information from EHRs marks an exciting development in biomedical research. In particular, the systematic combination of large-scale multi-omics data with nuanced, patient-centric information from EHRs promises to illuminate previously obscure aspects of human biology. By shedding light on the interplay between molecular constituents and clinical phenomenology, this integrative approach heralds a transformative era in precision medicine.
In this review, we embark on a journey through the dynamic evolution, inherent challenges, innovative methodologies, and untapped potential residing within the realm of multi-modal omics data integration. Standing on the brink of this scientific odyssey, we are witnesses to a nascent revolution in integrative biomedical analysis, one that holds the promise to redefine our understanding of health and the therapeutic strategies of tomorrow.
2. Opportunities in integrative analysis of multi-modal omics data
With the precipitous decline in the costs associated with high-throughput sequencing and other massively parallel biomolecular technologies, a new window of opportunity has opened, offering enhanced accessibility to these advanced tools (1). This progression facilitates their integration into both clinical research and practical applications. In this review, we distinguish between traditional and recent multi-omics analysis, emphasizing the emerging opportunities in multi-modal omics data integration. Specifically, we highlight how the expanded scale and improved resolution of samples now available are forging new frontiers for multi-omics data acquisition.
Conventional multi-omics approaches largely center on the analysis of bulk samples. Although these methodologies have been instrumental in providing insights into specific tissues or cell populations and in furthering our understanding of dominant pathways or disease pathogenesis, they inherently average data across numerous cells. Consequently, while notable achievements have been made in describing a broad overview of biological processes, integrative analysis with these traditional techniques often falls short when it comes to deciphering the intricate cellular heterogeneity within samples. Moreover, these approaches may not capture subtle yet critical differences at the population level, such as variation in biological pathways across races or genders. The remaining gaps in our comprehension of human biology could potentially be bridged by the integrated analysis of emerging multi-omics datasets, specifically multi-modal omics data sourced from large-scale biobanks and high-resolution single-cell multi-omics data.
2.1. Multi-modal omics data integration analysis at the population level
Large-scale national or community-based biobanks have started to incorporate heterogeneous phenotypic data derived from EHRs alongside conventional multi-omics data (2) (Table 1). Biobank-scale multi-omics data integration is transformative in that it offers a shift from random bulk samples to organized population-level datasets. This shift allows individual characteristics to be represented across a broader spectrum of the population, incorporating variation due to factors such as age, ethnicity, and lifestyle. This improved representation lends itself to a more holistic and real-world applicable understanding of human biology, particularly by unveiling new insights into disease risk and manifestation. Notably, integrating omics data with the observable traits or characteristics of individuals has the significant advantage of enabling determination of the etiology of, or biomarkers for, specific diseases (3).
Table 1.
Multi-omics projects in national and community-based cohorts for biomedical research. Omics-specific projects and consortia are listed in Supplementary Table 1.
| Biobank | Project Characteristics | Omics Data | References | |||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Year of launch | Genomics | Epigenetics | Transcriptomics | Proteomics | Metabolomics | |||||
| ✔ | ✔ | ✔ | ✔ | ✔ | (13, 14) | |||||
| IPOP | 2012 | ✔ | ✔ | ✔ | ✔ | ✔ | (15, 16) | |||
| ✔ | ✔ | ✔ | (17–20) | |||||||
| ✔ | ✔ | ✔ | ✔ | (21–25) | ||||||
| ✔ | ✔ | ✔ | ✔ | (26–28) | ||||||
| ✔ | ✔ | ✔ | ✔ | (29–31) | ||||||
| Estonian Biobank | 2002 | ✔ | ✔ | (32, 33) | ||||||
| ✔ | ✔ | ✔ | - | |||||||
| ✔ | ✔ | ✔ | (34) | |||||||
| ✔ | ✔ | ✔ | (35) | |||||||
| ✔ | ✔ | ✔ | (36) | |||||||
| ✔ | ✔ | ✔ | ✔ | ✔ | (37, 38) | |||||
| ✔ | ✔ | (39, 40) | ||||||||
| KoCAS | 2005 | ✔ | ✔ | (41, 42) | ||||||
| ✔ | ✔ | ✔ | (43, 44) | |||||||
| SPHS | 2003 | ✔ | ✔ | ✔ | (45) | |||||
Abbreviations: USA, United States of America; UK, United Kingdom; TOPMed, Trans-Omics for Precision Medicine; IPOP, Integrated Personal ‘Omics Project; BIOS, Biobank-based integrative omics study; jMorp, Japanese Multi-Omics Reference Panel; GNHS, Guangzhou Nutrition and Health Study; KoGES, Korean Genome and Epidemiology Study; KoCAS, Korean Children-Adolescent Cohort Study; CHAIN, The Childhood Acute Illness and Nutrition; SPHS, Singapore Population Health Study.
2.1.1. Exploring disease etiology with multi-omics at the population level.
Integration of multi-omics and phenotypic data from biobanks enables the development of molecular-level disease understanding at the population level. Biomarkers have unique advantages in providing early assessment of patient health, prognostic efficacy analysis, and accurate disease staging and typing. Biomarkers discovered through the incorporation of EHR and omics data can provide guidance and evidence for disease diagnosis and prognosis in clinical use.
2.1.2. Longitudinal multi-modal omics integration.
Integrated analysis of phenotypic and multi-omics data facilitates longitudinal integration of multi-modal omics data, an emerging approach that combines data collected over extended periods from the same samples (4). This longitudinal perspective can reveal how biological systems evolve over time, highlighting trends, patterns, and associations that might not be apparent from cross-sectional studies. Applications of longitudinal multi-omics integration include studying disease progression, identifying biomarkers for early diagnosis, understanding the effects of therapies, and uncovering previously hidden relationships between different biological layers (5). A longitudinal approach can thus offer a holistic view of the underlying mechanisms driving health, disease, and responses to treatments.
2.1.3. Incorporating heterogeneous clinical data with multi-omics.
The integration of omics data with diverse clinical information, including medical images, has become increasingly feasible. Imaging techniques such as computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography (PET) offer detailed insights into the structure and function of tissues and organs within the body. To effectively analyze the nuanced interplay between omics data and medical images, it is imperative that both kinds of data be sourced from the same sample. However, traditional multi-omics analyses often do not support such cross-examination, because matched samples are guaranteed only within highly structured and meticulously designed cohorts. Biobanks present a solution, as they collect both imaging phenotypes and omics data from the same individuals, thereby enabling more straightforward combined analysis.
In addition to integrative and longitudinal analyses with heterogeneous phenotypic data, a significant advantage of multi-omics approaches is the ability to extract traditional single-omics data at the population level. This feature is particularly beneficial for integrative genetics research, allowing for the use of diverse methodologies. The strength of multi-omics approaches lies in their ability to synthesize and interpret these data types collectively, enhancing our understanding of complex biological systems and phenotypes:
Epigenome-wide association studies (EWAS) focus on the integration of epigenomic data, such as DNA methylation and histone modifications, with clinical phenotypes. This approach is particularly useful for investigating the impact of epigenetic modifications on disease susceptibility and progression (6). EWAS can uncover epigenetic markers associated with diseases, shedding light on how environmental factors interact with an individual’s genetic makeup to influence gene expression patterns.
Transcriptome-wide association studies (TWAS) involve the integration of transcriptomic and genome-wide association study (GWAS) data. By leveraging gene expression information from tissues or cell types relevant to a specific trait, derived from sources such as GTEx, PsychENCODE, and CommonMind, TWAS can identify genes whose expression levels are associated with a trait of interest (7–9).
Proteome-wide association studies (PWAS) have emerged to explore the connections between protein profiles and health conditions. PWAS identifies associations between changes in protein expression levels and specific health outcomes, which offers insights into complex disorders and provides potential biomarkers and treatment targets for precision medicine (10, 11).
Metabolome-wide association studies (MtWAS) explore the connection between metabolites—small molecules involved in biochemical processes—and traits or diseases. Integrating metabolomic data with genetic information can reveal metabolic pathways and networks that are linked to specific physiological conditions (12).
2.2. High-resolution multi-omics data integration analysis from single cells
While biobanks offer scale and better population representation, single-cell omics dives deeper into specific biological and cellular mechanisms, enhancing the resolution to the level of individual cells. That is, instead of understanding a cell population as a homogenous entity and aggregating data from millions of cells, single-cell analyses recognize and chart the heterogeneity within, providing a detailed, cell-specific landscape that reveals cellular intricacies often overshadowed in bulk analyses. By comprehensively profiling biological molecules—such as DNA (genomics), RNA (transcriptomics), proteins (proteomics), and metabolites (metabolomics)—from individual cells, single-cell multi-omics data elucidate the unique molecular signature of each cell (46). Understanding the heterogeneity within cell populations is crucial for understanding complex biological processes, disease mechanisms, and developmental pathways; most of all, it is essential for understanding the comprehensive interplay of molecular processes in health and disease. In this review, we introduce single-cell omics technology by highlighting its utilization of various omics and discuss opportunities for integration analysis accordingly.
2.2.1. Integration of genomic and transcriptomic data
Genome and Transcriptome Sequencing (G&T-seq) simultaneously measures genomic DNA and mRNA from single cells, elucidating the genetic makeup, variations, and gene expression profile (47). gDNA-mRNA sequencing (DR-seq) likewise investigates the relationship between genome and transcriptome in individual cells, enabling an understanding of cellular behavior and functional characteristics (48). Both of these methods unveil genetic and transcriptomic heterogeneity within tissues, improving precision in the identification of cellular origins of diseases and potential therapeutic targets. Detailed descriptions of each single-cell omics method are provided in the Supplementary Text 1.
2.2.2. Integration of transcriptomic and proteomic data
Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-seq) combines the specificity of antibody detection with next-generation sequencing through tagging antibodies with unique oligonucleotide barcodes to label proteins in cells (49); mRNA from the same cells is sequenced in parallel. This method provides a comprehensive view of cell functions, accounting for differences between protein and mRNA levels. RNA expression and protein sequencing (REAP-seq) is an integrated single-cell sequencing method likewise designed to concurrently profile the transcriptomes and proteomes of individual cells (50), which provides invaluable insights into cellular function and regulation. Both of these methods combine RNA sequencing with protein-level insights in the same cell.
2.2.3. Integration of epigenetic and transcriptomic data
Transcripts, Epitopes, and Accessibility sequencing (TEA-seq) simultaneously measures the transcriptome, surface protein epitopes, and chromatin accessibility in the same single cells, elucidating how the epigenetic state interacts with other regulatory layers (51). ATAC with Select Antigen Profiling by sequencing (ASAP-seq) pairs chromatin accessibility profiling with antibody-based protein detection in single cells (52), offering insights into the regulation of gene expression. Together, these methods provide granular insights into chromatin accessibility alongside transcript and protein abundance at the individual cell level, a resolution not attainable with bulk methods.
2.2.4. Integration of multi-modal single-cell omics with phenotypic data
Integrating single-cell multi-omics with phenotypic data involves associating the detailed molecular profiles of individual cells with their observable characteristics or functions. This comprehensive approach allows for a more holistic understanding of how molecular changes influence cell behavior and the overall organismal phenotype; furthermore, by correlating molecular profiles with phenotypic responses, researchers can better identify therapeutic targets and predict drug responses (53).
3. Machine learning for multi-modal omics data integration analysis
Many researchers have examined the challenges that arise during data analysis of either single omics or multi-omics data and have proposed solutions tailored to each scenario (54, 55). In this review, we briefly introduce representative machine learning approaches for data integration. Subsequently, we provide a more detailed discussion on integrating imaging phenotypes with multi-omics and the analysis of longitudinal multi-omics data.
3.1. Integration strategies for utilizing machine learning with multi-modal data
Analyzing multiple omics data presents significant challenges due to the inherent data heterogeneity that arises from the diverse techniques and platforms used for generating different omics data types. These datasets can typically be visualized as structured matrices, with rows representing samples and columns representing the biological features specific to each omics category. Tasks may involve classification at the sample level, such as predicting cell types or stratifying patients, or identifying biomarkers at the feature level. In either case, machine learning methodologies provide robust solutions for the integration and interpretation of such intricate data.
A major hurdle with multi-omics data is its high dimensionality, a common characteristic even for data of a single omics type. This problem is further complicated when integrating data from multiple omics sources, as overlapping samples may be lacking. To combat high dimensionality, researchers often turn to feature selection methods, which may be filter-based, wrapper-based, or embedded. Feature extraction techniques such as principal component analysis, canonical correlation analysis, non-negative matrix factorization, and autoencoders are also widely used.
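To make the filter-based option concrete, the following Python sketch ranks features by their variance across samples and keeps the top k. It is only a minimal illustration under stated assumptions: the function name `top_variance_features`, the feature names, and the toy matrix are hypothetical, and variance is just one of many possible filter criteria.

```python
from statistics import pvariance

def top_variance_features(matrix, feature_names, k):
    """Filter-based selection: keep the k features with the largest variance.

    matrix is a list of samples (rows), each a list of feature values.
    """
    columns = list(zip(*matrix))  # one tuple of values per feature
    ranked = sorted(
        zip(feature_names, columns),
        key=lambda item: pvariance(item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Toy data (hypothetical): 4 samples x 3 features.
feature_names = ["cpg_1", "gene_2", "protein_3"]
matrix = [
    [1.0, 0.0, 1.0],
    [1.0, 10.0, 2.0],
    [1.0, 0.0, 3.0],
    [1.0, 10.0, 4.0],
]

selected = top_variance_features(matrix, feature_names, k=2)
print(selected)
```

A filter of this kind is independent of any downstream model, which is what distinguishes it from wrapper-based and embedded approaches.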
Another layer of complexity arises from issues related to data sparsity and the task of interpreting the results. To weave together and make sense of multi-omics data, researchers employ a spectrum of methodologies ranging from traditional machine learning to state-of-the-art deep learning, all tailored to data characteristics and the objective of the analysis. Here, we discuss three representative integration strategies: concatenation-based integration, transformation-based integration, and network-based integration (Fig. 1).
Figure 1. Overview of multi-modal omics data integration analysis
Concatenation-based integration stands out as one of the most straightforward strategies for amalgamating multi-omics data. It involves directly combining (or joining) features from different omics datasets through aligning them side-by-side (56–58). To illustrate, if one possesses two matrices of data, one from genomics and another from transcriptomics, pertaining to the same set of samples, the concatenation approach would position the columns of one matrix adjacent to those of the other, culminating in a broader matrix. With such a concatenated feature matrix, we can use any kind of supervised learning for analysis; moreover, this approach is straightforward to implement and understand because the matrix preserves the original features derived from each omics method. Nonetheless, the concatenated matrix often demands supplementary preprocessing steps, such as normalization or batch effect mitigation, as data from different omics sources might operate on different scales or distributions.
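The concatenation step, together with the per-block scaling that it usually requires, can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended pipeline: the function names and toy matrices are hypothetical, and z-scoring stands in for whatever normalization or batch correction a real analysis would need.

```python
from statistics import mean, pstdev

def zscore_columns(matrix):
    """Standardize each column (feature) to zero mean and unit variance."""
    scaled_columns = []
    for col in zip(*matrix):
        mu, sd = mean(col), pstdev(col)
        # A constant feature carries no information; map it to zeros.
        scaled_columns.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_columns)]

def concatenate_omics(*blocks):
    """Place standardized omics blocks side by side, sample by sample."""
    n_samples = len(blocks[0])
    if any(len(block) != n_samples for block in blocks):
        raise ValueError("all omics blocks must describe the same samples")
    scaled = [zscore_columns(block) for block in blocks]
    return [sum((block[i] for block in scaled), []) for i in range(n_samples)]

# Toy data (hypothetical): 4 samples, 2 genomic and 3 transcriptomic features.
genomics = [[0, 1], [1, 1], [2, 0], [1, 2]]
transcriptomics = [
    [5.1, 120.0, 0.3],
    [4.8, 98.0, 0.9],
    [6.0, 143.0, 0.1],
    [5.5, 110.0, 0.4],
]

combined = concatenate_omics(genomics, transcriptomics)
print(len(combined), "samples x", len(combined[0]), "features")
```

The resulting matrix has one row per sample and one column per original feature, so it can be passed to any supervised learner.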
Transformation-based integration blends multi-omics data by mapping the data from various omics sources into a shared feature space or framework. Once the data are mapped to this unified space, it becomes feasible to identify relationships across different omics layers; hence, this method excels at revealing hidden relationships and patterns within and between omics sets (59). However, in contrast to the direct concatenation-based method, transformation-based integration can obscure the original features, making it difficult to derive direct biological or clinical insights. Several machine learning methods are frequently employed for transformation-based integration. For example, canonical correlation analysis finds linear combinations of features from two omics datasets that correlate maximally, making it suitable for two-omics data integration. Multiple kernel learning (MKL) combines several kernels (each corresponding to a different omics data type) to optimize a single learning task. Each kernel gauges similarity within its respective omics type, and MKL fuses them into a consolidated kernel space. Non-negative matrix factorization breaks down data into products of non-negative matrices and, when applied to multi-omics data, reveals shared patterns or meta-genes. Finally, deep learning models such as autoencoders are trained to jointly represent multi-omics data, with the bottleneck layer encapsulating a compressed, integrated representation of the input datasets.
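The kernel fusion idea behind MKL can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the kernel weights are fixed by hand, whereas a real MKL method learns them jointly with the prediction task, and the function names and toy data are hypothetical.

```python
import math

def linear_kernel(block):
    """Gram matrix of inner products between samples (rows of block)."""
    return [[sum(a * b for a, b in zip(x, y)) for y in block] for x in block]

def cosine_normalize(kernel):
    """Rescale so every sample has self-similarity 1, making kernels comparable."""
    n = len(kernel)
    diag = [math.sqrt(kernel[i][i]) for i in range(n)]
    return [[kernel[i][j] / (diag[i] * diag[j]) for j in range(n)] for i in range(n)]

def fuse_kernels(kernels, weights):
    """Weighted sum of per-omics kernels; weights are assumed, not learned."""
    if len(kernels) != len(weights):
        raise ValueError("one weight is required per kernel")
    n = len(kernels[0])
    return [
        [sum(w * k[i][j] for w, k in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]

# Toy data (hypothetical): 3 samples profiled by two omics types.
transcriptomics = [[1.0, 2.0, 0.5], [0.9, 2.2, 0.4], [3.0, 0.1, 2.5]]
methylation = [[0.2, 0.8], [0.3, 0.7], [0.9, 0.1]]

kernels = [cosine_normalize(linear_kernel(b)) for b in (transcriptomics, methylation)]
fused = fuse_kernels(kernels, weights=[0.5, 0.5])
```

The fused matrix is a single sample-by-sample similarity that a kernel method, such as a support vector machine with a precomputed kernel, could consume directly.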
Network-based integration is used to integrate and analyze multi-omics data in the context of biological networks (60–62); it leverages the structured relationships between biological entities, such as genes, proteins, or metabolites, to provide context and improve the interpretation of multi-omics data. This method essentially visualizes and analyzes data in the form of networks, where nodes represent biological entities and edges represent relationships or interactions between them. Once the multi-omics data are converted to an adjacency or similarity matrix, kernel-based algorithms or graph neural networks can be applied. Thus, network-based integration transforms multi-faceted omics data into a structured, interconnected framework that facilitates a holistic view of biological systems. The interconnected nature of networks naturally complements the intricate relationships within and between omics datasets, offering an intuitive platform for multi-omics data exploration and hypothesis generation.
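As a rough sketch of the conversion from omics matrices to a network, the Python code below builds a sample-by-sample correlation matrix per omics layer, averages the layers, thresholds the result into an adjacency matrix, and reads off connected components as candidate sample groups. The averaging rule, the cutoff of 0.5, the function names, and the toy data are all illustrative assumptions. Published network fusion methods and graph neural networks are considerably more elaborate.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long feature vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def similarity_matrix(block):
    """Sample-by-sample similarity for one omics layer."""
    return [[pearson(a, b) for b in block] for a in block]

def fuse_and_threshold(similarities, cutoff):
    """Average the layers and keep an edge where mean similarity >= cutoff."""
    n = len(similarities[0])
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                avg = sum(s[i][j] for s in similarities) / len(similarities)
                adjacency[i][j] = 1 if avg >= cutoff else 0
    return adjacency

def connected_components(adjacency):
    """Depth-first search over the sample network."""
    n, seen, components = len(adjacency), set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in range(n):
                if adjacency[node][neighbor] and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components

# Toy data (hypothetical): 4 samples, two omics layers with 3 features each.
expression = [[1, 2, 3], [1.1, 2.1, 2.9], [3, 2, 1], [2.9, 2.2, 1.0]]
methylation = [[0.1, 0.5, 0.9], [0.2, 0.5, 0.8], [0.9, 0.4, 0.1], [0.8, 0.5, 0.2]]

layers = [similarity_matrix(expression), similarity_matrix(methylation)]
adjacency = fuse_and_threshold(layers, cutoff=0.5)
groups = connected_components(adjacency)
print(groups)
```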
3.2. Integrative analysis through combining multi-omics and imaging phenotypes
Biobanks that link patient EHRs with omics data have paved the way for developing an unprecedented degree of insight into disease risk and manifestation. As imaging modalities such as CT, MRI, and PET provide detailed insights into the structure and function of tissues and organs, the recent inclusion of medical imaging in EHRs has opened up the potential for finer phenotyping and developing an enhanced understanding of how various structures correlate with disease. Radiomics transforms these routine medical images into quantitative data that can be mined, allowing image features to be analyzed alongside other biological information. However, image features encompass comprehensive characteristics, including pathological conditions, and differ in modality from other data sources; as such, integrated analysis of imaging with omics data demands significant computational effort. Machine learning is essential for the various stages of this integrated analysis, including data preprocessing, feature selection and extraction, designing integration approaches, model selection, and model evaluation. Notably, deep learning-based algorithms are well-suited for handling the complexity of multi-modal data and also for identifying intricate patterns and correlations (63). By building descriptive and predictive models, deep learning methods enable researchers to extract valuable insights from the integrated data, contributing to a more comprehensive understanding of complex diseases and enhancing healthcare decision-making.
Radiogenomics, which specifically integrates medical imaging with genetic data, aims to identify disease mechanisms, predict prognosis, and assess treatment responses through investigating the relationships between imaging features and genetic factors; in particular, it can elucidate genomic risk factors associated with a phenotype of interest and the genetic architecture shared between phenotypes and morphologies. This approach has been applied to phenotypes derived from imaging of specific tissues or organs, including structural and functional brain imaging (64–66), cardiac imaging (67), retinal imaging (68, 69), and lung imaging (70). Beyond genomics, further incorporating diverse multi-modal omics data into such analyses can provide supplementary insights into the pathobiological pathways and disease-associated risk factors. Such integrated studies have been systematically conducted across a broad spectrum of diseases such as autism spectrum disorder, oral diseases, and lung cancer (70, 71). In addition, a metagenomic study of the gut microbiome has revealed associations between protective bacteria, cognitive traits, and brain regional volumes derived from imaging (72). The integration of microbiome, metabolomics, cytokine measurements, cognitive assessments, and brain imaging data in that study provides further evidence of the greater insights that can be obtained by utilizing more data modalities.
3.3. Machine learning for improved phenotyping
Machine learning plays a critical role in enhancing phenotyping accuracy, enabling quantitative measurements to more precisely represent disease-related features. For instance, machine learning algorithms have been employed to predict vertical cup-to-disc ratio, a critical parameter for glaucoma diagnosis. By overcoming the need for costly and time-intensive expert labeling, machine learning facilitates the identification of significant loci; thus, future work in machine-learning-based phenotyping will facilitate omics integration studies (73).
3.4. Longitudinal multimodal omics data integration
Longitudinal multi-omics integration combines data obtained over time from the same samples, shedding light on biological evolution and patterns not seen in cross-sectional studies. It is therefore beneficial in studying disease progression, pinpointing early diagnostic biomarkers, evaluating therapy effects, and understanding interconnected biological mechanisms influencing health and treatment responses. However, longitudinal multi-omics integration also presents numerous challenges that demand sophisticated computational and analytical strategies. Specific to longitudinal data, challenges like temporal misalignment, missing data, and irregular sampling intervals require robust methods for data preprocessing and normalization. Interdependency between multiple time points from the same sample, unbalanced datasets, and unexpected outlier events add to the data complexity. Finally, the time element further amplifies the high-dimensional nature of multi-omics data, an issue compounded by the typically limited number of samples and high sample variability.
Methods developed for longitudinal multi-omics data integration typically have the following components: 1) preprocessing, 2) modeling, and 3) analysis (e.g. predicting a phenotype, identifying omics biomarkers for the progression of a phenotype, extracting clusters of omics features that have similar patterns, etc.). Preprocessing tasks may involve excluding subjects with limited time points or low variation, imputation of missing data, normalization, and feature selection to identify the most informative elements for subsequent analysis. Preprocessing and modeling may be performed sequentially or simultaneously depending on the framework. For example, Bodein et al. developed timeOmics, a longitudinal multi-omics integration framework that performs each component sequentially (74). Preprocessing in this framework includes a fold-change-based filter to focus on time-sensitive molecules; after preprocessing, a linear mixed model spline framework is used to model the expression of each biological feature while considering inter-individual variability. It then uses six unsupervised integration methods to cluster multi-omics expression profiles and identify key molecular features per cluster. Metwally et al. follow a similar structure with their proposed method, OmicsLonDA (75), in which they first input preprocessed data, then apply a Gaussian smoothing spline regression model in a semi-parametric approach to discern significant time intervals of omics features between study groups.
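A fold-change-based filter of the kind mentioned above can be sketched as follows. This is not the timeOmics implementation: the definition of fold change (largest ratio between time-point means, after adding a pseudocount), the threshold of 1.5, the function names, and the toy data are all assumptions made for illustration, and the sketch assumes non-negative abundances measured at the same time points in every subject.

```python
from statistics import mean

def max_fold_change(trajectory, pseudocount=1.0):
    """Largest ratio between any two time points of a mean trajectory."""
    shifted = [value + pseudocount for value in trajectory]
    return max(shifted) / min(shifted)

def filter_time_sensitive(profiles, threshold=1.5):
    """Keep features whose across-subject mean changes enough over time.

    profiles maps feature -> {subject: [values at aligned time points]}.
    Returns the mean trajectory of every retained feature.
    """
    kept = {}
    for feature, by_subject in profiles.items():
        # Average across subjects at each time point.
        mean_trajectory = [mean(values) for values in zip(*by_subject.values())]
        if max_fold_change(mean_trajectory) >= threshold:
            kept[feature] = mean_trajectory
    return kept

# Toy data (hypothetical): two features, two subjects, three time points.
profiles = {
    "gene_A": {"s1": [10, 20, 40], "s2": [12, 18, 44]},
    "metabolite_B": {"s1": [5.0, 5.2, 4.9], "s2": [5.1, 5.0, 5.3]},
}

kept = filter_time_sensitive(profiles)
print(sorted(kept))
```

Features that pass such a filter would then be handed to the modeling step, for example a spline fit per feature.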
In addition to statistical methods, recurrent neural networks (RNNs) are a prevalent deep learning method for integrating longitudinal multi-omics or multi-modal data. Their effectiveness is clear from their success in other areas such as natural language processing, time series prediction, and speech recognition (76, 77). The advantages of using RNN-based models are three-fold. First, they process sequences by iterating through individual elements while retaining a memory state that captures the context of prior elements. Second, the same parameters are used across different time steps, promoting compactness and generalization. Third, RNNs are naturally equipped to handle sequences of varying lengths, making them flexible and able to address a variety of tasks (78). However, while vanilla RNNs introduced the concept of processing sequences, they struggle with long sequences due to challenges like vanishing and exploding gradients (79). Modern adaptations, such as long short-term memory networks (LSTMs) and gated recurrent units (GRUs), were designed to overcome these issues and are now more prevalent in practice.
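The three properties listed above can be seen in a bare-bones GRU forward pass, sketched here in plain Python. The sketch makes several assumptions: weights are random and untrained, bias terms are omitted, the reset gate is applied to the hidden state before the recurrent product (one of two common GRU conventions), and the class name and toy sequences are hypothetical. In practice one would use a deep learning framework.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(weights, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

class GRUCell:
    """Minimal gated recurrent unit: forward pass only, no biases, no training."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.hidden_size = hidden_size
        # One input and one recurrent matrix per gate:
        # update (z), reset (r), and candidate state (n).
        self.W = {gate: init(hidden_size, input_size) for gate in "zrn"}
        self.U = {gate: init(hidden_size, hidden_size) for gate in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b)
             for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], gated_h))]
        # The memory state blends the previous state with the candidate.
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def encode(self, sequence):
        """Apply the same parameters at every time step; any length is accepted."""
        h = [0.0] * self.hidden_size
        for x in sequence:
            h = self.step(x, h)
        return h

# Toy data (hypothetical): two subjects with different numbers of visits.
cell = GRUCell(input_size=2, hidden_size=3)
short_series = [[0.2, 1.0], [0.4, 0.9]]
long_series = [[0.1, 1.2], [0.3, 1.1], [0.5, 0.8], [0.9, 0.4]]

short_code = cell.encode(short_series)
long_code = cell.encode(long_series)
```

Both sequences are reduced to a vector of the same size, which is what allows longitudinal records of unequal length to feed a common downstream model.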
When applied to longitudinal multi-omics integration, some frameworks using these models combine the preprocessing, modeling, and analysis stages, while others approach them as separate, consecutive steps. Lee et al. used the latter approach by training a unique GRU for each data modality to transform longitudinal data into fixed-size vectors (80). The respective vectors for each modality were then concatenated and merged into an integrated vector, which was used to predict progression from mild cognitive impairment to Alzheimer’s disease (AD) using an l1-regularized logistic regression. The former approach of combining steps is possible because RNNs are able to impute missing values or interpolate between observed time points, creating continuous sequences for further analysis. Nguyen et al. adapted minimalRNN to simultaneously impute missing data and predict AD progression (81), and found it to perform better than typical interpolation methods such as forward and linear filling and other baseline methods. Jung et al. developed a deepRNN framework which combines imputation, encoding, and analysis to predict clinical status trajectories (82).
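For reference, the two interpolation baselines mentioned above, forward filling and linear filling, can be written in a few lines of Python. This sketch covers only these baselines, not the RNN-based imputation itself. The function names, the use of `None` to mark a missing visit, and the toy series are assumptions for illustration.

```python
def forward_fill(values):
    """Carry the last observed value forward; leading gaps stay missing."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def linear_fill(times, values):
    """Interpolate linearly between the nearest observed neighbors.

    Gaps before the first or after the last observation take the
    nearest observed value.
    """
    observed = [(t, v) for t, v in zip(times, values) if v is not None]
    filled = []
    for t, v in zip(times, values):
        if v is not None:
            filled.append(v)
            continue
        before = [(to, vo) for to, vo in observed if to < t]
        after = [(to, vo) for to, vo in observed if to > t]
        if before and after:
            (t0, v0), (t1, v1) = before[-1], after[0]
            filled.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
        elif before:
            filled.append(before[-1][1])
        elif after:
            filled.append(after[0][1])
        else:
            filled.append(None)  # nothing observed at all
    return filled

# Toy data (hypothetical): one biomarker at months 0, 6, 12, and 24.
months = [0, 6, 12, 24]
biomarker = [1.0, None, 2.0, None]

print(forward_fill(biomarker))
print(linear_fill(months, biomarker))
```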
Ultimately, longitudinal multi-omics integration has the potential to uncover previously hidden relationships between different biological layers and glean insights that remain concealed in cross-sectional studies. However, the path to such insights is strewn with challenges—from the intricacies of temporal data to the high dimensionality of multi-omics data. The advent of deep learning, especially RNNs and their advanced derivatives, offers compelling strategies for managing the complexities inherent to this data. As this field continues to evolve, it will be crucial to refine existing methodologies and develop new tools, ensuring realization of the full potential of longitudinal multi-omics integration for health and disease.
4. Downstream multi-modal omics data integration analyses
In this section, we review several studies for downstream applications that leverage the vast resources of large biobanks to uncover valuable insights into complex traits and diseases. These illustrate the scale and potential of multi-modal omics data integration.
4.1. Biobank-scaled multi-modal omics data integration
Large-scale biobanks have paved the way for extensive multi-modal omics data integration, which in turn enables a deeper understanding of genetic, epigenetic, and metabolic contributions to human health. We explore the key applications of GWAS, EWAS, TWAS, and MtWAS, which incorporate phenotypic data with genetic, epigenetic, transcriptomic, and metabolomics data respectively.
TWAS can identify genes whose expression levels are associated with a particular trait. Integrating TWAS results into polygenic risk scores (PRS) can enhance the predictive power of PRS models, more accurately capturing the functional consequences of genetic variants and hence improving disease risk and other phenotypic outcome predictions. Recent studies on complex neuropsychiatric traits (83), chronic obstructive pulmonary disease (COPD) (84), and other phenotypes have demonstrated improved prediction performance and enhanced cross-ancestry PRS portability (84, 85). EWAS aids in discovering epigenetic markers related to a disease, providing insights into the interaction between environmental factors and individual genetics that influence gene expression. It also highlights the genes, pathways, and molecular mechanisms associated with common and complex traits like attention-deficit hyperactivity disorder (ADHD) (86), speech sound disorders (87), and coronary artery disease (CAD) (88). In addition, EWAS has been instrumental in identifying specific genetic factors that contribute significantly to diseases known for their high polygenic risk burdens, such as ADHD (89) and autism (90). Integrating metabolomic data with genetic information can further unveil metabolic pathways and networks that are tied to specific physiological states; for instance, Kojouri et al. examined the causal associations between physical activity, body mass index, and metabolites (12). More broadly, MtWAS can identify metabolites linked to disease risk, giving a more comprehensive view of disease etiology and potential biomarkers, and has been employed for conditions like depression, chronic kidney disease (CKD), and neovascular age-related macular degeneration (91, 92). Several studies have further inferred the causality of metabolites in human diseases using Mendelian randomization (93, 94).
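As a point of reference for the PRS models discussed above, a standard additive score is a weighted sum of effect-allele dosages. The sketch below shows only this baseline form; it does not reproduce the TWAS-informed weighting of the cited studies, and the variant identifiers, effect sizes, and genotype are invented for illustration.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Additive PRS: sum over variants of (effect size x effect-allele dosage).

    dosages:      dict variant_id -> number of effect alleles carried (0, 1 or 2).
    effect_sizes: dict variant_id -> per-allele effect estimate (e.g. a log odds ratio).
    Variants without a genotype for this person are skipped.
    """
    return sum(
        beta * dosages[variant]
        for variant, beta in effect_sizes.items()
        if variant in dosages
    )


# Hypothetical per-allele effects and one hypothetical genotype.
effect_sizes = {"var_A": 0.20, "var_B": -0.10, "var_C": 0.05}
person = {"var_A": 2, "var_B": 1, "var_C": 0}

score = polygenic_risk_score(person, effect_sizes)
print(round(score, 3))
```

In a TWAS-informed setting, the per-variant weights would be derived from or combined with expression-based evidence rather than taken directly from GWAS summary statistics; how this is done varies between methods.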
Multi-omics projects and consortia focused on specific diseases are also noteworthy (Supplementary Table 2). Exploration of disease etiology at the molecular level increasingly requires multi-omics and computational algorithms. For clinical diseases, disease biomarkers play key guiding roles in diagnosis and prognosis, as they have unique advantages in evaluating early, low-level damage and so provide for early warning, prognostic efficacy analysis, and accurate staging and typing. Although many cancer-associated biomarkers have been identified through single omics, multiple omics approaches may provide enhanced benefits by uncovering biomarkers that are shared across different cancer types. Multi-omics also has many applications in the study of neurodegenerative biomarkers and therapeutic targets. In addition, recent years have seen increased acknowledgment of the human microbiome’s crucial role in health and disease, leading to many microbiome-wide association studies (MWAS) exploring the intricate relationship between the human microbiome and health outcomes; these have revealed potential biomarkers, therapeutic targets, and new pathways (95, 96). Similarly to MtWAS, causal links have been found between microbiome data and diseases like heart failure (97) and COVID-19 severity (98) through Mendelian randomization. These comprehensive repositories shaped by multi-omics-driven data will play a pivotal role in unraveling the complex nuances of various diseases, thereby advancing biomedical research toward the realm of precision health.
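The Mendelian randomization analyses mentioned in this section typically combine per-variant ratio estimates into a single causal effect estimate. A minimal sketch of the fixed-effect inverse-variance weighted (IVW) estimator is given below, assuming independent genetic instruments and summary statistics for the variant-exposure and variant-outcome associations. The function name and the numbers are hypothetical, and the cited studies may use other estimators or additional sensitivity analyses.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted Mendelian randomization estimate.

    Each instrument j gives a Wald ratio beta_outcome[j] / beta_exposure[j];
    ratios are averaged with weights beta_exposure[j]**2 / se_outcome[j]**2.
    Returns (causal effect estimate, standard error).
    """
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    std_error = math.sqrt(1.0 / total)
    return estimate, std_error


# Hypothetical summary statistics for three independent instruments.
beta_exposure = [0.10, 0.20, 0.15]   # variant -> exposure (e.g. a metabolite level)
beta_outcome = [0.05, 0.10, 0.075]   # variant -> outcome (e.g. disease log odds)
se_outcome = [0.01, 0.02, 0.015]     # standard errors of the outcome associations

estimate, std_error = ivw_estimate(beta_exposure, beta_outcome, se_outcome)
print(round(estimate, 3), round(std_error, 4))
```

The estimate is only interpretable as causal if the instruments satisfy the usual Mendelian randomization assumptions (relevance, independence from confounders, and no effect on the outcome other than through the exposure).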
4.2. Multi-omics with imaging phenotypes
Medical imaging has vastly improved our understanding of complex diseases, a fact that underscores the importance of transpathology, the combination of molecular imaging and pathology data, in investigating disease mechanisms (99). In transpathology, specific attributes of diseases, such as phenotypic traits, structural abnormalities, and progression patterns, are directly compared to imaging, providing a more comprehensive view of disease-related biological processes. An advantage of this approach is that it provides deeper insights into disease etiology by identifying relationships between molecular imaging and pathology data; disadvantages are that the methods can be computationally intensive and may require prohibitively large datasets to achieve accurate results. Transpathology approaches have been particularly successful in elucidating biological features of neurological diseases, with hundreds of studies connecting brain network activity from imaging studies with the likelihood of developing mental illness at the individual level (100). Recent work has also shown the benefit of integrating imaging data with expression data to learn more about disease. For example, tri-modal transcriptomics, genomics, and imaging data were combined in a federated model to understand the relationships between omics levels and how these biologically important factors relate to AD (101). Similarly, Bao et al. recently integrated genomic data, multiple types of expression data, imaging data from 145 brain regions, and AD status using a co-localization framework to distinguish causal AD pathways (102). Imaging analyses have also been implemented for spatial transcriptomics, enabling a more comprehensive view of molecular processes (103).
4.3. Precision medicine approaches with multi-omics data integration
Precision medicine represents the pinnacle of personalized medicine, in which medical decisions, treatments, and interventions are tailored to each individual based on their unique characteristics (104). Central to this vision is the integration of multi-omics data: joining genomics, transcriptomics, proteomics, metabolomics, and other high-throughput datasets to obtain a holistic view of an individual’s biological makeup. Integrating these vast datasets allows clinicians and researchers to identify complex patterns, making it possible to predict susceptibility to specific diseases, understand disease progression, and create personalized treatment strategies. In essence, multi-omics integration is a key element that will transform the ambitious vision of precision medicine into a tangible reality, ushering in a new era of personalized and predictable healthcare.
With the incorporation of multi-omics data and EHR-derived phenotypic data from large-scale biobanks, numerous studies have been undertaken to stratify or predict patient risk (24). Thompson et al. explored the potential of epigenetic information to enhance phenotype inference in combined biobank-EHR systems by developing a methylation risk scoring model that integrated DNA methylation data and 607 EHR-derived phenotypes from the UCLA health biobank (105) and demonstrated its potential to significantly improve clinical phenotype inferences compared to traditional polygenic risk scores. Talmor-Barkan et al. integrated extensive clinical and multi-omics profiling of 199 patients with acute coronary syndrome (ACS) to understand the multifaceted nature of CAD (106). By integrating serum metabolomics, gut microbiome data, and data from two major Israeli hospitals, the study identified distinct serum and gut microbial signatures in ACS patients compared to controls: the former lacked a previously unidentified bacterial species from the Clostridiaceae family, the absence of which was connected to various circulating metabolites known to increase CAD risk. This finding emphasizes the personalized nature of metabolic deviations in ACS patients and their ties to important clinical factors and cardiovascular outcomes; it also underscores the potential early involvement of metabolic disturbances linked to the microbiome and diet in dysmetabolic phases that predate clinically apparent CAD. Furthermore, using a metabolomics-based model, the authors observed that ACS patients’ predicted body mass index exceeded actual measurements, and these predictions correlated with diabetes mellitus and CAD severity. All told, this study highlights the potential of the serum metabolome in comprehending the diverse risk factors associated with CAD.
In a different study, Parisot et al. utilized graph convolutional networks (GCNs) to introduce a comprehensive framework that integrates both imaging and non-imaging data for brain analysis in large populations (107). In their approach, the population is represented as a sparse graph where nodes correspond to imaging-based feature vectors and edge weights to phenotypic data, including genetic information. Such network-based integration captures the interplay of individual features and their interactions holistically. They tested this method on two broad datasets: ABIDE for predicting autism spectrum disorder and ADNI for predicting conversion to AD. The results showed a remarkable performance improvement over existing techniques, achieving a classification accuracy of 70.4% for ABIDE and 80.0% for ADNI. This highlights the potential of neuroimaging and phenotypic data integration using GCN to improve disease prediction. Mathew et al. undertook a comprehensive immune profiling of hospitalized COVID-19 patients using high-dimensional flow cytometry (108). Their analysis at the single-cell omics level revealed three distinct immunotypes, each associated with differing degrees of disease severity and clinical outcomes, and the longitudinal analysis highlighted stability and fluctuations in patient responses. These findings not only offer a comprehensive map of immune cell responses in COVID-19 but also highlight potential avenues for therapeutic interventions.
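The population-graph idea described above, with subjects as nodes carrying imaging-derived feature vectors and edges weighted by phenotypic similarity, can be sketched in a few lines. This is a simplified illustration, not the published method: the edge weight here is just the number of matching phenotypic attributes, the propagation step is a first-order normalized-adjacency update with no learned weights, and all subjects, attributes, and feature values are hypothetical.

```python
import math


def phenotype_edge_weight(pheno_a, pheno_b):
    """Count the phenotypic attributes on which two subjects agree."""
    return sum(1 for key in pheno_a if pheno_a[key] == pheno_b.get(key))


def build_adjacency(phenotypes):
    """Symmetric adjacency matrix with phenotype-agreement edge weights."""
    n = len(phenotypes)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weight = float(phenotype_edge_weight(phenotypes[i], phenotypes[j]))
            adj[i][j] = adj[j][i] = weight
    return adj


def propagate(adj, features):
    """One graph-convolution propagation step: D^-1/2 (A + I) D^-1/2 X."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    out = []
    for i in range(n):
        row = [0.0] * len(features[0])
        for j in range(n):
            coeff = a_hat[i][j] / math.sqrt(degree[i] * degree[j])
            for k, x in enumerate(features[j]):
                row[k] += coeff * x
        out.append(row)
    return out


# Three hypothetical subjects: non-imaging attributes and imaging-derived features.
phenotypes = [
    {"sex": "F", "site": "A"},
    {"sex": "F", "site": "A"},
    {"sex": "M", "site": "B"},
]
features = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

adjacency = build_adjacency(phenotypes)
smoothed = propagate(adjacency, features)
print(smoothed)
```

After one step, the first two subjects, who share both attributes, have exchanged feature information, while the third subject, who shares none, is left unchanged. A full GCN would stack such steps with learned weight matrices and nonlinearities and train them on labelled nodes.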
5. Future directions of multi-modal omics data analysis
In the previous section, we discussed how the scale of data integration and the utilization of machine learning have revolutionized the landscape of multi-modal omics data analysis, with a focus on advancing precision medicine. However, the rapid and large-scale development of state-of-the-art large language models (LLMs), including GPT-4 (109), PaLM2 (110), and LLaMA2 (111), has introduced artificial intelligence methods with unprecedented and remarkable performance capabilities. These developments anticipate a new paradigm shift in multimodal omics data analysis. Accordingly, this section explores the trends, opportunities, and challenges that will emerge as the field embraces the potential of LLMs and other cutting-edge technologies to unlock the complexity of multimodal omics data for the benefit of precision medicine and biomedical research.
Multimodal LLMs are designed to accommodate inputs of multiple modalities, extending beyond text-based data (Table 2). Notably, ChatGPT has profoundly influenced a number of fields, reshaping our interaction with technology. Beyond the growing public interest in ChatGPT, LLMs have demonstrated remarkable achievements in the biomedical domain, and numerous experimental methodologies applied to multimodal data are emerging that promise to open new avenues of discovery and innovation. In particular, building on established medical LLMs (112–114), several medical multimodal LLMs have been proposed (115, 116). These models are poised to revolutionize medical research and healthcare. One key differentiator between LLMs and traditional machine learning is their ability to engage in interactive, data-driven communication, extract information, and provide insights into the underlying reasons and basis for the model’s results. In essence, as new multimodal omics data are generated and the analyzed knowledge is fed into LLMs, they can infer new insights based on previously accumulated data, further advancing our understanding of complex diseases and enabling more informed healthcare decision-making.
Table 2.
General-purpose and biomedical-specific multi-modal large language models (LLMs)
| Category | Model | Modalities | Foundation models | Model size | LLM finetuning |
|---|---|---|---|---|---|
| General-purpose multi-modal LLM | Flamingo | Language, Vision | Chinchilla, NFNets | 3B, 9B, 80B | No finetuning |
| | LLaVA | Language, Vision | Vicuna/LLaMA2 (111, 121), CLIP ViT (122) | 7B, 13B | LLM finetuning |
| | BLIP2 | Language, Vision | OPT/Flan-T5 (123, 124), CLIP ViT+Q-former (122) | 3B, 7B, 12B | No finetuning |
| | MiniGPT4 | Language, Vision | Vicuna (121), CLIP ViT+Q-former (122) | 7B, 13B | No finetuning |
| | InstructBLIP | Language, Vision | BLIP2 | 7B, 13B | No finetuning |
| | KOSMOS-1 | Language, Vision | Magneto (125), CLIP ViT (122) | 1.6B | LLM finetuning |
| | PaLM-E | Language, Vision | PaLM (126), ViT (127) | 12B, 66B, 84B, 562B | LLM finetuning |
| | mPLUG-OWL | Language, Vision | LLaMA (128), CLIP ViT (122) | 7.2B | LLM finetuning |
| | mPLUG-DocOWL | Language, Vision, Chart, Document | mPLUG-OWL | Unknown | LLM finetuning |
| | GPT-4 | Language, Vision, Chart, Document | Unknown | Unknown | |
| Medical multi-modal LLM | LLaVA-Med | Language, Medical Vision: [X-ray, MRI, Pathology, Gross pathology, CT] | LLaVA | 7B, 13B | No |
| | MedVInT | Language, Medical Vision: [X-ray, MRI, Pathology, Gross pathology, CT] | PMC-CLIP (129), PMC-LLaMA (130) | 7B | Yes |
| | Med-Flamingo | Language, Medical Vision: [X-ray, MRI, Pathology, Gross pathology, CT] | OpenFlamingo (131) | 9B | Yes |
| | Visual Med-Alpaca | Language, Medical Vision: [X-ray, MRI, Pathology, Gross pathology, CT] | | 7B | No |
| | RadFM | Language, Medical Vision: [X-ray, MRI, Pathology, Gross pathology, CT] | PMC-LLaMA (130) | 14B | Yes |
| | PathAsst | Language, Pathology | PLIP (132), Vicuna (121) | 13B | No |
| | Med-PaLM M | Language, Medical Vision: [X-ray, MRI, Pathology, Gross pathology, CT] | Med-PaLM (114) | 12B, 84B, 562B | No |
We briefly describe general-purpose LLMs and biomedical-specific LLMs in Table 2. For example, LLaVA can pinpoint anomalies in unconventional images (Supplementary Figure 1). LLaVA-Med, when presented with a radiology image, does not just decode its features but synthesizes medical knowledge to diagnose (Supplementary Figure 2). As multi-modal LLMs prove capable of such multimodal reasoning, we foresee vast potential in tackling complex challenges like multi-omics analysis. A conceptual framework named Generalist Medical AI (GMAI) (117) has been proposed with the aim of flexibly processing diverse medical modalities from imaging to genomics. Subsequently, Acosta et al. underscored the potential of multimodal AI in harnessing a myriad of biomedical datasets, from expansive biobanks to cost-effective sequencing (118). We likewise anticipate that multimodal medical AI will steer the next era of multi-omics analysis (Figure 2). A recent endeavor tasked ChatGPT with single-cell type prediction (119), which yielded accurate outcomes due to GPT-4’s extensive training on scientific literature enabling it to recognize marker genes and their cellular associations. This capability might be harnessed for more intricate tasks like multi-omics analysis.
Figure 2.

Potential ability of Large Language Models with Multi-modal omics data
Recent endeavors have demonstrated the efficacy of ChatGPT in tasks like single-cell type prediction, with impressive prediction performance (119). This accuracy comes from GPT-4’s comprehensive training on scientific literature, which equips it to identify marker genes and their corresponding cellular associations. There is potential to leverage this capability for multi-omics analysis interpretation. While there is growing interest in employing foundation models for radiomics (120), and several multi-modal LLMs have made strides in this integration (Table 2), they currently fall short in handling molecular-level omics data. For instance, Med GPT-M has been applied to genomic data, yet it remains a proof-of-concept study with preliminary results. As foundation models advance, improved modeling of molecular omics will make researchers more adept at interpreting and integrating vast and diverse multi-omics data, thereby overcoming the current limitations of multi-modal LLMs. We list models trained on datasets such as transcriptomes, DNA sequences, and protein sequences that may underpin future multi-modal LLM development (Supplementary Table 3). While forging a multi-modal omics-centric LLM is a formidable challenge, we believe such harnessing of diverse modalities will redefine future research paradigms.
6. Conclusion
As we navigate the intricate landscape of biological systems, the integrative analysis of multimodal omics data emerges as a cornerstone in decoding the complexities inherent in health and disease. This approach, characterized by the confluence of genomics, transcriptomics, proteomics, metabolomics, and nuanced clinical data, including imaging phenotypes, represents a quantum leap in our collective scientific methodology.
This review encapsulated the multifaceted dimensions of this integrative process, addressing inherent challenges and spotlighting strategic methodologies that transcend traditional data analysis. From concatenation-based to transformation-based and network-based integrations, the scientific community now wields the power to unify the once-disparate data realms, offering fresh perspectives on the molecular underpinnings of pathological states.
The implications of this synthesis are profound, extending beyond theoretical research into the practical sphere of precision medicine. By harnessing the wealth of information within large-scale biobanks and electronic health records, we enhance our capacity for patient risk stratification, disease prognostication, and the customization of therapeutic interventions. Such endeavors are transforming the healthcare paradigm, empowering clinicians to curate treatment modalities reflective of individual biological signatures.
Peering into the future, we recognize the impending influence of advanced computational models, especially large-scale language models capable of multimodal reasoning. These innovative systems promise to redefine problem-solving in biomedical contexts, despite existing obstacles like resource constraints and data accessibility. Nonetheless, through economical and collaborative ventures, the pursuit of comprehensive, tailored solutions for multi-omics analyses is well underway.
In essence, the journey toward a thorough understanding of biological entities, via multimodal omics data integration, is leading biomedical research into uncharted territories. The evolution of machine learning strategies and the advent of sophisticated multimodal platforms are not just enhancing our present capabilities but are also charting the course for future breakthroughs. It is through these prisms that we foresee a revolution in personalized healthcare delivery, anchored by precision medicine’s tenets.
Supplementary Material
Acknowledgments
This work was supported by the National Institutes of Health [R01 GM138597, R01 AG071470 and R01 HL169458].
Disclosure statement
The author is not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.
Literature cited
- 1. Reuter JA, Spacek DV, Snyder MP. 2015. High-throughput sequencing technologies. Mol Cell 58: 586–97
- 2. Beesley LJ, Salvatore M. 2020. The emerging landscape of health research based on biobanks linked to electronic health records: Existing resources, statistical challenges, and potential opportunities. 39: 773–800
- 3. Perakakis N, Yazdani A, Karniadakis GE, Mantzoros C. 2018. Omics, big data and machine learning as tools to propel understanding of biological mechanisms and to discover novel diagnostics and therapeutics. Metabolism 87: A1–A9
- 4. Vasaikar SV, Savage AK. 2023. A comprehensive platform for analyzing longitudinal multi-omics data. 14: 1684.
- 5. Bodein A, Scott-Boyer MP. 2022. Interpretation of network-based integration from multi-omics longitudinal data. 50: e27.
- 6. Campagna MP, Xavier A, Lechner-Scott J, Maltby V, Scott RJ, et al. 2021. Epigenome-wide association studies: current knowledge, strategies and recommendations. 13: 214.
- 7. 2013. The Genotype-Tissue Expression (GTEx) project. Nat Genet 45: 580–5
- 8. Gandal MJ. 2018. Transcriptome-wide isoform-level dysregulation in ASD, schizophrenia, and bipolar disorder. 362
- 9. Huckins LM, Dobbyn A, Ruderfer DM, Hoffman G. 2019. Gene expression imputation across multiple brain regions provides insights into schizophrenia risk. 51: 659–74
- 10. Brandes N, Linial N, Linial M. 2020. PWAS: proteome-wide association study-linking genes and phenotypes by functional variation in proteins. Genome Biol 21: 173.
- 11. Wingo TS, Liu Y, Gerasimov ES. 2021. Brain proteome-wide association study implicates novel proteins in depression pathogenesis. 24: 810–17
- 12. Kojouri M, Pinto R, Mustafa R, Huang J, Gao H, et al. 2023. Metabolome-wide association study on physical activity. Sci Rep 13: 2374.
- 13. Jiang MZ, Aguet F, Ardlie K, Chen J, Cornell E, et al. 2023. Canonical correlation analysis for multi-omics: Application to cross-cohort analysis. 19: e1010517.
- 14. Zhao H, Rasheed H, Nøst TH, Cho Y, Liu Y, et al. 2022. Proteome-wide Mendelian randomization in global biobank meta-analysis reveals multi-ancestry drug targets for common diseases. Cell Genom 2
- 15. Chen R, Mias GI, Li-Pook-Than J, Jiang L, Lam HY, et al. 2012. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 148: 1293–307
- 16. Li-Pook-Than J, Snyder M. 2013. iPOP goes the world: integrated personalized Omics profiling and the road toward improved health care. Chem Biol 20: 660–6
- 17. Karczewski KJ, Snyder MP. 2018. Integrative omics for health and disease. Nat Rev Genet 19: 299–310
- 18. Sveinbjornsson G, Ulfarsson MO, Thorolfsdottir RB, Jonsson BA, Einarsson E, et al. 2022. Multiomics study of nonalcoholic fatty liver disease. 54: 1652–63
- 19. Ritchie SC, Surendran P, Karthikeyan S, Lambert SA, Bolton T, et al. 2023. Quality control and removal of technical variation of NMR metabolic biomarker data in ~120,000 UK Biobank participants. 10: 64.
- 20. Sanna S, van Zuydam NR, Mahajan A. 2019. Causal relationships among the gut microbiome, short-chain fatty acids and metabolic diseases. 51: 600–05
- 21. Moore C, Sambrook J, Walker M, Tolkien Z, Kaptoge S, et al. 2014. The INTERVAL trial to determine whether intervals between blood donations can be safely and acceptably decreased to optimise blood supply: study protocol for a randomised controlled trial. Trials 15: 363.
- 22. Di Angelantonio E, Thompson SG, Kaptoge S, Moore C, Walker M, et al. 2017. Efficiency and safety of varying the frequency of whole blood donation (INTERVAL): a randomised trial of 45 000 donors. Lancet 390: 2360–71
- 23. Gilly A, Suveges D, Kuchenbaecker K, Pollard M. 2018. Cohort-wide deep whole genome sequencing and the allelic architecture of complex traits. 9: 4674.
- 24. Xu Y, Ritchie SC. 2023. An atlas of genetic scores to predict multi-omic traits. 616: 123–31
- 25. Surendran P, Stewart ID, Au Yeung VPW, Pietzner M. 2022. Rare and common genetic determinants of metabolic individuality and their effects on human health. 28: 2321–32
- 26. Zhernakova DV, Deelen P. 2017. Identification of context-dependent expression quantitative trait loci in whole blood. 49: 139–45
- 27. Bonder MJ, Luijk R, Zhernakova DV, Moed M, Deelen P. 2017. Disease variants alter transcription factor levels and methylation of their binding sites. 49: 131–38
- 28. Niehues A, Bizzarri D, Reinders MJT, Slagboom PE, van Gool AJ, et al. 2022. Metabolomic predictors of phenotypic traits can replace and complement measured clinical variables in population-scale expression profiling studies. BMC Genomics 23: 546.
- 29. Sijtsma A, Rienks J. 2022. Cohort Profile Update: Lifelines, a three-generation cohort study and biobank. 51: e295–e302
- 30. Bonder MJ, Kurilshikov A. 2016. The effect of host genetics on the gut microbiome. 48: 1407–12
- 31. van der Plaat DA, de Jong K. 2018. Occupational exposure to pesticides is associated with differential DNA methylation. 75: 427–35
- 32. Leitsalu L, Haller T, Esko T, Tammesoo ML, Alavere H, et al. 2015. Cohort Profile: Estonian Biobank of the Estonian Genome Center, University of Tartu. Int J Epidemiol 44: 1137–47
- 33. Aasmets O, Krigul KL, Lüll K. 2022. Gut metagenome associations with extensive digital health data in a volunteer-based Estonian microbiome cohort. 13: 869.
- 34. Xu F, Yu EY, Cai X, Yue L, Jing LP, et al. 2023. Genome-wide genotype-serum proteome mapping provides insights into the cross-ancestry differences in cardiometabolic disease susceptibility. 14: 896.
- 35. Feng YA, Chen CY, Chen TT, Kuo PH, Hsu YH, et al. 2022. Taiwan Biobank: A rich biomedical research database of the Taiwanese population. Cell Genom 2: 100197.
- 36. Nagai A, Hirata M, Kamatani Y, Muto K, Matsuda K, et al. 2017. Overview of the BioBank Japan Project: Study design and profile. J Epidemiol 27: S2–S8
- 37. Tadaka S, Hishinuma E, Komaki S, Motoike IN, Kawashima J, et al. 2021. jMorp updates in 2020: large enhancement of multi-omics data resources on the general Japanese population. Nucleic Acids Res 49: D536–D44
- 38. Kuriyama S, Yaegashi N, Nagami F, Arai T, Kawaguchi Y, et al. 2016. The Tohoku Medical Megabank Project: Design and Mission. J Epidemiol 26: 493–511
- 39. Kim Y, Han BG. 2017. Cohort Profile: The Korean Genome and Epidemiology Study (KoGES) Consortium. Int J Epidemiol 46: e20.
- 40. Hahn SJ, Kim S, Choi YS, Lee J, Kang J. 2022. Prediction of type 2 diabetes using genome-wide polygenic risk score and metabolic profiles: A machine learning analysis of population-based 10-year prospective cohort study. EBioMedicine 86: 104383.
- 41. Jang HB, Hwang JY, Park JE, Oh JH, Ahn Y, et al. 2014. Intake levels of dietary polyunsaturated fatty acids modify the association between the genetic variation in PCSK5 and HDL cholesterol. J Med Genet 51: 782–8
- 42. Lee W, Lee HJ, Jang HB, Kim HJ, Ban HJ, et al. 2018. Asymmetric dimethylarginine (ADMA) is identified as a potential biomarker of insulin resistance in skeletal muscle. Sci Rep 8: 2133.
- 43. 2019. Childhood Acute Illness and Nutrition (CHAIN) Network: a protocol for a multi-site prospective cohort study to identify modifiable risk factors for mortality among acutely ill children in Africa and Asia. BMJ Open 9: e028454.
- 44. Njunge JM, Tickell K, Diallo AH, Sayeem Bin Shahid ASM, Gazi MA, et al. 2022. The Childhood Acute Illness and Nutrition (CHAIN) network nested case-cohort study protocol: a multi-omics approach to understanding mortality among children in sub-Saharan Africa and South Asia. 6: 77.
- 45. Saw WY, Tantoso E, Begum H, Zhou L, Zou R, et al. 2017. Establishing multiple omics baselines for three Southeast Asian populations in the Singapore Integrative Omics Study. 8: 653.
- 46. Baysoy A, Bai Z. 2023. The technological landscape and applications of single-cell multi-omics. 24: 695–713
- 47. Macaulay IC, Haerty W, Kumar P, Li YI. 2015. G&T-seq: parallel sequencing of single-cell genomes and transcriptomes. 12: 519–22
- 48. Macaulay IC, Ponting CP, Voet T. 2017. Single-Cell Multiomics: Multiple Measurements from Single Cells. Trends Genet 33: 155–68
- 49. Stoeckius M, Hafemeister C. 2017. Simultaneous epitope and transcriptome measurement in single cells. 14: 865–68
- 50. Choi JR, Yong KW. 2020. Single-Cell RNA Sequencing and Its Combination with Protein and DNA Analyses. 9
- 51.Swanson E, Lord C. 2021. Simultaneous trimodal single-cell measurement of transcripts, epitopes, and chromatin accessibility using TEA-seq. 10 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Mimitou EP, Lareau CA. 2021. Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells. 39: 1246–58 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Darwiche R, Struhl K. 2020. Pheno-RNA, a method to associate genes with a specific phenotype, identifies genes linked to cellular transformation. 117: 28925–29 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Reel PS, Reel S, Pearson E, Trucco E, Jefferson E. 2021. Using machine learning approaches for multi-omics data analysis: A review. Biotechnol Adv 49: 107739.
- 55. Vahabi N, Michailidis G. 2022. Unsupervised Multi-Omics Data Integration Methods: A Comprehensive Review. Front Genet 13: 854752.
- 56. Subramanian I, Verma S, Kumar S, Jere A, Anamika K. 2020. Multi-omics Data Integration, Interpretation, and Its Application. 14: 1177932219899051
- 57. Bersanelli M, Mosca E, Remondini D, Giampieri E, Sala C, et al. 2016. Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinformatics 17 Suppl 2: 15.
- 58. Kang M, Ko E. 2022. A roadmap for multi-omics data integration using deep learning. 23
- 59. Leal LG, David A, Jarvelin MR, Sebert S, Männikkö M, et al. 2019. Identification of disease-associated loci using machine learning for genotype and network data integration. Bioinformatics 35: 5182–90
- 60. Wang T, Shao W. 2021. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. 12: 3445.
- 61. Wang C, Lue W, Kaalia R, Kumar P, Rajapakse JC. 2022. Network-based integration of multi-omics data for clinical outcome prediction in neuroblastoma. Sci Rep 12: 15425.
- 62. Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, et al. 2014. Similarity network fusion for aggregating data types on a genomic scale. 11: 333–7
- 63. Skrede OJ, De Raedt S, Kleppe A, Hveem TS, Liestøl K, et al. 2020. Deep learning for prediction of colorectal cancer outcome: a discovery and validation study. Lancet 395: 350–60
- 64. Zhao B, Li T. 2021. Common genetic variation influencing human white matter microstructure. 372
- 65. Zhao B, Luo T, Li T, Li Y, Zhang J, et al. 2019. Genome-wide association analysis of 19,629 individuals identifies variants influencing regional brain volumes and refines their genetic co-architecture with cognitive and mental health traits. 51: 1637–44
- 66. Zhao B, Li T, Smith SM. 2022. Common variants contribute to intrinsic human brain functional networks. 54: 508–17
- 67. Zhao B, Li T. 2023. Heart-brain connections: Phenotypic and genetic insights from magnetic resonance images. 380: abn6598.
- 68. Zhao B. 2023. Eye-brain connections revealed by multimodal retinal and brain imaging genetics in the UK Biobank.
- 69. Alipanahi B, Hormozdiari F, Behsaz B, Cosentino J, McCaw ZR, et al. 2021. Large-scale machine-learning-based phenotyping significantly improves genomic discovery for optic nerve head morphology. Am J Hum Genet 108: 1217–30
- 70. Li Y, Wu X, Yang P, Jiang G, Luo Y. 2022. Machine Learning for Lung Cancer Diagnosis, Treatment, and Prognosis. Genomics Proteomics Bioinformatics 20: 850–66
- 71. Xu M, Calhoun V, Jiang R, Yan W, Sui J. 2021. Brain imaging-based machine learning in autism spectrum disorder: methods and applications. J Neurosci Methods 361: 109271.
- 72. Liang X, Fu Y, Cao WT, Wang Z, Zhang K, et al. 2022. Gut microbiome, cognitive function and brain structure: a multi-omics integration analysis. Transl Neurodegener 11: 49.
- 73. Wen J. 2023. Neuroimaging-AI Endophenotypes of Brain Diseases in the General Population: Towards a Dimensional System of Vulnerability.
- 74. Bodein A, Scott-Boyer MP, Perin O, Lê Cao KA, Droit A. 2022. timeOmics: an R package for longitudinal multi-omics data integration. 38: 577–79
- 75. Metwally AA, Zhang T, Wu S, Kellogg R, Zhou W, et al. 2022. Robust identification of temporal biomarkers in longitudinal omics studies. 38: 3802–11
- 76. Ang J-S, Ng K-W, Chua F-F. 2020. Modeling time series data with deep learning: A review, analysis, evaluation and future trend. Presented at 2020 8th International Conference on Information Technology and Multimedia (ICIMU)
- 77. Choi K, Yi J, Park C, Yoon S. 2021. Deep learning for anomaly detection in time-series data: review, analysis, and guidelines. IEEE Access 9: 120043–65
- 78. LeCun Y, Bengio Y, Hinton G. 2015. Deep learning. Nature 521: 436–44
- 79. Bengio Y, Simard P, Frasconi P. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5: 157–66
- 80. Lee G, Nho K, Kang B, Sohn KA, Kim D. 2019. Predicting Alzheimer’s disease progression using multi-modal deep learning approach. Sci Rep 9: 1952.
- 81. Nguyen M, He T, An L, Alexander DC, Feng J, Yeo BTT. 2020. Predicting Alzheimer’s disease progression using deep recurrent neural networks. Neuroimage 222: 117203.
- 82. Jung W, Jun E, Suk HI. 2021. Deep recurrent model for individualized prediction of Alzheimer’s disease progression. Neuroimage 237: 118143.
- 83. Zhao B, Shan Y, Yang Y. 2021. Transcriptome-wide association analysis of brain structures yields insights into pleiotropy with complex neuropsychiatric traits. 12: 2878.
- 84. Hu X, Qiao D, Kim W, Moll M, Balte PP, et al. 2022. Polygenic transcriptome risk scores for COPD and lung function improve cross-ethnic portability of prediction in the NHLBI TOPMed program. Am J Hum Genet 109: 857–70
- 85. Liang Y, Pividori M, Manichaikul A, Palmer AA, Cox NJ, et al. 2022. Polygenic transcriptome risk scores (PTRS) can improve portability of polygenic risk scores across ancestries. 23: 23.
- 86. Mattheisen M, Grove J. 2022. Identification of shared and differentiating genetic architecture for autism spectrum disorder, attention-deficit hyperactivity disorder and case subgroups. 54: 1470–78
- 87. Benchek P, Igo RP, Jr. 2021. Association between genes regulating neural pathways for quantitative traits of speech and language disorders. 6: 64.
- 88. Li L, Chen Z, von Scheidt M, Li S, Steiner A, et al. 2022. Transcriptome-wide association study of coronary artery disease identifies novel susceptibility genes. Basic Res Cardiol 117: 6.
- 89. Mooney MA, Ryabinin P, Wilmot B. 2020. Large epigenome-wide association study of childhood ADHD identifies peripheral DNA methylation associated with disease and polygenic risk burden. 10: 8.
- 90. Hesam-Shariati S, Overs BJ. 2022. Epigenetic signatures relating to disease-associated genotypic burden in familial risk of bipolar disorder. 12: 310.
- 91. Sekula P, Goek ON, Quaye L, Barrios C, Levey AS, et al. 2016. A Metabolome-Wide Association Study of Kidney Function and Disease in the General Population. J Am Soc Nephrol 27: 1175–88
- 92. Osborn MP, Park Y, Parks MB, Burgess LG, Uppal K, et al. 2013. Metabolome-wide association study of neovascular age-related macular degeneration. PLoS One 8: e72737.
- 93. Dehghan A, Pinto RC, Karaman I, Huang J, Durainayagam BR, et al. 2022. Metabolome-wide association study on ABCA7 indicates a role of ceramide metabolism in Alzheimer’s disease. 119: e2206083119
- 94. Ge A, Sun Y, Kiker T, Zhou Y, Ye K. 2023. A metabolome-wide Mendelian randomization study prioritizes potential causal circulating metabolites for multiple sclerosis. J Neuroimmunol 379: 578105.
- 95. Gilbert JA, Quinn RA, Debelius J, Xu ZZ, Morton J, et al. 2016. Microbiome-wide association studies link dynamic microbial consortia to disease. Nature 535: 94–103
- 96. Perez-Garcia J, Espuela-Ortiz A, Hernández-Pérez JM, González-Pérez R, Poza-Guedes P, et al. 2023. Human genetics influences microbiome composition involved in asthma exacerbations despite inhaled corticosteroid treatment. J Allergy Clin Immunol 152: 799–806.e6
- 97. Dai H, Hou T, Wang Q, Hou Y, Wang T. 2023. Causal relationships between the gut microbiome, blood lipids, and heart failure: a Mendelian randomization analysis. 30: 1274–82
- 98. Li Z, Zhu G, Lei X, Tang L, Kong G, et al. 2023. Genetic support of the causal association between gut microbiome and COVID-19: a bidirectional Mendelian randomization study. Front Immunol 14: 1217615.
- 99. Tian M, He X, Jin C, He X, Wu S, et al. 2021. Transpathology: molecular imaging-based pathology. Eur J Nucl Med Mol Imaging 48: 2338–50
- 100. Rashid B, Calhoun V. 2020. Towards a brain-based predictome of mental illness. Hum Brain Mapp 41: 3468–535
- 101. Wu J, Chen Y, Wang P, Caselli RJ, Thompson PM, et al. 2021. Integrating Transcriptomics, Genomics, and Imaging in Alzheimer’s Disease: A Federated Model. Front Radiol 1: 777030.
- 102. Bao J, Wen J, Wen Z, Yang S, Cui Y, et al. 2023. Brain-wide genome-wide colocalization study for integrating genetics, transcriptomics and brain morphometry in Alzheimer’s disease. Neuroimage 280: 120346.
- 103. Bergenstråhle J, Larsson L, Lundeberg J. 2020. Seamless integration of image and molecular analysis for spatial transcriptomics workflows. 21: 482.
- 104. Johnson KB, Wei WQ, Weeraratne D, Frisse ME, Misulis K, et al. 2021. Precision Medicine, AI, and the Future of Personalized Health Care. Clin Transl Sci 14: 86–93
- 105. Thompson M, Hill BL, Rakocz N, Chiang JN. 2022. Methylation risk scores are associated with a collection of phenotypes within electronic health record systems. 7: 50.
- 106. Talmor-Barkan Y, Bar N, Shaul AA, Shahaf N, Godneva A, Bussi Y. 2022. Metabolomic and microbiome profiling reveals personalized risk factors for coronary artery disease. 28: 295–302
- 107. Parisot S, Ktena SI, Ferrante E, Lee M, Guerrero R, et al. 2018. Disease prediction using graph convolutional networks: Application to Autism Spectrum Disorder and Alzheimer’s disease. Med Image Anal 48: 117–30
- 108. Mathew D, Giles JR. 2020. Deep immune profiling of COVID-19 patients reveals distinct immunotypes with therapeutic implications. 369
- 109. OpenAI. 2023. GPT-4 Technical Report. ArXiv abs/2303.08774
- 110. Anil R, Dai AM, Firat O, Johnson M, Lepikhin D, et al. 2023. PaLM 2 Technical Report. ArXiv abs/2305.10403
- 111. Touvron H, Martin L, Stone KR, Albert P, Almahairi A, et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. ArXiv abs/2307.09288
- 112. Li Y, Li Z, Zhang K, Dan R, Jiang S, Zhang Y. 2023. ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Cureus 15
- 113. Toma A, Lawler PR, Ba J, Krishnan RG, Rubin BB, Wang B. 2023. Clinical Camel: An Open Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding
- 114. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, et al. 2023. Large language models encode clinical knowledge. Nature 620: 172–80
- 115. Li C, Wong C, Zhang S, Usuyama N, Liu H, et al. 2023. LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. ArXiv abs/2306.00890
- 116. Tu T, Azizi S, Driess D, Schaekermann M, Amin M, et al. 2023. Towards Generalist Biomedical AI. ArXiv abs/2307.14334
- 117. Moor M, Banerjee O, Abad ZSH, Krumholz HM, Leskovec J, et al. 2023. Foundation models for generalist medical artificial intelligence. Nature 616: 259–65
- 118. Acosta JN, Falcone GJ, Rajpurkar P, Topol EJ. 2022. Multimodal biomedical AI. Nature Medicine 28: 1773–84
- 119. Hou W, Ji Z. 2023. Reference-free and cost-effective automated cell type annotation with GPT-4 in single-cell RNA-seq analysis. bioRxiv
- 120. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, et al. 2017. A survey on deep learning in medical image analysis. Med Image Anal 42: 60–88
- 121. Zheng L, Chiang W-L, Sheng Y, Zhuang S, Wu Z, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. ArXiv abs/2306.05685
- 122. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, et al. 2021. Learning Transferable Visual Models From Natural Language Supervision. Presented at International Conference on Machine Learning
- 123. Zhang S, Roller S, Goyal N, Artetxe M, Chen M, et al. 2022. OPT: Open Pre-trained Transformer Language Models. ArXiv abs/2205.01068
- 124. Chung HW, Hou L, Longpre S, Zoph B, Tay Y, et al. 2022. Scaling Instruction-Finetuned Language Models. ArXiv abs/2210.11416
- 125. Wang H, Ma S, Huang S, Dong L, Wang W, et al. 2022. Foundation Transformers. ArXiv abs/2210.06423
- 126. Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, et al. 2022. PaLM: Scaling Language Modeling with Pathways. ArXiv abs/2204.02311
- 127. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, et al. 2020. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. ArXiv abs/2010.11929
- 128. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M-A, et al. 2023. LLaMA: Open and Efficient Foundation Language Models. ArXiv abs/2302.13971
- 129. Lin W, Zhao Z, Zhang X, Wu C, Zhang Y, et al. 2023. PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents. Presented at International Conference on Medical Image Computing and Computer-Assisted Intervention
- 130. Wu C, Zhang X, Zhang Y, Wang Y, Xie W. 2023. PMC-LLaMA: Towards Building Open-source Language Models for Medicine
- 131. Awadalla A, Gao I, Gardner J, Hessel J, Hanafy Y, et al. 2023. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models. ArXiv abs/2308.01390
- 132. Huang Z, Bianchi F, Yuksekgonul M. 2023. A visual-language foundation model for pathology image analysis using medical Twitter. 29: 2307–16