Multi-View Variational Autoencoder for Missing Value Imputation in Untargeted Metabolomics

Chen Zhao; Kuan-Jui Su; Chong Wu; Xuewei Cao; Qiuying Sha; Wu Li; Zhe Luo; Tian Qin; Chuan Qiu; Lan Juan Zhao; Anqi Liu; Lindong Jiang; Xiao Zhang; Hui Shen; Weihua Zhou; Hong-Wen Deng

This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2024 Mar 12:arXiv:2310.07990v2. Originally published 2023 Oct 12. [Version 2]

PMCID: PMC10593076 PMID: 37873011

Abstract

Background:

Missing data is a common challenge in mass spectrometry-based metabolomics, which can lead to biased and incomplete analyses. The integration of whole-genome sequencing (WGS) data with metabolomics data has emerged as a promising approach to enhance the accuracy of data imputation in metabolomics studies.

Method:

In this study, we propose a novel method that leverages the information from WGS data and reference metabolites to impute unknown metabolites. Our approach utilizes a multi-view variational autoencoder to jointly model the burden score, polygenic risk score (PGS), and linkage disequilibrium (LD) pruned single nucleotide polymorphisms (SNPs) for feature extraction and missing metabolomics data imputation. By learning the latent representations of both omics data, our method can effectively impute missing metabolomics values based on genomic information.

Results:

We evaluate the performance of our method on empirical metabolomics datasets with missing values and demonstrate its superiority compared to conventional imputation techniques. Using burden scores derived from 35 template metabolites, PGS, and LD-pruned SNPs, the proposed method achieved $R^{2}$-scores > 0.01 for 71.55% of metabolites.

Conclusion:

The integration of WGS data in metabolomics imputation not only improves data completeness but also enhances downstream analyses, paving the way for more comprehensive and accurate investigations of metabolic pathways and disease associations. Our findings offer valuable insights into the potential benefits of utilizing WGS data for metabolomics data imputation and underscore the importance of leveraging multi-modal data integration in precision medicine research.

Keywords: Metabolomics, whole genome sequencing, imputation, multi-view, variational autoencoder

1. Introduction

Metabolomics is a scientific field that involves the systematic identification and quantification of a broad spectrum of small molecule metabolites present in biological samples, such as cells, tissue, and biological fluids [1]. Mass spectrometry (MS) is a significant high-throughput analytical technique utilized for profiling small molecular compounds, including metabolites, in biological samples [2,3]. Missing values are common in MS-based metabolomic data and are challenging to handle [4,5], which can bias downstream analysis [6]. For downstream analysis using metabolomics data, a complete dataset is preferred and often required.

Many machine learning methods have been applied to impute within-omics metabolomics, such as k-nearest neighbors (KNN) imputation [7] and random forest regression (RF) [8]. However, existing within-omics imputation suffers from low accuracy in empirical practice. In addressing this limitation, the practicality of cross-omics based imputation becomes evident. The development of high-throughput omics technologies has revolutionized our ability to study biological systems at a molecular level [9]. These high-throughput techniques, including genomics, transcriptomics, proteomics, and epigenomics, allow us to profile the genetic expression and interaction of molecules from different biological perspectives [10]. In a recent comprehensive analysis using whole-genome sequencing (WGS), it was shown that blood metabolites display a high degree of heritability and consistency [11]. Discovering how genetic variants impact metabolites can provide valuable insights into the molecular mechanisms that influence the development of diseases. This positioning of metabolites along the pathway between genetic determinants and various health outcomes is significant [12]. Integrating these two disparate datasets has the potential to unlock invaluable information, facilitating a deeper understanding of missing value recovery and imputation. Using WGS data as a reference to perform cross-omics imputation for metabolomics data has garnered significant attention [13] for its ability to leverage genetic information in predicting metabolite abundances.

In this study, we propose a novel multi-view variational autoencoder (MVAE) framework for imputing missing values in metabolomics data, leveraging genetic information from WGS data. The workflow of the proposed approach is shown in Figure 1. Our method integrates multiple features, including burden scores from template metabolites, polygenic risk scores (PGS), and linkage disequilibrium (LD)-pruned single nucleotide polymorphisms (SNPs), for comprehensive feature extraction. By fusing information from both WGS and template metabolomics data, our approach achieves cross-omics imputation, enabling a more holistic understanding of the metabolic landscape.

Figure 1. The architecture of the proposed MVAE for metabolomics data imputation using burden score, PGS, and LD-pruned SNPs. MLP: multi-layer perceptron; PoE: product of experts.

2. Materials and methods

2.1. Enrolled subjects

The studied cohort was acquired from the Louisiana Osteoporosis Study (LOS) [14,15]. The LOS cohort is an ongoing research dataset (>17,000 subjects accumulated so far, with recruitment starting in 2011) aimed at investigating both environmental and genetic risk factors for osteoporosis and other musculoskeletal diseases [16,17]. All participants signed an informed-consent document before any data collection, and the study was approved by the Tulane University Institutional Review Board. A total of 1,110 subjects with both WGS and metabolomics data were enrolled. The demographic information is shown in Table 1.

Table 1. Demographic and Physical Characteristics of Participants (N=1,110)

Metric	Overall	Female (by sex)	Male (by sex)	African American (by race)	White (by race)
Number of participants (n)	1,110	126	984	418	692
Sex = Male, n (%)	984 (88.6)	-	-	387 (92.6)	597 (86.3)
Race = White, n (%)	692 (62.3)	95 (75.4)	597 (60.7)	-	-
Exercise = TRUE, n (%)	823 (74.1)	87 (69.0)	736 (74.8)	279 (66.7)	544 (78.6)
Age (years), mean (SD)	38.80 (10.87)	51.99 (14.07)	37.11 (9.10)	40.13 (9.26)	38.00 (11.66)
Height (cm), mean (SD)	173.87 (7.78)	163.45 (5.96)	175.21 (6.93)	173.99 (7.53)	173.80 (7.93)
Weight (kg), mean (SD)	81.72 (17.03)	72.57 (17.45)	82.89 (16.62)	82.69 (17.45)	81.13 (16.76)

Note: SD = Standard Deviation.

The detailed procedure for WGS has been described elsewhere [18]. Briefly, WGS of human peripheral blood DNA was performed at an average read depth of 22× using a BGISEQ-500 sequencer (BGI Americas Corporation, Cambridge, MA, USA) with 350 bp paired-end reads [17]. The cleaned WGS reads were mapped to the human reference genome (GRCh38/hg38) using the Burrows-Wheeler Aligner software [19], following the recommended best practices for variant analysis with the Genome Analysis Toolkit (GATK) to ensure accurate variant calling. Genomic variations were detected by the HaplotypeCaller of GATK, and the variant quality score recalibration method was applied to obtain high-confidence variant calls [20].

This study employed the liquid chromatography-mass spectrometry (LC-MS) metabolomics platform developed by Metabolon, Inc. (Durham, NC, USA). Samples were stored at −80°C until analysis. All samples were prepared according to the manufacturer’s protocol using the automated MicroLab STAR® system (Hamilton, USA). Proteins were precipitated using methanol under vigorous shaking for 2 min (Glen Mills GenoGrinder 2000), followed by centrifugation to recover chemically diverse metabolites. The extracts were then used as input to Waters ACQUITY ultra-performance liquid chromatography (UPLC) and a Thermo Scientific Q-Exactive high resolution/accurate MS interfaced with a heated electrospray ionization (HESI-II) source and an Orbitrap mass analyzer operated at 35,000 mass resolution for positive and negative electrospray ionization. The process details have been described in prior studies [21,22]. We implemented rigorous quality control measures, including the use of a pooled matrix sample as a technical replicate, extracted water samples as process blanks, and the addition of a carefully selected QC standards cocktail to each sample. These measures ensured instrument performance monitoring, aided chromatographic alignment, and minimized interference. Instrument and process variability were assessed through median relative standard deviation calculations for all endogenous metabolites present in 100% of the pooled matrix samples. Furthermore, we randomized the experimental samples to reduce bias and improve data reliability.

2.2. Data processing

WGS data processing.

There were a total of 10,623,292 SNPs in the cohort with 1110 subjects. For quality control, we removed genetic variants with missing rates larger than 5% and Hardy-Weinberg equilibrium exact test p-values less than 10⁻⁴. Due to evolutionary dynamics, certain SNPs frequently exhibit variations in a population (referred to as “common” variants), while other SNPs remain identical in the vast majority of the population, with only a few individuals showing mutations (referred to as “rare” variants) - resulting in a form of class imbalance. In this study, we used the minor allele frequency (MAF) of 5% as the cut-off threshold to determine the common and rare variants in our following analyses. Polygenic risk scores, burden scores, and raw SNPs represent distinct genetic modalities that collectively provide a comprehensive view of an individual’s genetic predisposition, each contributing unique insights. Thus, we explored three different methods to encode the genetic modalities, including PGS, burden scores, and LD-pruned SNPs.
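The missing-rate filter and the common/rare split described above can be sketched as follows. This is a minimal illustration, assuming genotypes are coded as allele counts (0, 1, 2) with `np.nan` for missing calls. The function name `qc_and_split` and the toy matrix are hypothetical, the Hardy-Weinberg exact test is omitted for brevity, and treating a MAF of exactly 5% as common is an assumption that the text does not specify.

```python
import numpy as np

def qc_and_split(genotypes, max_missing=0.05, maf_cutoff=0.05):
    """Return (common, rare) column indices of variants passing the missing-rate filter."""
    missing_rate = np.isnan(genotypes).mean(axis=0)
    passed = missing_rate <= max_missing
    # Frequency of the counted allele among non-missing calls (two alleles per subject).
    allele_freq = np.nanmean(genotypes, axis=0) / 2.0
    maf = np.minimum(allele_freq, 1.0 - allele_freq)
    common = np.flatnonzero(passed & (maf >= maf_cutoff))
    rare = np.flatnonzero(passed & (maf < maf_cutoff))
    return common, rare

# Toy data: 20 subjects, 3 variants.
G = np.zeros((20, 3))
G[:, 0] = 1.0        # heterozygous everywhere, MAF = 0.5 (common)
G[0, 1] = 1.0        # a single minor allele, MAF = 1/40 (rare)
G[:2, 2] = np.nan    # 10% missing calls, fails the 5% missing-rate filter
common_idx, rare_idx = qc_and_split(G)
```

In this toy example the first variant is classified as common, the second as rare, and the third is discarded by the missing-rate filter.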

The polygenic score (PGS) is a quantitative measure of an individual’s genetic risk for a specific trait or disease [23]. It is calculated as the weighted sum of the genetic variants, where the weights are based on their effect sizes on the trait of interest. The PGS represents the combined genetic risk across common or less common variants [24]. Since PGS offers several clinical benefits, including disease risk prediction, diagnosis, and prognosis [25], we employed the “pgsc_calc” workflow from the PGS Catalog to compute PGS using our in-house WGS data. Additionally, we associated our genetic variations with 3,335 predictive traits and diseases, introducing them as new input data [24]. The top 512 PGS with the highest variance were selected as the PGS features for each subject.
LD pruned SNPs. LD pruning is a method used to remove redundant genetic variants from a dataset to reduce the effects of LD [26]. The pruning method scans pairs of correlated SNPs and keeps the one with the higher MAF. LD pruning was performed on the common variants using a window size of 50 with a step size of 5 and a pairwise R² threshold of 0.5. After pruning, 266,240 SNPs were retained, and the top 1,024 SNPs with the highest variance were used as the SNP features.
Burden score. We considered rare variants separately using a widely used burden score [27]. Briefly, for each metabolite, we first regressed metabolite abundance level on the first two genetic principal components to adjust potential population stratification. Second, the burden score was calculated as the summation of each metabolite residual multiplied by the allele count of each rare variant across the genome individually. The top 512 burden score features with the highest variance were selected as a new genetic modality. Burden scores are particularly effective in aggregating rare genetic variants within a specific genomic region or gene. Instead of analyzing each rare variant individually, which might require a large sample size to detect associations, burden scores group these variants together based on their collective impact, which effectively improves the performance of the metabolomics data imputation.
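Two operations recur in the feature construction above: a weighted sum of allele counts (the core of a PGS) and the selection of the highest-variance features (512 PGS, 1,024 SNPs, 512 burden scores). The sketch below illustrates both in NumPy under stated assumptions. The function names, the random toy data, and the effect-size matrix are hypothetical, and the actual `pgsc_calc` workflow does considerably more than a matrix product.

```python
import numpy as np

def polygenic_scores(genotypes, effect_sizes):
    """Weighted sum of allele counts.

    genotypes: (n_subjects, n_variants) allele counts (0, 1, 2).
    effect_sizes: (n_variants, n_traits) per-variant weights.
    """
    return genotypes @ effect_sizes

def top_variance_features(features, k):
    """Indices (in ascending order) of the k columns with the largest variance."""
    variances = features.var(axis=0)
    keep = np.argsort(variances)[::-1][:k]
    return np.sort(keep)

# Toy data standing in for real genotypes and published effect sizes.
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(100, 50)).astype(float)
beta = rng.normal(size=(50, 8))
pgs = polygenic_scores(G, beta)                  # shape (100, 8)
pgs_selected = pgs[:, top_variance_features(pgs, k=4)]

# Small deterministic check: column variances are 0.25, 25.0, and 6.25.
demo_idx = top_variance_features(np.array([[0.0, 0.0, 0.0], [1.0, 10.0, 5.0]]), k=2)
```

The same `top_variance_features` helper would apply unchanged to the SNP and burden-score matrices, with `k` set to 1,024 or 512.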

Note that the burden score was calculated with the involvement of metabolites. To avoid using output values as input in our method, we created a new template metabolite that utilizes highly correlated metabolites as a new dependent variable and prior knowledge for model training and testing. In detail, we employ a template set containing $M$ metabolites as the template metabolites. During the model training, the Pearson correlation between a given predicted metabolite and each metabolite in the template set among the enrolled subjects was calculated. The metabolite in the template set with the highest correlation was selected as the template metabolite to calculate the burden score with the selected rare variants.
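The template selection step can be sketched as below: compute the Pearson correlation between the target metabolite and every template metabolite across the training subjects, then pick the template with the highest value. The function name `select_template` and the toy arrays are illustrative. Using the signed rather than the absolute correlation is an assumption, and the sketch assumes no template column is constant.

```python
import numpy as np

def select_template(target, templates):
    """Pick the template metabolite most correlated with the target.

    target: (n_subjects,) abundances of the metabolite to impute.
    templates: (n_subjects, M) abundances of the M template metabolites.
    Returns the index of the best template and all M Pearson correlations.
    """
    t = target - target.mean()
    c = templates - templates.mean(axis=0)
    denom = np.sqrt((t ** 2).sum() * (c ** 2).sum(axis=0))
    r = (c.T @ t) / denom
    return int(np.argmax(r)), r

target = np.array([1.0, 2.0, 3.0, 4.0])
templates = np.column_stack([
    [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with the target
    [1.0, 2.0, 3.0, 5.0],   # strongly positively correlated
    [1.0, 0.0, 1.0, 0.0],   # weakly correlated
])
best_idx, correlations = select_template(target, templates)
```

Here the second template (index 1) is selected, and its burden scores would then serve as the burden-score view for this target metabolite.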

As a result, three views were obtained to characterize the genetic information, including the PGS scores, LD pruned SNPs, and burden scores.
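For the LD-pruned view, a simplified greedy version of windowed pairwise pruning is sketched below. It assumes the window and step are measured in numbers of variants, uses the squared Pearson correlation of allele counts as the pairwise R², and drops the SNP with the lower MAF from each correlated pair. It is an illustration of the idea, not the exact algorithm of any particular tool, and `ld_prune` and the toy genotypes are hypothetical.

```python
import numpy as np

def ld_prune(genotypes, window=50, step=5, r2_threshold=0.5):
    """Greedy windowed LD pruning; returns indices of retained SNPs.

    genotypes: (n_subjects, n_snps) allele counts without missing values;
    every SNP is assumed to be polymorphic.
    """
    n_snps = genotypes.shape[1]
    freq = genotypes.mean(axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)
    keep = np.ones(n_snps, dtype=bool)
    for start in range(0, n_snps, step):
        idx = [j for j in range(start, min(start + window, n_snps)) if keep[j]]
        if len(idx) < 2:
            continue
        r2 = np.corrcoef(genotypes[:, idx], rowvar=False) ** 2
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                if keep[idx[a]] and keep[idx[b]] and r2[a, b] > r2_threshold:
                    # Drop the SNP with the lower MAF (the later one on ties).
                    drop = idx[a] if maf[idx[a]] < maf[idx[b]] else idx[b]
                    keep[drop] = False
    return np.flatnonzero(keep)

# Toy data: SNP 1 is in LD with SNP 0 (r^2 = 0.75) and has the lower MAF.
G = np.array([
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [2.0, 2.0, 1.0],
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0],
    [2.0, 2.0, 1.0],
])
retained = ld_prune(G)
```

In the toy example SNP 1 is removed, while SNPs 0 and 2, which are uncorrelated, are retained.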

Metabolomics data processing.

In the metabolomics profile, we identified 1,839 metabolites. As a proof of concept for the developed model, we selected the 497 metabolites with a missing rate < 5%. To validate and benchmark the proposed method, missing values were excluded from the experiments, so that for each metabolite only the subjects with an observed value were included.

2.3. MVAE

To enhance the clarity of notation, especially regarding vectors, scalars, and their impact on mathematically derived relations, we use italic bolded font for vector, bolded for matrix, and italic for scalar.

Before introducing the multi-view variational autoencoder, we first introduce the variational autoencoder (VAE). The VAE, proposed by Kingma et al. [28], is a latent variable generative model that learns a deep representation of the input data. The goal of the VAE is to maximize the marginal likelihood of the data (a.k.a. the evidence), which can be decomposed into a sum over the marginal log-likelihoods of the individual subjects, as illustrated in Eq. 1.

$$\log p_{\theta}(X^{(i)}) = D_{KL}\left(q_{\phi}(Z \mid X^{(i)}) \,\|\, p_{\theta}(Z \mid X^{(i)})\right) + \mathcal{L}(\theta, \phi; X^{(i)}) \tag{1}$$

where $X^{(i)}$ is the feature vector for the $i$-th subject in the dataset $\{X^{(i)}\}_{i=1}^{N}$, $N$ is the number of subjects, $Z$ is a random variable in the latent space, $q_{\phi}$ is the approximate posterior of $Z$ with learnable parameters $\phi$, $p_{\theta}(Z \mid X^{(i)})$ is the true posterior of $Z$ under the generative parameters $\theta$, which is intractable, and $D_{KL}(\cdot \,\|\, \cdot)$ represents the Kullback-Leibler (KL) divergence between the approximate posterior and the true posterior. Because the KL divergence is non-negative, the log-likelihood satisfies $\log p_{\theta}(X^{(i)}) \geq \mathcal{L}(\theta, \phi; X^{(i)})$. If the approximate posterior $q_{\phi}(Z \mid X^{(i)})$ is identical to the true posterior $p_{\theta}(Z \mid X^{(i)})$, then $\log p_{\theta}(X^{(i)}) = \mathcal{L}(\theta, \phi; X^{(i)})$. Therefore, $\mathcal{L}(\theta, \phi; X^{(i)})$ is called the evidence lower bound (ELBO), which is defined by Eq. 2.

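$$\begin{aligned}
\mathcal{L}(\theta, \phi; X^{(i)}) &= \log p_{\theta}(X^{(i)}) - D_{KL}\left(q_{\phi}(Z \mid X^{(i)}) \,\|\, p_{\theta}(Z \mid X^{(i)})\right) \\
&= \mathbb{E}_{q_{\phi}(Z \mid X^{(i)})}\left[\log p_{\theta}(X^{(i)} \mid Z)\right] - D_{KL}\left(q_{\phi}(Z \mid X^{(i)}) \,\|\, p_{\theta}(Z)\right)
\end{aligned} \tag{2}$$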

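Because $\log p_{\theta}(X^{(i)})$ does not depend on $\phi$, minimizing the KL divergence between the approximate and true posteriors in Eq. 1 is equivalent to maximizing the ELBO. To train the model explicitly and implement the loss function in closed form, we parameterize $q_{\phi}$ as a multivariate normal (Gaussian) distribution with a diagonal variance-covariance matrix and take the prior over the latent variable to be the standard normal, $p_{\theta}(Z) = \mathcal{N}(0, I)$. The KL divergence between $q_{\phi}$ and this prior then has the analytical solution shown in Eq. 3.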

$$D_{KL}\left(q_{\phi}(Z \mid X^{(i)}) \,\|\, p_{\theta}(Z)\right) = \frac{1}{2} \sum_{d=1}^{D} \left( (\mu_{d}^{(i)})^{2} + (\sigma_{d}^{(i)})^{2} - \log (\sigma_{d}^{(i)})^{2} - 1 \right) \tag{3}$$

where $D$ is the number of latent variables extracted by the VAE, and $\mu_{d}^{(i)}$ and $(\sigma_{d}^{(i)})^{2}$ are the approximate mean and variance of the posterior distribution of the $d$-th latent variable for the $i$-th subject.
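The closed form in Eq. 3 holds when the reference distribution is the standard normal $\mathcal{N}(0, I)$ and the approximate posterior is a diagonal Gaussian. A minimal NumPy sketch under those assumptions is given below. The function name is illustrative, and the variance is passed as its logarithm, mirroring the common practice of having the encoder output log-variances.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the last axis (Eq. 3)."""
    var = np.exp(log_var)
    return 0.5 * np.sum(mu ** 2 + var - log_var - 1.0, axis=-1)

# The divergence vanishes when the posterior equals the prior ...
kl_same = kl_to_standard_normal(np.zeros(3), np.zeros(3))
# ... and equals mu^2 / 2 per dimension when only the mean is shifted.
kl_shifted = kl_to_standard_normal(np.array([1.0, 0.0]), np.zeros(2))
```

With these inputs `kl_same` is 0 and `kl_shifted` is 0.5, matching Eq. 3 term by term.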

We extend the VAE from a single-view input to a multi-view input for metabolomics data imputation. Because the product of Gaussian densities is itself proportional to a Gaussian density [29,30], we apply a product of experts (PoE) to generate the common latent space, so that the variational inference retains an analytical solution. Suppose that under the multi-view setting we have data in $V$ views, i.e., $X_{1}, X_{2}, \dots, X_{V}$. For the data in the $v$-th view $(v \in \{1, \dots, V\})$, a nonlinear function implemented by a neural network is employed as the encoder, denoted as $q_{\phi_{v}}(Z_{v} \mid X_{v}^{(i)})$, where $\phi_{v}$ represents the learnable parameters of the nonlinear function for the $v$-th view. For each encoder, we estimate the mean vector and the variance-covariance matrix of the multivariate Gaussian approximate posterior, denoted as $\mu_{v}^{(i)}$ and $\Sigma_{v}^{(i)}$ for the $i$-th subject, where $\mu_{v}^{(i)} \in \mathbb{R}^{D}$ is a vector and $\Sigma_{v}^{(i)} \in \mathbb{R}^{D \times D}$ is a diagonal matrix. In our implementation, we employ multi-layer perceptrons (MLPs) as the encoders. To guarantee that the variances are positive, the output of the MLP is treated as $\log \Sigma_{v}^{(i)}$ and then converted to $\Sigma_{v}^{(i)}$ using the exponential function. Formally, the encoder is defined in Eq. 4.

$$\begin{aligned}
q_{\phi_{v}}(Z_{v} \mid X_{v}^{(i)}) &= \mathcal{N}(\mu_{v}^{(i)}, \Sigma_{v}^{(i)}) = \frac{1}{(2\pi)^{D/2} \sqrt{|\Sigma_{v}^{(i)}|}} \exp\left(-\frac{1}{2} (Z_{v} - \mu_{v}^{(i)})^{T} (\Sigma_{v}^{(i)})^{-1} (Z_{v} - \mu_{v}^{(i)})\right), \\
\mu_{v}^{(i)} &= \mathrm{MLP}_{v}^{\mu}(X_{v}^{(i)}), \\
\Sigma_{v}^{(i)} &= \exp\left(\mathrm{MLP}_{v}^{\Sigma}(X_{v}^{(i)})\right)
\end{aligned} \tag{4}$$

where $Z_{v} \in \mathbb{R}^{D}$ is the $D$-dimensional latent variable extracted from the $v$-th view, and $\mathrm{MLP}_{v}^{\mu}$ and $\mathrm{MLP}_{v}^{\Sigma}$ are the neural networks that compute the mean and the log-covariance of the Gaussian distribution, respectively. Let $T_{v}^{(i)} = (\Sigma_{v}^{(i)})^{-1}$ denote the precision matrix; then the multivariate Gaussian distribution for the $v$-th view is rewritten as Eq. 5.

$$q_{\phi_{v}}(Z_{v} \mid X_{v}^{(i)}) = \exp\left(-\frac{1}{2} Z_{v}^{T} T_{v}^{(i)} Z_{v} + (\mu_{v}^{(i)})^{T} T_{v}^{(i)} Z_{v} + \Delta_{v}^{(i)}\right) \tag{5}$$

where $\Delta_{v}^{(i)} = -\frac{1}{2} (\mu_{v}^{(i)})^{T} T_{v}^{(i)} \mu_{v}^{(i)} - \frac{D}{2} \log 2\pi + \frac{1}{2} \log |T_{v}^{(i)}|$ collects the terms of the Gaussian density, including its normalizing constant, that do not involve $Z_{v}$. A PoE models the target posterior distribution of the common latent variable from multiple views as the product of the individual posterior distributions of the latent variable from each single view. Since $\Delta_{v}^{(i)}$ does not depend on the latent variable $Z_{v}$, it was treated as a constant in the following analysis. As a result, the PoE generated the common latent variable $Z$, which was defined in Eq. 6.

$$q_{\phi}(Z \mid X_{1}^{(i)}, \dots, X_{V}^{(i)}) \propto \prod_{v=1}^{V} q_{\phi_{v}}(Z \mid X_{v}^{(i)}) \tag{6}$$

Eq. 6 indicated that the multivariate Gaussian distribution of the common latent variable was defined by the product of the multivariate Gaussian distributions of the latent variables extracted from the $V$ views. Accordingly, the approximate posterior distribution of the common latent variable $Z$ was derived as shown in Eq. 7.

$$\begin{aligned}
q_{\phi}(Z \mid X_{1}^{(i)}, \dots, X_{V}^{(i)}) &= \mathcal{N}(\mu_{z}^{(i)}, \Sigma_{z}^{(i)}), \\
\mu_{z}^{(i)} &= \left(\sum_{v=1}^{V} T_{v}^{(i)}\right)^{-1} \left(\sum_{v=1}^{V} T_{v}^{(i)} \mu_{v}^{(i)}\right), \\
\Sigma_{z}^{(i)} &= \left(\sum_{v=1}^{V} T_{v}^{(i)}\right)^{-1}
\end{aligned} \tag{7}$$

where $\mu_{z}^{(i)}$ and $\Sigma_{z}^{(i)}$ were the mean vector and variance-covariance matrix of the approximate posterior distribution of the common latent variable for the $i$-th subject. To allow gradients to propagate through the sampling step, we adopted the reparameterization trick [28], which expresses a sample from the multivariate Gaussian distribution as a deterministic function of its mean vector, its diagonal variance-covariance matrix, and an independent noise variable, as shown in Eq. 8.

$$Z^{(i)} = \mu_{z}^{(i)} + (\Sigma_{z}^{(i)})^{1/2} \odot \epsilon_{z} \tag{8}$$

where $\epsilon_{z} \sim \mathcal{N}(0, I)$ and $\odot$ indicates the element-wise product. Similar to the architecture of the encoder, we employed MLPs as the decoders to restore the integrated features, denoted as $\mathrm{MLP}_{v}^{dec}$ for the $v$-th view. Formally, the reconstructed features for the $v$-th view were denoted as $\hat{X}_{v}^{(i)}$, and the reconstruction was defined in Eq. 9.

$$\hat{X}_{v}^{(i)} = \mathrm{MLP}_{v}^{dec}(Z^{(i)}) \tag{9}$$
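The fusion and sampling steps of Eqs. 7 and 8 reduce to a few array operations when every covariance matrix is diagonal. The NumPy sketch below is an illustration under that assumption and is not the authors' TensorFlow code. The function names and the `(V, N, D)` array layout are hypothetical, and the per-view means and log-variances are taken as given rather than produced by encoders.

```python
import numpy as np

def product_of_experts(mus, log_vars):
    """Fuse V diagonal Gaussians into one (Eq. 7).

    mus, log_vars: arrays of shape (V, N, D) holding the per-view means and
    log-variances for N subjects and D latent dimensions.
    Returns the fused mean and variance, each of shape (N, D).
    """
    precisions = np.exp(-log_vars)                  # T_v = Sigma_v^{-1} (diagonal)
    fused_var = 1.0 / precisions.sum(axis=0)        # Sigma_z = (sum_v T_v)^{-1}
    fused_mu = fused_var * (precisions * mus).sum(axis=0)
    return fused_mu, fused_var

def reparameterize(mu, var, rng):
    """Draw Z = mu + sqrt(var) * eps with eps ~ N(0, I) (Eq. 8)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.sqrt(var) * eps

# Two views, one subject, one latent dimension: N(0, 1) and N(4, 3).
mus = np.array([[[0.0]], [[4.0]]])
log_vars = np.log(np.array([[[1.0]], [[3.0]]]))
mu_z, var_z = product_of_experts(mus, log_vars)
z = reparameterize(mu_z, var_z, np.random.default_rng(0))
```

In the toy example the fused mean is 1.0 and the fused variance is 0.75. The more confident view (variance 1) pulls the mean toward its own estimate, which is the behavior the precision weighting in Eq. 7 is meant to provide.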

2.4. Loss function and model training

Twenty percent of the subjects were randomly chosen as the test set, and the remaining subjects were used as the training set. Since the product of Gaussian distributions is another Gaussian distribution, we employed the evidence lower bound (ELBO) designed for the variational autoencoder, in its explicit form, as the objective function to optimize the neural network, as shown in Eq. 10.

$$\begin{aligned}
\mathcal{L}(\theta, \phi; X_{1}, \dots, X_{V}) ={}& \sum_{i=1}^{N} \sum_{v=1}^{V} \mathbb{E}_{Z^{(i)} \sim q_{\phi}(Z^{(i)} \mid X_{1}^{(i)}, \dots, X_{V}^{(i)})}\left[\log p_{\theta}(X_{v}^{(i)} \mid Z^{(i)})\right] \\
&- \sum_{i=1}^{N} D_{KL}\left(q_{\phi}(Z^{(i)} \mid X_{1}^{(i)}, \dots, X_{V}^{(i)}) \,\|\, p_{\theta}(Z^{(i)})\right)
\end{aligned} \tag{10}$$

As shown in Eq. 10, the ELBO contained two terms: the first term on the right-hand side penalized the discrepancy between the reconstructed features and the input features, and the second term measured the KL divergence between the posterior and prior distributions. The analytical form of the KL divergence was derived according to the VAE [31], and the overall loss function was shown in Eq. 11.

$$\begin{aligned}
\mathcal{L}(\theta, \phi; X_{1}, \dots, X_{V}) ={}& \sum_{i=1}^{N} \sum_{v=1}^{V} \left( X_{v}^{(i)} \log \hat{X}_{v}^{(i)} + (1 - X_{v}^{(i)}) \log (1 - \hat{X}_{v}^{(i)}) \right) \\
&- \frac{1}{2} \sum_{i=1}^{N} \sum_{d=1}^{D} \left( (\mu_{d}^{(i)})^{2} + (\sigma_{d}^{(i)})^{2} - \log (\sigma_{d}^{(i)})^{2} - 1 \right)
\end{aligned} \tag{11}$$

where $\mu_{d}^{(i)}$ and $(\sigma_{d}^{(i)})^{2}$ were the approximate mean and variance of the posterior distribution of the $d$-th latent variable for the $i$-th subject.
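A NumPy sketch of the objective in Eq. 11 is given below. It assumes that every view has been scaled to the interval [0, 1], which the cross-entropy form of the reconstruction term requires, and that the reconstructions lie strictly inside (0, 1), which the sketch enforces by clipping. The function name, the clipping constant `eps`, and the toy inputs are illustrative.

```python
import numpy as np

def mvae_objective(views, reconstructions, mu, var, eps=1e-7):
    """Value of Eq. 11: summed reconstruction log-likelihood minus the KL term.

    views, reconstructions: lists of V arrays of shape (N, P_v), values in [0, 1].
    mu, var: fused posterior mean and variance, each of shape (N, D).
    The value is to be maximized; its negative can be used as a loss.
    """
    recon = 0.0
    for x, x_hat in zip(views, reconstructions):
        x_hat = np.clip(x_hat, eps, 1.0 - eps)      # avoid log(0)
        recon += np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))
    kl = 0.5 * np.sum(mu ** 2 + var - np.log(var) - 1.0)
    return recon - kl

# One view, one subject, two features; the posterior equals the prior, so KL = 0.
x = [np.array([[1.0, 0.0]])]
mu0, var0 = np.zeros((1, 2)), np.ones((1, 2))
uninformative = mvae_objective(x, [np.array([[0.5, 0.5]])], mu0, var0)
accurate = mvae_objective(x, [np.array([[0.9, 0.1]])], mu0, var0)
```

An uninformative reconstruction of 0.5 for every feature gives $2 \log 0.5 \approx -1.386$, and a reconstruction closer to the input gives a larger (less negative) value, as expected for an objective that is maximized.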

Since the burden score was generated according to the $M$ template metabolites, we designed an algorithm to train the MVAE based on prior knowledge, as shown in Algorithm 1.

In algorithm 1, the PGS and the LD-pruned SNPs were generated according to the WGS data; while the burden scores were calculated by the residual of the linear regression model trained using the PCs and the template metabolites. The designed algorithm assumed that the selected template metabolites exist in both the training subjects and the testing subjects. Thus, the burden scores for the template metabolites were presented in both the training subjects and the testing subjects. The designed algorithm was for feature-level metabolomics data imputation. The MVAE was built for training and prediction for one specific metabolite. To impute the missing metabolites in the dataset, multiple MVAEs were required to be trained.

In Algorithm 1, the template set contained $M$ metabolites with corresponding PGS and LD-pruned SNPs. For each predicted metabolite, the Pearson correlation between $y_{tr}$ and each $y^{i^{tr}}$ in the metabolite template set was computed. The template metabolite with the highest Pearson correlation was selected, and the burden score of the template metabolite was copied to form the training set $X_{tr} = \{X_{BS}^{i^{tr}}, X_{PGS}^{tr}, X_{LD}^{tr}\}$. Using this training set and the corresponding metabolite $y_{tr}$, the MVAE was built. During testing, the selected template metabolite was used to generate the burden score $X_{BS}^{i^{te}}$ from the template set $X_{BS}^{Template}$, and then to generate the multi-view dataset $X_{te} = \{X_{BS}^{i^{te}}, X_{PGS}^{te}, X_{LD}^{te}\}$. Using the trained MVAE, the metabolite was imputed. The testing algorithm is shown in Algorithm 2.

2.5. Model evaluation

For model evaluation, mean absolute percentage error (MAPE) and $R^{2}$-score were employed. A lower MAPE and a higher $R^{2}$-score indicate better performance. A MAPE of 0 indicates a perfect match. The $R^{2}$-score ranges from −∞ to 1, where 1 indicates a perfect match.
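Both metrics can be written in a few lines. The sketch below uses the standard definitions, with MAPE expressed as a fraction rather than a percentage, which is an assumption chosen to be consistent with thresholds such as 0.3. It also assumes that no true value is zero. The function names are illustrative.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0 is a perfect match)."""
    return np.mean(np.abs((y_true - y_pred) / y_true))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 is perfect, and values can be negative."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0])
perfect_r2 = r2_score(y, y)                           # exact predictions
mean_r2 = r2_score(y, np.full(3, y.mean()))           # predicting the mean
reversed_r2 = r2_score(y, y[::-1])                    # worse than the mean
example_mape = mape(np.array([2.0, 4.0]), np.array([1.0, 5.0]))
```

Exact predictions give an $R^{2}$ of 1, predicting the mean gives 0, and the reversed predictions give −3, which shows why the score is unbounded below. The MAPE example averages relative errors of 0.5 and 0.25 to 0.375.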

3. Results and discussion

3.1. Model performance for metabolomics data imputation

Our designed MVAE model was implemented using TensorFlow 2.5. We performed a grid search to find the optimal neural network architecture. In our implementation, the MLPs, including $\mathrm{MLP}_{v}^{\mu}$, $\mathrm{MLP}_{v}^{\Sigma}$, and $\mathrm{MLP}_{v}^{dec}$, each contained 2 fully connected layers with 128 neurons. The distribution for each view is a 64-dimensional Gaussian distribution, and the product of these Gaussian distributions is also 64-dimensional, i.e., $Z^{(i)} \in \mathbb{R}^{64}$. To generate an overall performance comparison, we counted the metabolites whose $R^{2}$-scores exceeded cut-off thresholds of 0.01, 0.05, 0.1, 0.15, and 0.2; similarly, we counted the metabolites whose MAPEs fell below thresholds of 0.1, 0.15, 0.2, and 0.3. All the subsequent results were based on independent testing data.
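The threshold-based summary can be computed as the fraction of metabolites whose per-metabolite score passes each cut-off. The sketch below uses made-up scores purely to show the bookkeeping; the function name and the toy values are illustrative and are not results from this study.

```python
import numpy as np

def fraction_passing(scores, thresholds, higher_is_better=True):
    """Fraction of metabolites whose score passes each threshold.

    For R^2-scores a metabolite passes when score > threshold; for MAPE
    (higher_is_better=False) it passes when score < threshold.
    """
    scores = np.asarray(scores, dtype=float)
    result = {}
    for t in thresholds:
        passed = scores > t if higher_is_better else scores < t
        result[t] = float(np.mean(passed))
    return result

# Hypothetical per-metabolite test scores.
toy_r2 = [0.0, 0.02, 0.12, 0.3]
toy_mape = [0.05, 0.25, 0.4, 0.6]
r2_summary = fraction_passing(toy_r2, (0.01, 0.05, 0.1, 0.15, 0.2))
mape_summary = fraction_passing(toy_mape, (0.1, 0.15, 0.2, 0.3), higher_is_better=False)
```

For the toy $R^{2}$ values, three of four metabolites exceed 0.01 and one of four exceeds 0.2. Percentages reported against thresholds in this section are of this kind.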

In our dataset, 497 metabolites were enrolled. We tested the model performance with different numbers of metabolites, $M$, used as the templates. The remaining 497 − $M$ metabolites were used to train the MVAE and evaluate the model performance. The performance of the proposed MVAE is depicted in Figure 2, and the detailed performance is shown in Tables S1 to S4 in the supplementary materials.

In Figure 2, the analysis of the model performance with respect to the number of template metabolites reveals interesting trends. As the number of template metabolites (M) increases up to 30, the model’s performance consistently improves, indicating that the inclusion of more template metabolites enhances the accuracy of the imputation. This suggests that a larger set of template metabolites provides more comprehensive information, enabling the model to make more accurate predictions.

However, beyond a certain point $(M > 35)$ , the trend changed, and the model’s performance did not exhibit consistent improvement. This observation suggests that there might be a saturation point, after which adding more template metabolites does not significantly contribute to enhancing the model’s accuracy. The results indicate that our model is robust, as it demonstrates the capacity to impute all metabolites enrolled in our dataset with only 7.04% (35/497) of known metabolites. This finding highlights the effectiveness of our approach in handling missing metabolomics data. Despite having access to only a small portion of known metabolites, our model is capable of accurately predicting and imputing the entire set of metabolites, making it a promising and practical solution for missing value imputation in mass spectrometry-based metabolomics data.

3.2. Model performance comparison

To illustrate the effectiveness of the designed MVAE, three multi-view integration algorithms were enrolled, including:

Multiview canonical correlation analysis (MCCA) [32]. MCCA extends the canonical correlation analysis (CCA) into multi-view settings. CCA is a typical subspace learning algorithm, aiming at finding the pairs of projections from different views with the maximum correlations. For more than 2 views, MCCA optimizes the sum of pairwise correlations.
Kernel CCA (KCCA) [33]. KCCA is based on MCCA; however, it adds a centered Gram matrix to perform a nonlinear transformation of the input data.
Kernel generalized CCA (KGCCA) [34]. KGCCA extends KCCA with a priori-defined graph connections between different views.

In addition, one of the novel aspects of this study is the cross-omics imputation, which incorporates both WGS data and template metabolites. To further highlight the effectiveness of our approach, we conducted a comparison by solely using metabolomics data for within-omics imputation. Specifically, the within-omics imputation involves the model utilizing the template metabolites to perform the imputation. We employed various compared models, including KNN, Ridge regression, support vector machine (SVM), RF regression, and gradient boosting regression (GBT). Each of these models used one template metabolite with the highest Pearson correlation with the imputed metabolite as input to impute the metabolite for the test subjects. Multiple compared models were constructed for different metabolites, and we evaluated the model performance using the MAPE and $R^{2}$ -score with the same cut-off thresholds as mentioned in Section 3.1. The number of template metabolites was fixed at $M = 35$ , and the comparison results are presented in Figure 3. The detailed performance is shown in Table S5 to S8 in the supplementary materials.

According to Figure 3, the designed MVAE model achieved the highest performance: at every cut-off threshold, the number of metabolites with a satisfactory $R^{2}$-score was higher than for the compared methods. Regarding MAPE, approximately 30% of the metabolites imputed by the proposed MVAE had a MAPE smaller than 0.3. This outcome indicates the robustness and superior performance of the MVAE approach in modeling high-dimensional multi-view genomics data.

As an increasing number of studies integrate WGS data with metabolomics data to gain a comprehensive understanding of the fundamental molecular underpinnings of biological processes and diseases, few methods are available to perform cross-omics-based imputation. MVAE leverages the power of variational autoencoders to learn the underlying latent representations from multi-view genomics data, enabling it to capture complex relationships and dependencies among different omics modalities. This ability to jointly model diverse genomic features contributes to the improved accuracy and reliability of the imputation process. Furthermore, MVAE demonstrates its adaptability to handle high-dimensional data, which is a common challenge in genomics research. By efficiently extracting relevant information from multiple views, MVAE effectively overcomes the curse of dimensionality and provides more accurate imputations even with limited information.

We also conducted an in-depth evaluation against within-omics imputation methods, including SVM, Ridge regression, random forest, gradient boosting, and KNN. The results demonstrated that MVAE consistently outperformed these within-omics imputation methods in terms of both $R^{2}$-scores and MAPEs. This superior performance of the cross-omics approach can be attributed to MVAE’s ability to effectively leverage information from multiple views, which enhances its imputation accuracy and robustness. The within-omics imputation methods, on the other hand, rely solely on the one template metabolite with the highest Pearson correlation, making them more limited in capturing the complex relationships and dependencies present in multi-view genomics data.

The higher $R^{2}$ -scores obtained by MVAE further emphasize its capacity to produce more accurate imputations, surpassing the performance of traditional within-omics imputation techniques. This improvement in imputation accuracy is of paramount importance in various applications, such as gene expression prediction, functional annotation, and pathway analysis, where precise imputation plays a crucial role in obtaining reliable downstream analysis results. Moreover, MVAE’s ability to achieve better MAPEs highlights its efficiency in imputing metabolites with a high degree of precision, ensuring minimal errors in the imputed values. Its ability to outperform traditional within-omics imputation methods and achieve superior $R^{2}$ -scores and MAPEs reaffirms its potential to revolutionize the field of multi-view genomics data integration and analysis. This is particularly significant in genomics research, as it reduces the impact of missing data on downstream analyses, leading to more robust and reliable interpretations.

In summary, our results highlight the effectiveness of the proposed MVAE model for modeling high-dimensional multi-view genomics data. Its higher $R^{2}$-scores compared with existing methods underline its potential for addressing the challenges of data integration and imputation in genomics research.

4. Conclusion

In this paper, we addressed the common challenge of missing data in mass spectrometry-based metabolomics by proposing a multi-view information fusion method. We presented an MVAE framework that integrates common and rare variants with template metabolites for joint feature extraction and cross-omics data imputation. By learning latent representations from both omics data types, our approach demonstrated superior imputation performance compared to conventional techniques. Using only 35 template metabolites, our method achieved $R^{2}$-scores > 0.01 for 72.13% of metabolites. These results underscore the potential of our approach to improve data completeness and enhance multi-omics integration studies. Overall, the proposed method shows the benefits of combining WGS data with metabolomics in data imputation, paving the way for more comprehensive and accurate investigations in metabolomics and precision medicine.
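As a small illustration of how a summary such as "the share of metabolites with $R^{2}$-score > 0.01" is obtained from per-metabolite scores, consider the sketch below. The function name `fraction_above` and the example scores are hypothetical and are not taken from the study.

```python
from typing import Sequence


def fraction_above(r2_scores: Sequence[float], threshold: float = 0.01) -> float:
    """Fraction of metabolites whose R^2 score strictly exceeds the threshold."""
    if not r2_scores:
        raise ValueError("r2_scores must not be empty")
    return sum(1 for score in r2_scores if score > threshold) / len(r2_scores)


# Hypothetical per-metabolite R^2 scores; negative values can occur when a
# model predicts worse than the mean of the observed values.
example_scores = [0.20, 0.005, -0.10, 0.03]
share = fraction_above(example_scores)
```

Multiplying the returned fraction by 100 gives the percentage form used in the text.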

Supplementary Material

NIHPP2310.07990V2-supplement-1.pdf (212.9 KB, pdf)

Acknowledgments

This research was supported in part by grants from the National Institutes of Health, USA (P20GM109036, R01AR069055, U19AG055373, R01AG061917, and R15HL172198) and NASA Johnson Space Center, USA contracts NNJ12HC91P and NNJ15HP23P. It was also supported in part by seed grants from the Michigan Technological University Institute of Computing and Cybersystems, a graduate fellowship from Michigan Technological University Health Research Institute, and a graduate fellowship from Portage Health Foundation.

References

[1]. Dettmer K., Aronov P.A., Hammock B.D., Mass spectrometry-based metabolomics, Mass Spectrometry Reviews 26 (2007) 51–78.
[2]. Bijlsma S., Bobeldijk I., Verheij E.R., Ramaker R., Kochhar S., Macdonald I.A., Van Ommen B., Smilde A.K., Large-scale human metabolomics studies: a strategy for data (pre-) processing and validation, Analytical Chemistry 78 (2006) 567–574.
[3]. Hrydziuszko O., Viant M.R., Missing values in mass spectrometry based metabolomics: an undervalued step in the data processing pipeline, Metabolomics 8 (2012) 161–174.
[4]. Little R.J., Rubin D.B., Statistical analysis with missing data, John Wiley & Sons, 2019.
[5]. Gelman A., Hill J., Data analysis using regression and multilevel/hierarchical models, Cambridge University Press, 2006.
[6]. Wei R., Wang J., Su M., Jia E., Chen S., Chen T., Ni Y., Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data, Sci Rep 8 (2018) 663. 10.1038/s41598-017-19120-0.
[7]. Stekhoven D.J., Bühlmann P., MissForest—non-parametric missing value imputation for mixed-type data, Bioinformatics 28 (2012) 112–118.
[8]. Hastie T., Tibshirani R., Sherlock G., Eisen M., Brown P., Botstein D., Imputing missing data for gene expression arrays, (1999).
[9]. Kim D., Li R., Dudek S.M., Ritchie M.D., ATHENA: Identifying interactions between different levels of genomic data associated with cancer clinical outcomes using grammatical evolution neural network, BioData Mining 6 (2013) 1–14.
[10]. Wang T., Shao W., Huang Z., Tang H., Zhang J., Ding Z., Huang K., MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification, Nat Commun 12 (2021) 3445. 10.1038/s41467-021-23774-w.
[11]. Wainschtein P., Jain D., Zheng Z., et al., Assessing the contribution of rare variants to complex trait heritability from whole-genome sequence data, Nat Genet 54 (2022) 263–273. 10.1038/s41588-021-00997-7.
[12]. Long T., Hicks M., Yu H.-C., Biggs W.H., Kirkness E.F., Menni C., Zierer J., Small K.S., Mangino M., Messier H., Brewerton S., Turpaz Y., Perkins B.A., Evans A.M., Miller L.A.D., Guo L., Caskey C.T., Schork N.J., Garner C., Spector T.D., Venter J.C., Telenti A., Whole-genome sequencing identifies common-to-rare variants associated with human blood metabolites, Nat Genet 49 (2017) 568–578. 10.1038/ng.3809.
[13]. Kerkhofs M.H.P.M., Haijes H.A., Willemsen A.M., Van Gassen K.L.I., Van Der Ham M., Gerrits J., De Sain-van Der Velden M.G.M., Prinsen H.C.M.T., Van Deutekom H.W.M., Van Hasselt P.M., Verhoeven-Duif N.M., Jans J.J.M., Cross-Omics: Integrating Genomics with Metabolomics in Clinical Diagnostics, Metabolites 10 (2020) 206. 10.3390/metabo10050206.
[14]. Yang T.-L., Shen H., Liu A., Dong S.-S., Zhang L., Deng F.-Y., Zhao Q., Deng H.-W., A road map for understanding molecular and genetic determinants of osteoporosis, Nat Rev Endocrinol 16 (2020) 91–103. 10.1038/s41574-019-0282-7.
[15]. Yang T.-L., Guo Y., Li J., Zhang L., Shen H., Li S.M., Li S.K., Tian Q., Liu Y.-J., Papasian C.J., Deng H.-W., Gene-gene interaction between RBMS3 and ZNF516 influences bone mineral density, J Bone Miner Res 28 (2013) 828–837. 10.1002/jbmr.1788.
[16]. Greenbaum J., Su K.-J., Zhang X., Liu Y., Liu A., Zhao L.-J., Luo Z., Tian Q., Shen H., Deng H.-W., A multiethnic whole genome sequencing study to identify novel loci for bone mineral density, Human Molecular Genetics 31 (2022) 1067–1081. 10.1093/hmg/ddab305.
[17]. Qiu C., Yu F., Su K., Zhao Q., Zhang L., Xu C., Hu W., Wang Z., Zhao L., Tian Q., Wang Y., Deng H., Shen H., Multi-omics Data Integration for Identifying Osteoporosis Biomarkers and Their Biological Interaction and Causal Mechanisms, iScience 23 (2020) 100847. 10.1016/j.isci.2020.100847.
[18]. Schrimpe-Rutledge A.C., Codreanu S.G., Sherrod S.D., McLean J.A., Untargeted Metabolomics Strategies-Challenges and Emerging Directions, J. Am. Soc. Mass Spectrom. 27 (2016) 1897–1905. 10.1007/s13361-016-1469-y.
[19]. Li H., Durbin R., Fast and accurate short read alignment with Burrows-Wheeler transform, Bioinformatics 25 (2009) 1754–1760. 10.1093/bioinformatics/btp324.
[20]. McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M.A., The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data, Genome Res. 20 (2010) 1297–1303. 10.1101/gr.107524.110.
[21]. Bridgewater Br E.A., High Resolution Mass Spectrometry Improves Data Quantity and Quality as Compared to Unit Mass Resolution Mass Spectrometry in High-Throughput Profiling Metabolomics, Metabolomics 04 (2014). 10.4172/2153-0769.1000132.
[22]. Evans A.M., DeHaven C.D., Barrett T., Mitchell M., Milgram E., Integrated, Nontargeted Ultrahigh Performance Liquid Chromatography/Electrospray Ionization Tandem Mass Spectrometry Platform for the Identification and Relative Quantification of the Small-Molecule Complement of Biological Systems, Anal. Chem. 81 (2009) 6656–6667. 10.1021/ac901536h.
[23]. Dudbridge F., Power and predictive accuracy of polygenic risk scores, PLoS Genetics 9 (2013) e1003348.
[24]. Birling M.-C., Yoshiki A., Adams D.J., et al., The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation, Nat Genet 53 (2021) 416–419. 10.1038/s41588-021-00825-y.
[25]. Wand H., Lambert S.A., Tamburro C., Iacocca M.A., O’Sullivan J.W., Sillari C., Kullo I.J., Rowley R., Dron J.S., Brockman D., Venner E., McCarthy M.I., Antoniou A.C., Easton D.F., Hegele R.A., Khera A.V., Chatterjee N., Kooperberg C., Edwards K., Vlessis K., Kinnear K., Danesh J.N., Parkinson H., Ramos E.M., Roberts M.C., Ormond K.E., Khoury M.J., Janssens A.C.J.W., Goddard K.A.B., Kraft P., MacArthur J.A.L., Inouye M., Wojcik G.L., Improving reporting standards for polygenic scores in risk prediction studies, Nature 591 (2021) 211–219. 10.1038/s41586-021-03243-6.
[26]. Calus M.P.L., Vandenplas J., SNPrune: an efficient algorithm to prune large SNP array and sequence datasets based on high linkage disequilibrium, Genet Sel Evol 50 (2018) 34. 10.1186/s12711-018-0404-z.
[27]. Lee S., Emond M.J., Bamshad M.J., Barnes K.C., Rieder M.J., Nickerson D.A., Christiani D.C., Wurfel M.M., Lin X., Optimal Unified Approach for Rare-Variant Association Testing with Application to Small-Sample Case-Control Whole-Exome Sequencing Studies, The American Journal of Human Genetics 91 (2012) 224–237. 10.1016/j.ajhg.2012.06.007.
[28]. Kingma D.P., Welling M., Auto-Encoding Variational Bayes, (2014). http://arxiv.org/abs/1312.6114 (accessed October 11, 2022).
[29]. Shi Y., Siddharth N., Paige B., Torr P.H.S., Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models, (2019). http://arxiv.org/abs/1911.03393 (accessed August 7, 2022).
[30]. Cao Y., Fleet D.J., Generalized Product of Experts for Automatic and Principled Fusion of Gaussian Process Predictions, (2015). http://arxiv.org/abs/1410.7827 (accessed August 3, 2022).
[31]. Cinelli L.P., Marins M.A., Da Silva E.A.B., Netto S.L., Variational methods for machine learning with applications to deep networks, Springer, 2021.
[32]. Kettenring J.R., Canonical analysis of several sets of variables, Biometrika 58 (1971) 433–451. 10.1093/biomet/58.3.433.
[33]. Arora R., Livescu K., Kernel CCA for multi-view learning of acoustic features using articulatory measurements, in: Symposium on Machine Learning in Speech and Language Processing, 2012.
[34]. Tenenhaus A., Philippe C., Frouin V., Kernel Generalized Canonical Correlation Analysis, Computational Statistics & Data Analysis 90 (2015) 114–131. 10.1016/j.csda.2015.04.004.
