Bioinformatics Advances. 2024 Aug 24;4(1):vbae123. doi: 10.1093/bioadv/vbae123

An extension of latent unknown clustering integrating multi-omics data (LUCID) incorporating incomplete omics data

Yinqi Zhao 1,2, Qiran Jia 2, Jesse Goodrich 3, Burcu Darst 4, David V Conti 5
Editor: Franca Fraternali
PMCID: PMC11368387  PMID: 39224838

Abstract

Motivation

Latent unknown clustering integrating multi-omics data is a novel statistical model designed for multi-omics data analysis. It integrates omics data with exposures and an outcome through a latent cluster, elucidating how exposures influence processes reflected in multi-omics measurements, ultimately affecting an outcome. A significant challenge in multi-omics analysis is the issue of list-wise missingness. To address this, we extend the model to incorporate list-wise missingness within an integrated imputation framework, which can also handle sporadic missingness when necessary.

Results

Simulation studies demonstrate that our integrated imputation approach produces consistent and less biased estimates, closely reflecting true underlying values. We applied this model to data from the ISGlobal/ATHLETE “Exposome Data Challenge Event” to explore the association between maternal exposure to hexachlorobenzene and childhood body mass index by integrating incomplete proteomics data from 1301 children. The model successfully estimated proteomics profiles for two clusters representing higher and lower body mass index, characterizing the potential profiles linking prenatal hexachlorobenzene levels and childhood body mass index.

Availability and implementation

The proposed methods have been implemented in the R package LUCIDus. The source code is available at https://github.com/USCbiostats/LUCIDus.

1 Introduction

Recent developments in biotechnologies have made omics data available for numerous cohort studies. For example, the Human Early-Life Exposome project (HELIX) measured molecular omics signatures, including DNA methylation, whole blood transcription, metabolites, and plasma proteins from 1301 children at the age of 6–11 in six European countries (Maitre et al. 2018). Such omics-rich cohort studies provide unprecedented opportunities to investigate the direct and indirect effects of exposures on complex disease phenotypes and to characterize the biological processes underlying these associations. Despite the potential, the non-independence and high dimensionality of multi-omics data bring up challenges in integrated statistical analysis, and innovative statistical methods are needed to address these issues.

Integrative genomic studies typically focus on linking the genome, epigenome, and transcriptome to a phenotype directly (Kristensen et al. 2014, Ritchie et al. 2015). Ritchie et al. summarized several existing methods and strategies of genomics integration including meta-dimensional and multi-staged analyses to enhance the understanding of the effects of genetics and genomics on complex outcomes. In contrast, environmental epidemiology studies with multi-omics data often aim to investigate patterns of multi-omics measurements, such as metabolites and proteins, and effects on a health outcome as a result of environmental exposures that precede current measurements or outcomes (Maitre et al. 2018, 2022, Jin et al. 2020, Stratakis et al. 2020, Wu et al. 2023). Guided by the underlying biology or the temporal sequence of measurements, these studies often share a common structure that relates the exposures to intermediate factors capturing transitional processes that ultimately result in an outcome. This suspected structure leads to analysis that can integrate multiple omics data acting on a disease or trait outcome via mediation or a latent structured model. Baccarelli et al. gave motivation for this type of precision environmental health in more detail (Baccarelli et al. 2023).

In terms of specific statistical methods for integrating multi-omics data, integrative clustering is a powerful and common approach to achieve dimension reduction while extracting key information (Pierre-Jean et al. 2020). An unsupervised clustering method called iCluster was proposed to conduct integrative clustering of multi-omics data using a joint latent variable model estimated by the expectation maximization (EM) algorithm (Shen et al. 2009, Shen et al. 2012, Mo et al. 2013). Pierre-Jean et al. also introduced other clustering methods including sparse generalized canonical correlation analysis (SGCCA) and similarity network fusion (SNF) (Tenenhaus et al. 2014, Wang et al. 2014). Besides clustering, other dimension reduction methods utilizing the decomposition of variance framework were proposed including joint and individual variation explained (JIVE), which functions as an extension of principal component analysis (PCA) (Lock et al. 2013). When taking the exposure into account, mediation models have been implemented to explore the underlying mechanism among exposures, multi-omics, and a phenotype. Linking clustering approaches with environmental exposures often results in a two-step analysis in which clusters are estimated first and then subsequent mediation analysis is performed. Alternatively, high-dimensional mediation analysis may be directly performed. Song et al. extended their previous causal mediation analysis to high-dimensional multi-omics data by utilizing a Bayesian linear mixed model with continuous shrinkage priors on the key coefficients to obtain sparsity (Song et al. 2020a). Albert et al. and Derkach et al. presented useful statistical tools that incorporated building a latent variable model under the causal mediation framework (Albert et al. 2016, Derkach et al. 2019). Finally, to disentangle this complicated biological process while combining the advantages of both clustering and mediation analysis, Peng et al. developed another model called latent unknown clustering integrating multi-omics data (LUCID). The LUCID model conducts integrative analysis linking omics data with exposomes and an outcome via a latent cluster to delineate distinct risk groups and exploit the underlying causal relationships among the variables of interest. This approach accounts for high-dimensional data by utilizing an L1 penalty (Tibshirani 1996) to obtain a sparse solution and facilitate model interpretation (Peng et al. 2020). This model has successfully identified biologically relevant omics features which link exposures with different disease phenotypes (Jin et al. 2020, Kasper et al. 2020, Stratakis et al. 2020, Maitre et al. 2022, Matta et al. 2022, Wu et al. 2023). For integrative genomic studies in which environmental exposures are not the primary focus, the LUCID model remains valuable. For example, LUCID can be used with germline genetic variants such as single nucleotide polymorphisms (SNPs) or polygenic risk scores (PRS) as the exposures. In this context, LUCID distinguishes the effects of germline genetic variants as they precede other omics data, focusing on how genetics influences multiple omics levels and ultimately impacts the outcome. Moreover, the LUCID model aids in statistical estimation, as this genetic component can be coded as binary or ordinal, while other omics features are often continuous and high-dimensional, which makes naive integration inappropriate (Ritchie et al. 2015). Goodrich et al. provide a more detailed discussion of these various statistical approaches and their pros and cons in the context of multi-omic analysis (Goodrich et al. 2024).

A significant challenge in the integrative analysis of multi-omics data is the problem of missingness for the omics measurements. In large cohort studies with exposures and outcomes measured for all individuals, it is common for some omics data to not be available for all participants due to budget limits or other factors such as failure to extract samples (Little and Rubin 2019). In the HELIX study, for example, urine/serum metabolomics data are available for 1198 children, while miRNA is available for only 941 children (Maitre et al. 2018). This scenario, in which omics measurements are only available in a subset of samples, is known as a list-wise missing pattern. List-wise missingness can arise either completely randomly or due to systematic factors related to other measured variables, such as exposomes and disease phenotypes. Therefore, list-wise missingness is considered a scenario of either missing completely at random (MCAR) or missing at random (MAR) (Little and Rubin 2019). Another commonly observed missing pattern is when missing values occur in omics measurements across both samples and features, potentially due to measurement error, insufficient sample availability, or experimental constraints (Song et al. 2020b). Such a missing pattern is also assumed to be MCAR or at least MAR (Little and Rubin 2019), and is referred to as sporadic missingness. In practice, however, missingness in omics data across both samples and features is less likely to be MCAR or MAR and more likely to be missing not at random (MNAR). For example, missing values might emerge because the actual levels are below or beyond the limit of detection (LOD) of the technology (Yu et al. 2014).

There are several statistical methods available for addressing non-list-wise missing omics data when they are MCAR or MAR. Imputation methods based on chained equations, including predictive mean matching, are implemented in the R package mice (Buuren and Groothuis-Oudshoorn 2011). Likelihood-based methods, such as the EM algorithm, are also popular approaches given their ease of implementation and flexible statistical framework (Little and Rubin 2019), with several approaches implemented in the R package missMethods (Rockel 2022). For integrated analysis, the above approaches would have to be implemented in a two-stage process. For example, Scrucca et al. proposed a two-step approach by first imputing missing values under a general location model and then conducting a clustering algorithm on the imputed dataset within a Gaussian mixture model (GMM) (Schafer 1997, Scrucca et al. 2016). This is implemented in the R package mclust. Zhang et al. extended the EM algorithm under the framework of a GMM to conduct clustering and imputation of missing data simultaneously (Zhang et al. 2021). The above approaches for missing data are appropriate for sporadic missingness, but for list-wise missing patterns most of these methods will treat the missing rows as MCAR and randomly generate new observations based on their estimated correlation structures, which overlooks the underlying MAR mechanism in which the missingness might be related to exposures and the outcome. As an alternative, complete-case analysis (i.e. limiting the analysis to only individuals with observations for all variables) is easy to implement but might not be viable if a large number of samples have missing data (Pigott 2001). In the context of the LUCID model, these methods are potentially less efficient because they do not effectively incorporate the information from the exposures and the outcome, even though LUCID assumes that omics levels are associated with both.

In this article, we extend the previously proposed LUCID model for integrated omics analysis to address the problem of list-wise missingness (or missing rows) in omics data with the assumption of MCAR or at least MAR. We derive the joint likelihood for LUCID by allowing omics data to be potentially missing. We propose a likelihood partition method for list-wise missingness and an integrated imputation framework for sporadic missingness, with both approaches implemented within an EM algorithm for maximum likelihood estimation. Although implemented, sporadic missingness should be approached carefully as it also relies on the strong assumptions of MCAR or MAR. We evaluate the performance of our approach through extensive simulation studies and demonstrate the advantage of the proposed method over complete-case analysis and other imputation methods, particularly for list-wise missingness. Finally, to illustrate the practical usefulness of the proposed method for addressing list-wise missingness in real omics data, we evaluate the impact of prenatal hexachlorobenzene (HCB) on childhood body mass index (BMI) with the integration of proteomic measurements. This analysis uses the publicly available “challenge data” from the ISGlobal/ATHLETE “Exposome Data Challenge Event”, simulated from the HELIX data (Maitre et al. 2022).

2 Methods

2.1 LUCID with complete omics data

We first review the statistical framework of the LUCID model with complete omics data. LUCID jointly models the genomic/environmental exposures G, other omics data Z, and phenotype trait Y (Fig. 1). Suppose we have a sample of n observations indexed by i = 1, 2, …, n. Let G be an n×p matrix with columns representing genetic or environmental exposures and rows being observations; Z be an n×m matrix of omics data with complete measurements; and Y be an n-length vector of the phenotype trait.

Figure 1.

DAG of LUCID model. The squares represent observed data G, Z, and Y, the circles represent unobserved latent variables (clusters) and model parameters, and the diamond refers to L1 penalty terms for regularization. Cov G and Cov Y represent covariates to be adjusted in the LUCID model. Missingness is allowed in omics data and divided into subsets of observations with complete measurements and observations with missingness.

The three data components (G, Z, and Y) are linked through a latent variable X consisting of k categories, each representing a latent cluster in the sample. In practice, k can be set based on prior knowledge or chosen via a grid search based on the overall model fit evaluated by the Bayesian information criterion (BIC).

The DAG in Fig. 1 implies the conditional independence among the distribution of X given G, Z given X, and Y given X. Additionally, we assume G, Z, and Y are measured through a prospective sampling procedure, so we do not model the distribution of G.

Since X is a discrete variable with k categories (indexed by j = 1, 2, …, k), we assume that X follows a multinomial distribution conditioning on G, denoted by the softmax function S(X=j|G;β). We further assume that omics data Z follows a multivariate Gaussian distribution conditioning on X, denoted by ϕ(Z|X=j;μj,Σj), where μj and Σj are the mean vector and variance–covariance matrix, respectively, for latent cluster j. This assumption fits in the model-based clustering framework (Fraley and Raftery 2002). To include more flexible geometric features of latent clusters, such as volume, shape, and orientation determined by Σj, we use the parameterization of variance–covariance matrices by the eigenvalue decomposition in the form of

$$ \Sigma_j = \lambda_j D_j A_j D_j^{T} \tag{1} $$

where $\lambda_j$ is a scalar, $D_j$ is the orthogonal matrix of eigenvectors, and $A_j$ is a diagonal matrix whose values are proportional to eigenvalues (Banfield and Raftery 1993). The outcome Y is either a continuous or a binary variable. For illustration purposes, we assume Y is a continuous outcome following a Gaussian distribution denoted by $\phi(Y \mid \gamma_j, \sigma_j^2)$, where $\gamma_j$ is the cluster-specific effect and $\sigma_j^2$ is the cluster-specific variance. The derivation for a binary outcome can be found elsewhere (Peng et al. 2020). Denoting the observed data by $D = \{G, Z, Y\}$, the joint log-likelihood of the LUCID model is constructed as:

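$$ l(\Theta \mid D) = \sum_{i=1}^{n} \log f(Z_i, Y_i \mid G_i; \Theta) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} S(X_i = j \mid G_i; \beta)\, \phi(Z_i \mid X_i = j; \mu_j, \Sigma_j)\, \phi(Y_i \mid \gamma_j, \sigma_j^2) \tag{2} $$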

where Θ is the generic notation for all parameters in the LUCID model.

Because X is a latent variable, we use an EM algorithm to obtain the maximum likelihood estimator (MLE) of Θ in (2). We define I(Xi=j) as an indicator function representing that observation i belongs to the latent cluster j. Then the log-likelihood function in (2) can be written as:

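$$ l(\Theta \mid D) = \sum_{i=1}^{n} \sum_{j=1}^{k} I(X_i = j) \left( \log S(X_i = j \mid G_i; \beta_j) + \log \phi(Z_i \mid \mu_j, \Sigma_j) + \log \phi(Y_i \mid \gamma_j, \sigma_j^2) \right) \tag{3} $$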

We define the responsibility, r, as the posterior inclusion probability (PIP) of observation i belonging to latent cluster j, given observed data and current estimations of Θ at iteration t, which is

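$$ r_{ij}^{(t)} = E\left[ I(X_i = j) \mid D; \Theta^{(t)} \right] = P(X_i = j \mid G_i, Z_i, Y_i; \Theta^{(t)}) = \frac{S(X_i = j \mid G_i; \beta_j^{(t)})\, \phi(Z_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})\, \phi(Y_i \mid \gamma_j^{(t)}, \sigma_j^{2(t)})}{\sum_{j=1}^{k} S(X_i = j \mid G_i; \beta_j^{(t)})\, \phi(Z_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})\, \phi(Y_i \mid \gamma_j^{(t)}, \sigma_j^{2(t)})} \tag{4} $$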
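
The responsibility in Equation (4) can be illustrated with a minimal sketch. It assumes a single omics feature, so that Z is univariate (the model itself uses a multivariate Gaussian), and all function names and parameter values are hypothetical rather than taken from LUCIDus.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def softmax(scores):
    """Numerically stable softmax over a list of linear predictors."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def responsibilities(g, z, y, beta, mu, var_z, gamma, var_y):
    """Posterior cluster probabilities for one observation, following Equation (4).

    g is a list of exposures; beta[j] = [intercept, slopes...] for cluster j.
    z and y are scalars here (an assumption made to keep the sketch short).
    """
    scores = [b[0] + sum(bk * gk for bk, gk in zip(b[1:], g)) for b in beta]
    prior = softmax(scores)  # S(X = j | G; beta_j)
    joint = [
        prior[j] * normal_pdf(z, mu[j], var_z[j]) * normal_pdf(y, gamma[j], var_y[j])
        for j in range(len(beta))
    ]
    total = sum(joint)
    return [w / total for w in joint]

# Hypothetical parameter values for two clusters and one exposure.
r = responsibilities(
    g=[1.0], z=1.8, y=0.9,
    beta=[[0.0, 0.0], [0.0, 0.5]],
    mu=[0.0, 2.0], var_z=[1.0, 1.0],
    gamma=[0.0, 1.0], var_y=[1.0, 1.0],
)
```

Here the observation lies closer to the second cluster in both Z and Y, and the exposure also favors that cluster, so the second entry of `r` is the larger one.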

At iteration t, the E-step of the EM algorithm calculates the expectation of the complete data likelihood, denoted by Q(Θ|D, Θ^(t))

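$$ Q(\Theta \mid D, \hat{\Theta}^{(t)}) = \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij}^{(t)} \log S(X_i = j \mid G_i; \beta_j) + \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij}^{(t)} \log \phi(Z_i \mid \mu_j, \Sigma_j) + \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij}^{(t)} \log \phi(Y_i \mid \gamma_j, \sigma_j^2) \tag{5} $$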

The M-step maximizes (5) in terms of Θ, which results in the following estimations for iteration t+1:

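$$ \beta_j^{(t+1)} = \arg\max_{\beta} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij}^{(t)} \log S(X_i = j \mid G_i; \beta_j) \tag{6} $$
$$ \mu_j^{(t+1)} = \frac{\sum_{i=1}^{n} r_{ij}^{(t)} Z_i}{\sum_{i=1}^{n} r_{ij}^{(t)}} \tag{7} $$
$$ \Sigma_j^{(t+1)} = \frac{\sum_{i=1}^{n} r_{ij}^{(t)} \left( Z_i - \mu_j^{(t+1)} \right) \left( Z_i - \mu_j^{(t+1)} \right)^{T}}{\sum_{i=1}^{n} r_{ij}^{(t)}} \tag{8} $$
$$ \gamma_j^{(t+1)} = \frac{\sum_{i=1}^{n} r_{ij}^{(t)} Y_i}{\sum_{i=1}^{n} r_{ij}^{(t)}} \tag{9} $$
$$ \sigma_j^{2(t+1)} = \frac{\sum_{i=1}^{n} r_{ij}^{(t)} \left( Y_i - \gamma_j^{(t+1)} \right)^2}{\sum_{i=1}^{n} r_{ij}^{(t)}} \tag{10} $$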

Note that (8) is a closed-form solution for Σj without any geometric constraints. Celeux and Govaert provide a detailed discussion of maximizing Σj, parameterized by the eigenvalue decomposition in (1) (Celeux and Govaert 1995). The R package mclust implements their algorithm (Scrucca et al. 2016), which we use to update Σj in the M-step.
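
The closed-form updates for the outcome parameters in Equations (9) and (10) are responsibility-weighted means and variances. The sketch below shows them on a four-observation toy example; the data and responsibilities are made up for illustration, and the helper names are hypothetical.

```python
def weighted_mean(values, weights):
    """Responsibility-weighted mean, the form used in Equations (7) and (9)."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def weighted_var(values, weights, mean):
    """Responsibility-weighted variance around a given mean, as in Equation (10)."""
    return sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / sum(weights)

# Toy outcome values and responsibilities r[i][j] for two clusters (illustrative only).
y = [0.1, -0.2, 1.9, 2.2]
r = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

gamma, sigma2 = [], []
for j in range(2):
    w = [row[j] for row in r]              # column j of the responsibility matrix
    g_j = weighted_mean(y, w)              # Equation (9)
    gamma.append(g_j)
    sigma2.append(weighted_var(y, w, g_j)) # Equation (10), using the updated mean
```

The same weighted-mean form gives Equation (7) when applied feature by feature to Z; the constrained covariance update is more involved and is delegated to mclust in the article.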

2.2 LUCID with missing omics data

2.2.1 List-wise missing omics data

To incorporate list-wise missing omics data in the LUCID model, we propose to use the likelihood partition technique illustrated in Fig. 2A. The observations are divided into two disjoint subsets: subset $\{i_o = 1, 2, \ldots, n_o\}$ such that $Z_{i_o}$ is observed and subset $\{i_m = 1, 2, \ldots, n_m\}$ such that $Z_{i_m}$ is completely missing. The log-likelihood function of the sample can be written as the sum of two components: (1) the joint log-likelihood of the subset $\{i_o\}$, denoted by $l_o(\Theta \mid D)$, and (2) the joint log-likelihood of the subset $\{i_m\}$, denoted by $l_m(\Theta \mid D)$. The form of $l_o(\Theta \mid D)$ remains the same as (3), while that of the subset $\{i_m\}$ becomes

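$$ l_m(\Theta \mid D) = \sum_{i_m=1}^{n_m} \sum_{j=1}^{k} I(X_{i_m} = j) \left( \log S(X_{i_m} = j \mid G_{i_m}; \beta_j) + \log \phi(Y_{i_m} \mid \gamma_j, \sigma_j^2) \right) \tag{11} $$

Figure 2.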

Missing patterns that LUCID assumes. (A) Illustration of the list-wise missing pattern; (B) illustration of the sporadic missing pattern; (C) illustration of a more general case with a combination of list-wise and sporadic missing patterns.

We can obtain the MLE of Θ under a list-wise missing pattern via a modification of the E-step of the EM algorithm discussed in Section 2.1. Equation (11) explicitly points out that lm(Θ|D) only consists of likelihood components related to G and Y. This results in the corresponding responsibility for the subset {im}, which is

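$$ r_{i_m j}^{(t)} = E\left[ I(X_{i_m} = j) \mid D; \Theta^{(t)} \right] = P(X_{i_m} = j \mid G_{i_m}, Z_{i_m}, Y_{i_m}; \Theta^{(t)}) = \frac{S(X_{i_m} = j \mid G_{i_m}; \beta_j^{(t)})\, \phi(Y_{i_m} \mid X_{i_m} = j; \gamma_j^{(t)}, \sigma_j^{2(t)})}{\sum_{j=1}^{k} S(X_{i_m} = j \mid G_{i_m}; \beta_j^{(t)})\, \phi(Y_{i_m} \mid X_{i_m} = j; \gamma_j^{(t)}, \sigma_j^{2(t)})} \tag{12} $$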

For subset {io}, rioj(t) is the same as (4). Therefore, in the E-step, the expectation of the log-likelihood of LUCID with list-wise missing omics data (denoted as Q(Θ|D)) can be partitioned into two parts

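$$
\begin{aligned}
Q(\Theta \mid D) &= E\left[ l_o(\Theta \mid D) \right] + E\left[ l_m(\Theta \mid D) \right] \\
&= \sum_{i_o=1}^{n_o} \sum_{j=1}^{k} r_{i_o j}^{(t)} \left( \log S(X_{i_o} = j \mid G_{i_o}; \beta_j) + \log \phi(Z_{i_o} \mid X_{i_o} = j; \mu_j, \Sigma_j) + \log \phi(Y_{i_o} \mid X_{i_o} = j; \gamma_j, \sigma_j^2) \right) \\
&\quad + \sum_{i_m=1}^{n_m} \sum_{j=1}^{k} r_{i_m j}^{(t)} \left( \log S(X_{i_m} = j \mid G_{i_m}; \beta_j) + \log \phi(Y_{i_m} \mid X_{i_m} = j; \gamma_j, \sigma_j^2) \right) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij}^{(t)} \left( \log S(X_i = j \mid G_i; \beta_j) + \log \phi(Y_i \mid X_i = j; \gamma_j, \sigma_j^2) \right) + \sum_{i_o=1}^{n_o} \sum_{j=1}^{k} r_{i_o j}^{(t)} \log \phi(Z_{i_o} \mid X_{i_o} = j; \mu_j, \Sigma_j)
\end{aligned}
\tag{13}
$$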

In the M-step, the maximization of $\beta_j^{(t+1)}$, $\gamma_j^{(t+1)}$, and $\sigma_j^{2(t+1)}$ remains the same as (6), (9), and (10), respectively, since the likelihood components related to those parameters consist of all observations. We only need to replace $r_{ij}^{(t)}$ by $r_{i_m j}^{(t)}$ if $i \in \{i_m\}$. In contrast, the parameters associated with the omics data, $\mu_j$ and $\Sigma_j$, are updated only based on observations in subset $\{i_o\}$.
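
The effect of the likelihood partition on the E-step can be sketched as follows: when the omics row is missing, the Gaussian term for Z is simply left out of the product, which turns Equation (4) into Equation (12). The sketch assumes a univariate Z and takes the softmax values as precomputed inputs; the function name and numbers are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def responsibility_listwise(prior, z, y, mu, var_z, gamma, var_y):
    """Equation (4) when z is observed, Equation (12) when z is None.

    prior[j] stands for S(X = j | G; beta_j), assumed to be computed elsewhere.
    """
    joint = []
    for j in range(len(prior)):
        w = prior[j] * normal_pdf(y, gamma[j], var_y[j])
        if z is not None:  # omics observed: include the Z density
            w *= normal_pdf(z, mu[j], var_z[j])
        joint.append(w)
    total = sum(joint)
    return [w / total for w in joint]

# Illustrative parameters for two clusters.
params = dict(mu=[0.0, 2.0], var_z=[1.0, 1.0], gamma=[0.0, 1.0], var_y=[1.0, 1.0])
r_obs = responsibility_listwise([0.5, 0.5], z=2.0, y=1.0, **params)
r_mis = responsibility_listwise([0.5, 0.5], z=None, y=1.0, **params)
```

In this toy case the observation with missing omics is still assigned mostly to the second cluster through its outcome value, but with less certainty than the observation whose omics value is available.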

2.2.2 Sporadic missing pattern

For sporadic missing omics data, the missing mechanism is ignorable with the assumptions of MCAR or MAR, and the EM algorithm is still applicable. To deal with the sporadic missing pattern in Z, we modify the two-step optimization algorithm for GMM with missing data proposed by Zhang et al. (2021) and integrate it into the EM algorithm for LUCID, as shown in Fig. 2B.

Suppose we have omics data $Z = \{Z_1, Z_2, \ldots, Z_i, \ldots, Z_n\}$ with a sporadic missing pattern, where $Z_{ia}$ represents the observable variables for $Z_i$ and $Z_{ib}$ represents the missing values. Under the LUCID model, we set the optimization problem as follows:

$$ \arg\max_{\Theta, Z}\; l(\Theta \mid D = \{G, Z, Y\}) \quad \text{s.t. } Z_{ia} \text{ is fixed for all } i \tag{14} $$

We still use the EM algorithm discussed in Section 2.1 to optimize (14) iteratively. After initializing missing values in Z through imputation methods, the E-step and M-step remain the same at each iteration. After updating Θ, the problem is how to maximize the log-likelihood by imputing the missing part of Z while keeping the observable part of Z fixed. According to (2), optimizing $l(\Theta \mid D)$ with respect to Z is only related to the likelihood component $\log \phi(Z \mid \mu, \Sigma)$. Therefore, the optimization problem is equivalent to

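$$ \arg\max_{Z} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij}^{(t)} \log \phi(Z_i \mid \mu_j^{(t)}, \Sigma_j^{(t)}) \quad \text{s.t. } Z_{ia} \text{ is fixed for all } i \tag{15} $$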

Equation (15) can be divided into n sub-problems. Each sub-problem optimizes $Z_i$ with fixed $r_{ij}^{(t)}$, $\mu_j^{(t)}$, and $\Sigma_j^{(t)}$. For each observation $Z_i$, we re-index it into observable and missing parts, $\{Z_{ia}, Z_{ib}\}$. We divide the cluster-specific mean $\mu_j$ and variance–covariance matrix $\Sigma_j$ the same way as $Z_i$, which is shown below:

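$$ \mu_j = (\mu_{ja}, \mu_{jb}), \quad \Sigma_j^{-1} = \begin{pmatrix} \Sigma_{jaa}^{-1} & \Sigma_{jab}^{-1} \\ \Sigma_{jba}^{-1} & \Sigma_{jbb}^{-1} \end{pmatrix} \tag{16} $$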

We then take the partial derivative of (15) with respect to $Z_{ib}$ and set it to 0. The closed-form solution is

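$$ Z_{ib}^{(t+1)} = \left( \sum_{j=1}^{k} r_{ij}^{(t)} p_{ij}^{(t)} \Sigma_{jbb}^{-1(t)} \right)^{-1} \cdot \sum_{j=1}^{k} r_{ij}^{(t)} p_{ij}^{(t)} \left( \Sigma_{jba}^{-1(t)} \mu_{ja} + \Sigma_{jbb}^{-1(t)} \mu_{jb} - \Sigma_{jba}^{-1(t)} Z_{ia} \right) \tag{17} $$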

where $p_{ij}^{(t)} = \phi(Z_i^{(t)} \mid \mu_j^{(t)}, \Sigma_j^{(t)})$. Details of deriving (17) can be found in Zhang et al. (2021).

Equation (17) implies that missing values in Z are updated at each iteration based on observed values and estimated parameters of GMM. Optimization of (15) is equivalent to a dynamic imputation process inside the LUCID framework. This modified EM algorithm obtains the MLE of Θ and imputes missing values simultaneously. As described by Zhang et al., model parameters Θ and missing values in Z are updated by maximizing the expected likelihood function, which guarantees convergence of this modified EM algorithm.
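
To make the imputation step concrete, the following sketch evaluates Equation (17) in the smallest non-trivial setting: two omics features, the first observed (part a) and the second missing (part b), so that every block of the precision matrix is a scalar. The per-cluster weights passed in stand for the products $r_{ij}^{(t)} p_{ij}^{(t)}$ and are assumed to be computed elsewhere; all names and values are hypothetical.

```python
def precision_blocks(s_aa, s_ab, s_bb):
    """Return (P_ba, P_bb), the b-row blocks of the inverse of [[s_aa, s_ab], [s_ab, s_bb]]."""
    det = s_aa * s_bb - s_ab ** 2
    return -s_ab / det, s_aa / det

def impute_zb(z_a, weights, means, covs):
    """Scalar version of Equation (17).

    weights[j] stands for r_ij * p_ij, means[j] = (mu_ja, mu_jb),
    covs[j] = (s_aa, s_ab, s_bb) for cluster j.
    """
    num, den = 0.0, 0.0
    for w, (mu_a, mu_b), (s_aa, s_ab, s_bb) in zip(weights, means, covs):
        p_ba, p_bb = precision_blocks(s_aa, s_ab, s_bb)
        num += w * (p_ba * mu_a + p_bb * mu_b - p_ba * z_a)
        den += w * p_bb
    return num / den

# One cluster: the update reduces to the Gaussian conditional mean of Z_b given Z_a.
z_b_single = impute_zb(2.0, [1.0], [(0.0, 0.0)], [(1.0, 0.5, 1.0)])

# Two clusters with equal covariance: a weighted average of the two conditional means.
z_b_mix = impute_zb(
    2.0, [0.3, 0.7],
    [(0.0, 0.0), (2.0, 2.0)],
    [(1.0, 0.5, 1.0), (1.0, 0.5, 1.0)],
)
```

With a single cluster the imputed value is the familiar conditional mean, here 0 + 0.5 × (2 − 0) = 1. With several clusters the imputed value is pulled toward the clusters that carry the largest weights.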

2.2.3 Combination of both missing patterns

We combine the methods in Sections 2.2.1 and 2.2.2 and extend LUCID to address both list-wise and sporadic missing patterns (Fig. 2C). If observation i has a sporadic missing pattern, we initialize missing values in Zi and treat Zi as “completely observable”. Next, we implement the likelihood partition to handle the remaining observations with a list-wise missing pattern. After calculating Θ(t), we update the missing values Zib using Θ(t). We provide details of the EM algorithm to deal with the combination of list-wise and sporadic missingness in Algorithm 1. To initialize this modified EM algorithm, we use the R package mix to impute the missing values in Z under a general location model (Schafer 2022).

Algorithm 1.

The EM algorithm for LUCID model with one latent variable and missing values in omics data Z

Input: Multi-view data D, total number of iterations tmax, convergence tolerance ϵmax

1: Initialization:

2: Divide the index i into 3 subsets: (1) ia: individuals with complete observation in Z; (2) ib: individuals with partially observable values in Z; (3) ic: individuals with complete missing values in Z.

3: Initialize Zib0 by mix

4: Initialize μ0, Σ0 by mclust

5: Initialize β0 by nnet

6: Initialize γ0 by GLM

7: Compute the log-likelihood l1 by Equation (13)

8: ϵ ← ∞

9: t ← 0

10: EM algorithm:

11: while t < tmax and ϵ > ϵmax do

12: E-step:

13: Compute riaj(t), ribj(t) based on Equation (4) and ricj(t) based on Equation (12)

14: M-step:

15: Update β(t+1) by Equation (6)

16: Update μ(t+1), Σ(t+1) by mclust

17: Update γ(t+1) by Equation (9)

18: I-step:

19: Update Zib(t+1) by Equation (17)

20: Compute the updated log-likelihood l2 using μ(t+1), Σ(t+1), β(t+1), γ(t+1) according to Equation (13)

21: ϵ ← l2 − l1

22: l1 ← l2

23: t ← t + 1

24: end while

25: Compute rij(t) using μ(t), Σ(t), β(t), γ(t)

Output: Θ(t) = {μ(t), Σ(t), β(t), γ(t)}, r(t) and Z(t)
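
The overall loop structure of Algorithm 1 can be illustrated with a deliberately simplified toy version. The sketch below is not the LUCIDus implementation: it assumes two clusters, one omics feature, no exposures (so the softmax reduces to mixing proportions `pi`), list-wise missingness only, and a crude fixed starting point instead of the mix, mclust, nnet, and GLM initializations. All names are hypothetical.

```python
import math
import random

def pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def joint(z, y, p):
    """Per-cluster pi_j * phi(z) * phi(y); the Z term is dropped when z is None."""
    out = []
    for j in range(2):
        w = p["pi"][j] * pdf(y, p["gamma"][j], p["s2"][j])
        if z is not None:
            w *= pdf(z, p["mu"][j], p["v"][j])
        out.append(w)
    return out

def loglik(data, p):
    """Observed-data log-likelihood summed over all observations."""
    return sum(math.log(sum(joint(z, y, p))) for z, y in data)

def em(data, p, t_max=200, tol=1e-8):
    obs = [i for i, (z, _) in enumerate(data) if z is not None]
    trace = [loglik(data, p)]
    for _ in range(t_max):
        # E-step: responsibilities, with or without the Z term
        r = []
        for z, y in data:
            w = joint(z, y, p)
            r.append([wj / sum(w) for wj in w])
        # M-step: outcome parameters use everyone, omics parameters use observed rows
        for j in range(2):
            n_j = sum(ri[j] for ri in r)
            p["pi"][j] = n_j / len(data)
            p["gamma"][j] = sum(ri[j] * y for ri, (_, y) in zip(r, data)) / n_j
            p["s2"][j] = sum(ri[j] * (y - p["gamma"][j]) ** 2
                             for ri, (_, y) in zip(r, data)) / n_j
            n_oj = sum(r[i][j] for i in obs)
            p["mu"][j] = sum(r[i][j] * data[i][0] for i in obs) / n_oj
            p["v"][j] = sum(r[i][j] * (data[i][0] - p["mu"][j]) ** 2 for i in obs) / n_oj
        trace.append(loglik(data, p))
        if trace[-1] - trace[-2] <= tol:  # stop once the improvement is negligible
            break
    return p, trace

# Simulated toy data: about 30% of the omics values are list-wise missing.
rng = random.Random(0)
data = []
for _ in range(300):
    x = 1 if rng.random() < 0.4 else 0
    z = rng.gauss(3.0 * x, 1.0)
    y = rng.gauss(2.0 * x, 1.0)
    data.append((None if rng.random() < 0.3 else z, y))

start = {"pi": [0.5, 0.5], "mu": [-1.0, 1.0], "v": [1.0, 1.0],
         "gamma": [-1.0, 1.0], "s2": [1.0, 1.0]}
fit, trace = em(data, start)
```

Because both steps are exact in this toy setting, the log-likelihood stored in `trace` should not decrease from one iteration to the next, which is a convenient sanity check when experimenting with the loop.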

2.3 Software information

The described methods have been implemented in the R package LUCIDus which is available on CRAN (Zhao et al. 2022). The current version of LUCIDus is 3.0.1. LUCIDus can incorporate missing data, perform variable selection, obtain bootstrap confidence intervals, and visualize the LUCID model. It also includes a vignette covering the statistical background and example input data. Our implementation is based on the developer version of LUCIDus, which is available at https://github.com/USCbiostats/LUCIDus.

2.4 Simulation study

To showcase the robustness of the proposed integrated imputation method for handling list-wise missingness in omics data (Z), we performed comprehensive simulation studies across a range of missing ratios and compared the proposed method to other imputation methods in terms of their impact on the performance of the LUCID analysis. We generated 10 000 data points following the defined model in Fig. 1, conditional on pre-specified parameters and K=2 latent clusters characterizing low and high-risk groups. We selected K=2 for the ease of interpretation, but there may exist a more complex structure of the risk groups in the real-data analysis. Due to the conditional independence of the model, we first simulated 10 exposure variables (G). Conditional on G, we generated a cluster variable labeled X. Lastly, four omics variables Z and one outcome variable Y were generated conditional on X. For computational efficiency, we split the 10 000 observations into an 8000-sample training data set and a 2000-sample validation data set. Then, for every simulation iteration, a random sample of 2000 observations was drawn from the 8000-sample data set, and a list-wise missing pattern in Z was simulated over a grid of missing ratios. This data set was used as the training data and we analyzed the data using five methods: (1) the updated LUCID imputation framework (“L”); (2) the LUCID model based on a complete-case analysis (“complete-case”); (3) imputation using the location model implemented by the R package mclust (“impute-mclust”) followed by a LUCID analysis; (4) predictive mean matching implemented by mice (“impute-mice”) followed by a LUCID analysis; and (5) EM imputation implemented by missMethods (“impute-EM”) followed by a LUCID analysis. For each resulting LUCID model, we compared parameter estimates to the simulated truth. Using the G and Z variables from the validation data set, we used the fitted LUCID model to predict cluster assignment and outcomes. We simulated 300 replications and examined several metrics to evaluate the performance of different methods, including mean parameter estimates and corresponding standard deviations compared to true simulated values and the accuracy of clustering using the area under the curve (AUC) by comparing estimated PIP to the known simulated cluster labels of the validation data set.
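
The data-generating sequence described above (exposures first, then the latent cluster, then omics and outcome, then list-wise deletion of omics rows) can be sketched as follows. This is only an illustration: it uses a single exposure instead of 10, unit variances, completely random deletion, and made-up parameter values, none of which are taken from the actual simulation settings.

```python
import math
import random

def simulate(n, beta, mu, gamma, missing_ratio, seed=1):
    """Generate G -> X -> (Z, Y) for two clusters, then delete whole Z rows at random."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        g = rng.gauss(0.0, 1.0)                    # one exposure (the article uses 10)
        p2 = 1.0 / (1.0 + math.exp(-beta * g))     # two-cluster softmax: P(X = 2 | G)
        x = 1 if rng.random() < p2 else 0          # latent cluster label (0 or 1)
        z = [rng.gauss(m, 1.0) for m in mu[x]]     # four omics features given X
        y = rng.gauss(gamma[x], 1.0)               # continuous outcome given X
        rows.append({"g": g, "x": x, "z": z, "y": y})
    # List-wise missingness: the entire omics row is removed for a random subset.
    n_missing = round(missing_ratio * n)
    for i in rng.sample(range(n), n_missing):
        rows[i]["z"] = None
    return rows

# Hypothetical parameter values for illustration.
data = simulate(
    n=2000, beta=1.0,
    mu=[[0.0] * 4, [1.0] * 4],
    gamma=[0.0, 1.0],
    missing_ratio=0.3,
)
```

Repeating the call over a grid such as `missing_ratio` from 0.1 to 0.8, with a different `seed` per replication, mimics the structure of the simulation design, although the missingness here does not depend on G or Y.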

Though the proposed integrated imputation framework for sporadic missingness is regarded as an auxiliary functionality, we performed simulation studies under the same setting. Since it is infeasible to conduct complete-case analysis for sporadic missingness, the competing method (2) becomes the LUCID model based on mean imputation (“impute-mean”). See Supplementary section A for results of simulation studies for sporadic missingness.

2.5 Applied data description and availability

We applied LUCID to the “challenge data” from the ISGlobal/ATHLETE “Exposome Data Challenge Event” held in April 2021. This dataset was created by a simulation based on the estimated correlation structure derived from the observed HELIX sub-cohort database. The data are available in the ExposomeDataChallenge2021 repository at https://github.com/isglobal-exposomeHub/ExposomeDataChallenge2021. The HELIX project is a multi-center longitudinal cohort study aimed at exploring the effects of early-life environmental exposures on health (Maitre et al. 2018). HELIX included 1301 mother–child pairs and measured 91 exposures in pregnancy and 116 exposures in childhood (Maitre et al. 2018). Children’s multi-omics profiles (methylome, transcriptome, proteins, and metabolites) were collected, but approximately 9% of the observations did not have complete data from four of the omics layers. Relying on the proposed method to impute the missing omics data, we implemented LUCID to explore the underlying causal relationship between prenatal hexachlorobenzene (HCB) and childhood BMI with the integration of proteomic measurements. See Supplementary section B for more details on the applied analysis.

3 Results

3.1 Simulation study

Figure 3 shows the simulation results of the list-wise missingness across an increasing missingness ratio in the omics data Z from 0.1 to 0.8. For the exposure effect (the association between G and X), as the missing ratio in omics data increases, the average parameter estimates of L center around the true effect (indicated by the red dashed line) while the average parameter estimates of impute-mice are drastically biased towards 0, especially when the missing ratio is larger than 0.5. Impute-EM, impute-mclust, and impute-mice produce uniformly more biased estimates than L for high missing ratios. Regarding the uncertainty in estimation, standard deviations (SDs) show that all the methods behave similarly across scenarios (Fig. 3A). For the omics effect (the association between X and Z), L and complete-case consistently yield relatively unbiased estimates even when the missing ratio is high (>0.6), whereas impute-EM, impute-mclust, and impute-mice exhibit biased estimates even at a low missing ratio (>0.2). When the missing ratio is extremely high (0.8), complete-case is more biased than L. All other methods yield comparable SDs that are considerably smaller than those of impute-mice. Notably, L demonstrates consistently smaller SDs, particularly when the missing ratio is less than 0.5 (Fig. 3B). Similar trends are observed when estimating the outcome effect (the association between X and Y). Both L and complete-case produce less biased estimates of the outcome effect than other methods, and the SDs of L are smallest across most of the missing ratios (Fig. 3C). L results in clearly improved model performance in discriminating clusters compared to other methods in the validation set without using the outcome information. When the missing ratio increases, the median AUCs for L remain the most stable and the highest, whereas the median AUCs of complete-case, impute-EM, and impute-mclust drop moderately and the median AUCs of impute-mice drop drastically. The SDs of L are consistently the smallest, particularly when the missing ratio is high (Fig. 3D). In general, impute-EM, impute-mclust, and impute-mice present more biased estimations, as they rely solely on the estimated correlation structures of observed rows to impute unobserved rows. While complete-case offers satisfactory estimates under the assumption of MCAR, as missing rows are not dependent on G and Y, it remains inferior to L, particularly in scenarios of high missing ratios. This discrepancy can be attributed to L’s effective utilization of information from both G and Y.

Figure 3.

Simulation results for the list-wise missing pattern. (A) Exposure effect; (B) omics effect; (C) outcome effect; (D) AUC for validation observations. The horizontal dashed line on each plot represents the ground-truth effect.

3.2 Human Early-Life Exposome

For the analysis of the entire dataset, including observations with list-wise missing protein data, the supervised LUCID model estimated two latent clusters after a grid search for the optimal number of clusters (Model 1). Latent cluster 2 was associated with a higher mean scaled BMI z-score (μBMI, Cluster 1 = −1.78; μBMI, Cluster 2 = −1.52). Table 1 presents the coefficient estimates for Model 1. Figure 4 presents a histogram of the PIPs for cluster 2. The four bars in the histogram represent increasing PIPs for cluster 2, which also correspond to an increasing association with BMI z-score. Each bar is partitioned by HCB quartile according to the proportion of each quartile in that bar, and the missingness ratio for each quartile is denoted. The missingness ratios range from 2.56% to 37.50%. In addition, “risk profiles” (signatures of proteomic data) are constructed for observations within each bar by taking a weighted average of the risk profiles for latent clusters 1 and 2, with weights determined by the PIPs estimated from the LUCID analysis. Approximately 18.91% of the observations have PIPs greater than or equal to 0.75 and are characterized by high expression levels of proteins. Most participants (74.17%) fall into the first bar, with PIPs less than 0.25, and are characterized by low expression levels of proteins. The two middle bars (6.92%) include individuals characterized by medium expression levels of proteins. Overall, proteomic expression levels increase with the PIPs for cluster 2. The association between each bar and the BMI z-score is also denoted and strengthens as the PIP increases. A pattern also emerges for the HCB quartiles: lower quartiles make up a higher proportion of the bars representing lower BMI and lower PIP, while higher quartiles make up a higher proportion of the bars representing higher BMI and higher PIP. This trend is also reflected in Table 1 and in Fig. 5, the Sankey diagram of the LUCID model fitted on the entire dataset, where an increasing HCB quartile is associated with latent cluster 2 (odds ratio for the second HCB quartile = 1.22, third quartile = 1.71, and fourth quartile = 3.19), which is ultimately associated with higher BMI. We also conducted a complete-case analysis for comparison; see Supplementary Section C for details.
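The PIP-weighted construction of a risk profile described above can be illustrated with a short sketch. It assumes that a profile is a convex combination of the two cluster-specific scaled protein means reported in Table 1, weighted by the PIP for cluster 2. The function name `risk_profile` and the example PIP values are hypothetical and are not taken from the LUCIDus package or from the fitted model.

```python
# Cluster-specific scaled protein means (cluster 1, cluster 2), copied from Table 1.
CLUSTER_MEANS = {
    "IL-1 beta": (-0.38, 1.24),
    "IL-6": (-0.41, 1.35),
    "IL-8": (-0.16, 0.52),
    "Insulin": (-0.18, 0.59),
    "HGF": (-0.16, 0.52),
}


def risk_profile(pip_cluster2):
    """Return the PIP-weighted average of the two cluster profiles.

    pip_cluster2 is the posterior probability of belonging to cluster 2;
    the weight for cluster 1 is its complement.
    """
    if not 0.0 <= pip_cluster2 <= 1.0:
        raise ValueError("PIP must lie in [0, 1]")
    w2 = pip_cluster2
    w1 = 1.0 - pip_cluster2
    return {protein: w1 * m1 + w2 * m2
            for protein, (m1, m2) in CLUSTER_MEANS.items()}


# Hypothetical PIP values chosen only to illustrate low, medium, and high bars.
for pip in (0.10, 0.50, 0.90):
    profile = risk_profile(pip)
    print(pip, {protein: round(value, 2) for protein, value in profile.items()})
```

Under this assumption, every protein level moves monotonically from the cluster 1 mean toward the cluster 2 mean as the PIP increases, which matches the pattern of rising expression levels across the four bars.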

Table 1. The detailed coefficient estimates for Model 1.

                                      Cluster 1    Cluster 2
N (total = 1301)                      1013         288
Missing ratio                         10.66%       7.99%
HCB (exposure), odds ratio
  Quartile 1                          Reference level
  Quartile 2                          1.22
  Quartile 3                          1.71
  Quartile 4                          3.19
Proteomics (omics data), scaled mean
  IL-1 beta                           −0.38        1.24
  IL-6                                −0.41        1.35
  IL-8                                −0.16        0.52
  Insulin                             −0.18        0.59
  HGF                                 −0.16        0.52
Z-score of BMI (outcome), mean        −1.78        −1.52

Figure 4.

“Risk” profiles for individuals with and without measured proteomic data. The four bars of the histogram, from left to right, indicate an increasing PIP for cluster 2 and are positively correlated with BMI z-score. Each bar is partitioned by HCB quartile, and the missingness ratio for each quartile in each bar is denoted. The bar-specific omics profile (mean proteomic levels) is also presented for each bar.

Figure 5.

The Sankey diagram of the LUCID model fitted on the whole dataset with missingness. The nodes on the left represent the exposures of HCB quartiles, the middle nodes represent the latent clusters, and the nodes on the right represent the outcome of BMI z-score and proteomics. The width of the links and nodes corresponds to the effect size.

4 Discussion

In this article, assuming that missingness is MCAR or MAR, we develop an approach to handle list-wise missing values in an integrated omics analysis as an extension of the previous LUCID model. As an auxiliary feature, we also include an integrated imputation approach for sporadic missing values that are MCAR or at least MAR. Using an integrated imputation process implemented within an EM algorithm, the proposed method handles list-wise missingness through a likelihood partition and sporadic missingness by imputing the expected value at each iteration. Simulations show that, compared with traditional methods, the integrated imputation method for list-wise missingness in omics data can improve coefficient estimation and clustering in LUCID analyses. In the real-data analysis, the integrated imputation method successfully identifies the list-wise missing pattern in the data and handles the missing values accordingly.
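The two mechanisms summarized above can be sketched for a single observation in one E-step. This is a simplified illustration, not the LUCIDus implementation: it assumes Gaussian omics features with a diagonal covariance within each cluster and a Gaussian outcome, and all names (`e_step`, `prior`, `omics_means`, and so on) are hypothetical. Under these assumptions, a missing omics feature contributes no factor to the likelihood, so a list-wise missing observation is assigned to clusters using only the exposure-driven prior and the outcome, while a sporadic missing value is replaced by its expected value given the current cluster probabilities.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def e_step(prior, z, y, omics_means, omics_sds, outcome_means, outcome_sd):
    """One E-step for one observation (illustrative sketch only).

    prior          P(cluster j | exposure), one entry per cluster
    z              omics values; None marks a missing feature
                   (all None corresponds to list-wise missingness)
    y              observed continuous outcome
    omics_means    omics_means[j][k] is the mean of feature k in cluster j
    omics_sds      omics_sds[j][k] is the matching standard deviation
                   (diagonal covariance is an assumption of this sketch)
    """
    weights = []
    for j, p in enumerate(prior):
        lik = p * normal_pdf(y, outcome_means[j], outcome_sd)
        for k, zk in enumerate(z):
            if zk is not None:
                # Only observed features enter the likelihood (likelihood partition).
                lik *= normal_pdf(zk, omics_means[j][k], omics_sds[j][k])
        weights.append(lik)
    total = sum(weights)
    resp = [w / total for w in weights]

    if all(zk is None for zk in z):
        # List-wise missing: nothing is imputed in this sketch.
        return resp, None

    # Sporadic missing: fill each gap with its expected value under the
    # current responsibilities; observed values are left unchanged.
    imputed = [
        zk if zk is not None
        else sum(r * omics_means[j][k] for j, r in enumerate(resp))
        for k, zk in enumerate(z)
    ]
    return resp, imputed
```

With a full covariance matrix, the expected value of a missing feature would also depend on the observed features within each cluster. The diagonal assumption is used here only to keep the sketch short.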

One underlying assumption of the LUCID model is that the missing omics data are MCAR or at least MAR for both list-wise and sporadic missingness. This implies that any systematic missingness may be related only to observed variables, such as other omics features, exposures, and the outcome, and not to unobserved variables. In practice, list-wise missingness in large cohort omics studies is likely to be MCAR or MAR, whereas sporadic missingness is less commonly so. Sporadic missing values resulting from LOD or other MNAR scenarios remain a potential issue, and one way to mitigate it is to pre-impute these values before analysis using existing methods appropriate for LOD missingness. A potential future direction for LUCID is to incorporate the detection limit mechanism for missing values by modeling the distribution of omics data with a truncated normal distribution.

An additional issue in the application of LUCID is the selection of the number of clusters, k, for the analysis. We have chosen to use BIC because it tends to select more parsimonious LUCID models by accounting for the increase in the number of parameters in other components of the LUCID model as the number of latent clusters increases. However, BIC has intrinsic limitations such as sensitivity to prior assumptions, dependence on sample size, and lack of flexibility. Alternative approaches for choosing the optimal k, such as the Elbow method, Silhouette analysis, and the Gap statistic, can be used in conjunction with BIC to help ensure appropriate selection of the number of clusters (Rousseeuw 1987, Thorndike 1953, Tibshirani et al. 2001). However, these traditional approaches are also limited in the context of the LUCID model because they focus on how well the clusters explain the variability in the omics data Z given k, ignoring both the other components of the LUCID model and the change in the number of parameters in these components as k changes. Overall, it is essential to conduct multiple analyses while considering prior biological knowledge and the major research question when determining k, as there is no single best and unbiased approach for the LUCID model.
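As a worked illustration of BIC-based selection of k, the sketch below uses the standard definition BIC = −2 log L + p log n, where lower values are preferred. The log-likelihoods and parameter counts are hypothetical placeholders, not results from the HELIX analysis, and the helper names `bic` and `select_k` are not part of LUCIDus. Note that some software reports BIC with the opposite sign, in which case larger values are preferred.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower is better under this convention."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def select_k(candidates, n_obs):
    """Pick the number of clusters with the smallest BIC.

    candidates maps k to a (log_likelihood, n_params) pair for the fitted model.
    """
    scores = {k: bic(ll, p, n_obs) for k, (ll, p) in candidates.items()}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Hypothetical fits: the log-likelihood improves with k, but the parameter
# count grows in every model component, which the penalty term accounts for.
candidates = {2: (-9500.0, 30), 3: (-9470.0, 46), 4: (-9455.0, 62)}
best_k, scores = select_k(candidates, n_obs=1301)
for k in sorted(scores):
    print(k, round(scores[k], 1))
print("selected k:", best_k)
```

In this made-up example the gain in log-likelihood from adding clusters does not offset the penalty for the additional parameters, so the most parsimonious model is selected.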

Overall, based on the results from simulations and a real-data analysis, the key advantages of an integrated analysis with the ability to impute missing data are evident. Often, different missing patterns (e.g. list-wise versus sporadic) need different imputation approaches commonly implemented by different software. However, the new implementation of the integrated method in the R package LUCIDus automatically identifies different missing patterns and conducts imputation and LUCID data analysis accordingly (Zhao et al. 2022). This extension enables the original LUCID model to be more versatile, convenient, and accurate when it comes to real-world data applications.

Supplementary Material

vbae123_Supplementary_Data

Contributor Information

Yinqi Zhao, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, United States.

Qiran Jia, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, United States.

Jesse Goodrich, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, United States.

Burcu Darst, Public Health Sciences Division, Fred Hutch Cancer Center, Seattle, WA 98109, United States.

David V Conti, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, United States.

Supplementary data

Supplementary data are available at Bioinformatics Advances online.

Conflict of interest

None declared.

Funding

This work was supported by the National Institutes of Health [U01CA261339, P01CA196569, R01ES030364, P30ES007048, R01ES029944, and U01CA164973].

References

  1. Albert JM, Geng C, Nelson S. Causal mediation analysis with a latent mediator. Biom J 2016;58:535–48. 10.1002/bimj.201400124
  2. Baccarelli A, Dolinoy DC, Walker CL. A precision environmental health approach to prevention of human disease. Nat Commun 2023;14:2449. 10.1038/s41467-023-37626-2
  3. Banfield JD, Raftery AE. Model-based Gaussian and non-Gaussian clustering. Biometrics 1993;49:803–21. 10.2307/2532201
  4. Buuren SV, Groothuis-Oudshoorn K. mice: multivariate imputation by chained equations in R. J Stat Soft 2011;45:1–67.
  5. Celeux G, Govaert G. Gaussian parsimonious clustering models. Pattern Recognit 1995;28:781–93. 10.1016/0031-3203(94)00125-6
  6. Derkach A, Pfeiffer RM, Chen TH et al. High dimensional mediation analysis with latent variables. Biometrics 2019;75:745–56. 10.1111/biom.13053
  7. Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 2002;97:611–31. 10.1198/016214502760047131
  8. Goodrich JA, Wang H, Jia Q et al. Integrating multi-omics with environmental data for precision health: a novel analytic framework and case study on prenatal mercury induced childhood fatty liver disease. Environ Int 2024;190:108930.
  9. Jin R, McConnell R, Catherine C et al. Perfluoroalkyl substances and severity of nonalcoholic fatty liver in children: an untargeted metabolomics approach. Environ Int 2020;134:105220.
  10. Kasper C, Ribeiro D, Almeida AMD et al. Omics application in animal science—a special emphasis on stress response and damaging behaviour in pigs. Genes (Basel) 2020;11:920.
  11. Kristensen VN, Lingjærde OC, Russnes HG et al. Principles and methods of integrative genomic analyses in cancer. Nat Rev Cancer 2014;14:299–313. 10.1038/nrc3721
  12. Little R, Rubin D. Statistical Analysis with Missing Data. 3rd edn. Hoboken, NJ, USA: Wiley, 2019. 10.1002/9781119482260
  13. Lock EF, Hoadley KA, Marron JS et al. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann Appl Stat 2013;7:523–42. 10.1214/12-AOAS597
  14. Maitre L, de Bont J, Casas M et al. Human early life exposome (HELIX) study: a European population-based exposome cohort. BMJ Open 2018;8:e021311. 10.1136/bmjopen-2017-021311
  15. Maitre L, Guimbaud JB, Warembourg C, Consortium, E. D. C. P. et al. State-of-the-art methods for exposure-health studies: results from the exposome data challenge event. Environ Int 2022;168:107422. 10.1016/j.envint.2022.107422
  16. Matta K, Lefebvre T, Vigneau E et al. Associations between persistent organic pollutants and endometriosis: a multiblock approach integrating metabolic and cytokine profiling. Environ Int 2022;158:106926.
  17. Mo Q, Wang S, Seshan VE et al. Pattern discovery and cancer gene identification in integrated cancer genomic data. Proc Natl Acad Sci U S A 2013;110:4245–50. 10.1073/pnas.1208949110
  18. Peng C, Wang J, Asante I et al. A latent unknown clustering integrating multi-omics data (LUCID) with phenotypic traits. Bioinformatics 2020;36:842–50. 10.1093/bioinformatics/btz667
  19. Pierre-Jean M, Deleuze JF, Le Floch E et al. Clustering and variable selection evaluation of 13 unsupervised methods for multi-omics data integration. Brief Bioinform 2020;21:2011–30. 10.1093/bib/bbz138
  20. Pigott TD. A review of methods for missing data. Educ Res Eval 2001;7:353–83. 10.1076/edre.7.4.353.8937
  21. Ritchie MD, Holzinger ER, Li R et al. Methods of integrating data to uncover genotype–phenotype interactions. Nat Rev Genet 2015;16:85–97. 10.1038/nrg3868
  22. Rockel T. missMethods. The Comprehensive R Archive Network (CRAN), 2022. https://cran.r-project.org/package=missMethods.
  23. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 1987;20:53–65.
  24. Schafer JL. Analysis of Incomplete Multivariate Data. London, UK: Chapman & Hall/CRC, 1997. 10.1201/9780367803025
  25. Schafer OBJL. Estimation/multiple imputation for mixed categorical and continuous data. The Comprehensive R Archive Network (CRAN), 2022. https://cran.r-project.org/package=EstimationMultipleImputation.
  26. Scrucca L, Fop M, Murphy TB et al. mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J 2016;8:289–317.
  27. Shen R, Mo Q, Schultz N et al. Integrative subtype discovery in glioblastoma using iCluster. PLoS One 2012;7:e35236. 10.1371/journal.pone.0035236
  28. Shen R, Olshen AB, Ladanyi M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 2009;25:2906–12. 10.1093/bioinformatics/btp543
  29. Song M, Greenbaum J, Luttrell J et al. A review of integrative imputation for multi-omics datasets. Front Genet 2020;11:570255. 10.3389/fgene.2020.570255
  30. Song Y, Zhou X, Zhang M et al. Bayesian shrinkage estimation of high dimensional causal mediation effects in omics studies. Biometrics 2020;76:700–10. 10.1111/biom.13189
  31. Stratakis N, Conti V, Jin D et al. Prenatal exposure to perfluoroalkyl substances associated with increased susceptibility to liver injury in children. Hepatology 2020;72:1758–70.
  32. Tenenhaus A, Philippe C, Guillemot V et al. Variable selection for generalized canonical correlation analysis. Biostatistics 2014;15:569–83. 10.1093/biostatistics/kxu001
  33. Thorndike RL. Who belongs in the family? Psychometrika 1953;18:267–76.
  34. Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Stat Soc Series B (Methodol) 1996;58:267–88.
  35. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J Roy Stat Soc Series B: Stat Methodol 2001;63:411–23.
  36. Wang B, Mezlini AM, Demir F et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods 2014;11:333–7. 10.1038/nmeth.2810
  37. Wu H, Eckhardt CM, Baccarelli AA. Molecular mechanisms of environmental exposures and human disease. Nat Rev Genet 2023;24:332–44.
  38. Yu B, Zheng Y, Alexander D et al. Genetic determinants influencing human serum metabolome among African Americans. PLoS Genet 2014;10:e1004212. 10.1371/journal.pgen.1004212
  39. Zhang Y, Li M, Wang S et al. Gaussian mixture model clustering with incomplete data. ACM Trans Multimedia Comput Commun Appl 2021;17:1. Article 6. 10.1145/3408318
  40. Zhao Y, Conti D, Goodrich J et al. LUCIDus: latent unknown clustering integrating multi-view data. The Comprehensive R Archive Network (CRAN), 2022. https://cran.r-project.org/package=LUCIDus.

Articles from Bioinformatics Advances are provided here courtesy of Oxford University Press
