Author manuscript; available in PMC: 2025 Sep 1.
Published in final edited form as: Comput Stat Data Anal. 2024 Apr 30;197:107974. doi: 10.1016/j.csda.2024.107974

Bayesian Simultaneous Factorization and Prediction Using Multi-Omic Data

Sarah Samorodnitsky a,b,*, Chris H Wendt c, Eric F Lock a
PMCID: PMC11210674  NIHMSID: NIHMS1990201  PMID: 38947282

Abstract

Integrative factorization methods for multi-omic data estimate factors explaining biological variation. Factors can be treated as covariates to predict an outcome and the factorization can be used to impute missing values. However, no available methods provide a comprehensive framework for statistical inference and uncertainty quantification for these tasks. A novel framework, Bayesian Simultaneous Factorization (BSF), is proposed to decompose multi-omics variation into joint and individual structures simultaneously within a probabilistic framework. BSF uses conjugate normal priors and the posterior mode of this model can be estimated by solving a structured nuclear norm-penalized objective that also achieves rank selection and motivates the choice of hyperparameters. BSF is then extended to simultaneously predict a continuous or binary phenotype while estimating latent factors, termed Bayesian Simultaneous Factorization and Prediction (BSFP). BSF and BSFP accommodate concurrent imputation, i.e., imputation during the model-fitting process, and full posterior inference for missing data, including "blockwise" missingness. Simulation studies show that BSFP is competitive in recovering latent variation structure and demonstrate the importance of accounting for uncertainty in the estimated factorization within the predictive model. The imputation performance of BSF is examined via simulation under missing-at-random and missing-not-at-random assumptions. Finally, BSFP is used to predict lung function based on the bronchoalveolar lavage metabolome and proteome from a study of HIV-associated obstructive lung disease, revealing multi-omic patterns related to lung function decline and a cluster of patients with obstructive lung disease driven by shared metabolomic and proteomic abundance patterns.

Keywords: Bayesian factor analysis, Error propagation, Integrative factorization, Missing data, Multi-omics

1. Introduction

Despite a growing body of statistical methods for multi-omic data analysis, several studies motivate new methodology that can simultaneously (1) identify latent factors that explain variation within or across multi-omic datasets, (2) use the latent factors to predict an outcome, (3) perform missing data imputation, and (4) fully characterize uncertainty in the latent factors, outcome predictions, and imputed data. Our motivation concerns obstructive lung disease (OLD), which remains a frequent comorbidity among individuals living with HIV despite increasing usage of combination antiretroviral therapies (Hirani et al., 2011). We do not have a complete understanding of the molecular factors associated with risk of developing OLD in this population. We aim to use multi-omic data collected from bronchoalveolar lavage (lung) fluid in patients with HIV to characterize patterns of molecular variation within this unique cohort and to relate these patterns to clinical measurements of lung function.

To accomplish these tasks, we consider integrative factorization methods which decompose and partition latent variation across multiple data sources. There are several examples of exploratory integrative factorization methods, both non-Bayesian (Shen et al., 2012; Lock et al., 2013; Yang and Michailidis, 2016; Gaynanova and Li, 2019), and Bayesian (Klami et al., 2013; Chekouo et al., 2017; Argelaguet et al., 2018), which can be used to identify associations between sources in the form of low-rank structured variation. Factors underlying these low-rank structures may reflect biological patterns, though these methods do not explicitly account for a clinical or biological phenotype. A less-explored area is simultaneous factorization and prediction, which can simultaneously estimate factors while also relating them to an outcome. We can incorporate prediction into existing exploratory methods in two steps: (1) identify a small number of latent factors explaining variation in the data, and (2) use these as covariates in a predictive model as in principal components regression (Massy, 1965). This was described in Kaplan and Lock (2017) for multi-omic data, in Samorodnitsky et al. (2022) for multi-omic and multi-cohort data, and in Hellton and Thoresen (2016) for clustering. Recently, one-step simultaneous factorization and prediction procedures have been proposed (Zhang and Gaynanova, 2021; Palzer et al., 2022; Safo et al., 2022), including some Bayesian approaches (Chekouo and Safo, 2023; White et al., 2021). Two-step approaches do not propagate uncertainty associated with the estimated factorization into the predictive model. Additionally, existing one-step approaches do not provide a framework for statistical inference in the estimated factorization and predictive model. Finally, few methods, if any, can accommodate missing values in both the omics sources and outcome and provide full posterior inference for imputed values.

To fill these gaps, we propose Bayesian Simultaneous Factorization (BSF) for exploratory multi-omics integration that estimates a partitioned factorization consisting of joint structure (variation shared across sources) and individual structure (variation specific to each source). BSF can be viewed as an extension of probabilistic matrix factorization (PMF) (Mnih and Salakhutdinov, 2007; Salakhutdinov and Mnih, 2008) with a Gaussian likelihood and conjugate Gaussian priors on the loadings and scores. The posterior mode is the solution to a structured nuclear-norm penalized objective which matches Park and Lock (2020)’s UNIFAC decomposition, a multi-source factorization of joint and individual structures using a nuclear-norm penalty to estimate the structure ranks. Solving this objective achieves rank selection, motivates our choice of prior hyperparameters, and allows efficient initialization of a Gibbs sampling algorithm to estimate the posterior distributions of the factorization parameters. We also propose Bayesian Simultaneous Factorization and Prediction (BSFP), a one-step procedure which extends BSF by estimating factors underlying joint and individual structures while simultaneously relating them to an outcome. BSF and BSFP both offer full posterior inference for the estimated structures, and BSFP properly accounts for uncertainty in the predictive model due to the estimated factorization, offering posterior inference for the latent factors and the predictive model simultaneously. We focus on a continuous outcome, but this is naturally extended to accommodate any Bayesian predictive model and we describe our implementation for a binary outcome in the Supplementary Materials. 
Finally, BSF and BSFP can be used for multiple imputation, including in “blockwise” missing scenarios in which an entire sample from a source is unavailable, and offer full posterior inference for the imputed values in both the omics sources and, in the case of BSFP, the outcome-of-interest.

The remainder of our article is organized as follows: in Section 2, we review PMF, UNIFAC, and introduce BSF and BSFP. In Section 3, we compare BSFP to existing one- and two-step approaches to factorization and prediction via simulation and assess the imputation accuracy of BSF against other approaches. In Section 4, we describe applying BSFP to predict lung function using proteomic and metabolomic data to study HIV-associated OLD. We conclude with a discussion of the methodology and potential new directions.

2. Methods

2.1. Notation

We first introduce notation used throughout. Bold, uppercase letters, e.g. $\mathbf{X}$, denote matrices. Bold, lowercase letters, e.g. $\mathbf{y}$, denote vectors. Unbolded uppercase and lowercase letters, e.g. $R$ and $q$, denote scalars. For illustration, consider $q$ data sources, e.g. omics datasets, measured on $n$ samples. Let $\mathbf{X}_s : p_s \times n$ represent source $s$, containing $p_s$ biomarkers, oriented such that the columns are the samples. $\mathbf{X}_1, \dots, \mathbf{X}_q$ are linked by their columns, meaning they contain biomarkers measured on a shared set of $n$ samples. We use $\mathbf{X}_s[j,i]$, $j = 1, \dots, p_s$ and $i = 1, \dots, n$, to denote expression of the $j$th feature by the $i$th sample in source $s$. Similarly, $\mathbf{X}_s[j,\cdot]$ represents the $j$th feature across all samples (the $j$th row of $\mathbf{X}_s$) and $\mathbf{X}_s[\cdot,i]$ represents the expression values for the $i$th sample (the $i$th column of $\mathbf{X}_s$). We use subscripts and unbolded letters to index elements within a vector, e.g., $y_i$ is the $i$th entry in $\mathbf{y}$. We let $\mathbf{X} = (\mathbf{X}_1^T \cdots \mathbf{X}_q^T)^T$ denote the column-concatenated, full data matrix containing biomarkers from all $q$ sources, with $R = \mathrm{rank}(\mathbf{X})$. Let $p = \sum_{s=1}^q p_s$ represent the total number of observed biomarkers. We define the squared Frobenius norm, $\|\cdot\|_F^2$, of $\mathbf{X}_s$ as $\sum_{j=1}^{p_s}\sum_{i=1}^{n} \mathbf{X}_s[j,i]^2$, i.e. the sum of squared entries in $\mathbf{X}_s$, and the nuclear norm, $\|\cdot\|_*$, of $\mathbf{X}_s$ as $\sum_{k=1}^{R_s} \sigma_k(\mathbf{X}_s)$, where $\sigma_k(\mathbf{X}_s)$ denotes the $k$th singular value of $\mathbf{X}_s$ and $R_s = \mathrm{rank}(\mathbf{X}_s)$.
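As a concrete check of these definitions, the squared Frobenius and nuclear norms of a toy source can be computed with numpy. This is an illustrative sketch only; the matrix below is random, not data from the study:

```python
import numpy as np

rng = np.random.default_rng(0)
X_s = rng.normal(size=(5, 4))  # toy source: p_s = 5 biomarkers, n = 4 samples

# Squared Frobenius norm: sum of squared entries.
frob_sq = np.sum(X_s ** 2)

# Nuclear norm: sum of the singular values.
nuclear = np.linalg.svd(X_s, compute_uv=False).sum()

# The squared Frobenius norm also equals the sum of squared singular values.
assert np.isclose(frob_sq, np.sum(np.linalg.svd(X_s, compute_uv=False) ** 2))
```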

2.2. Review of Probabilistic Matrix Factorization (PMF)

Before describing our proposed method in Section 2.4, we review probabilistic matrix factorization (PMF) in this section and UNIFAC in Section 2.3. Consider a single real-valued matrix, $\mathbf{X} : p \times n$. PMF is a Bayesian linear factor model, in which observations in $\mathbf{X}$ are assumed to be driven by a small number, $r < \mathrm{rank}(\mathbf{X})$, of latent factors or components. These latent factors are contained in a matrix $\mathbf{V} : n \times r$, termed scores, which are mapped to the space spanned by the observed features or biomarkers by a matrix $\mathbf{U} : p \times r$, termed loadings. Assuming $\mathbf{X} = \mathbf{U}\mathbf{V}^T + \mathbf{E}$, $\mathbf{U}\mathbf{V}^T : p \times n$ is a low-rank approximation to the observed $\mathbf{X}$, where $\mathrm{Var}(\mathbf{E}[j,i]) = \sigma^2$ for $j = 1, \dots, p$, $i = 1, \dots, n$. We refer to $\mathbf{U}\mathbf{V}^T$ as a structure, as it contains structured variation underlying $\mathbf{X}$. PMF imposes the following conditional likelihood on the observed entries in $\mathbf{X}$ given $\mathbf{U}$, $\mathbf{V}$, and $\sigma^2$:

$$\mathbf{X} \mid \mathbf{U}, \mathbf{V}, \sigma^2 \sim \prod_{j=1}^{p}\prod_{i=1}^{n} \text{Normal}\!\left(\mathbf{X}[j,i] \mid \mathbf{U}[j,\cdot]\,\mathbf{V}[i,\cdot]^T,\ \sigma^2\right), \quad (1)$$

where $\text{Normal}(\cdot \mid \cdot, \cdot)$ represents the density of the univariate Gaussian distribution with mean $\mathbf{U}[j,\cdot]\mathbf{V}[i,\cdot]^T$ and variance $\sigma^2$. Mnih and Salakhutdinov (2007) impose mean-zero Gaussian priors on the factorization components, $\mathbf{U}$ and $\mathbf{V}$:

$$\mathbf{U} \mid \sigma_U^2 \sim \prod_{j=1}^{p} \text{Multivariate-Normal}\!\left(\mathbf{U}[j,\cdot] \mid \mathbf{0},\ \sigma_U^2\mathbf{I}_{r\times r}\right) \quad\text{and}\quad \mathbf{V} \mid \sigma_V^2 \sim \prod_{i=1}^{n} \text{Multivariate-Normal}\!\left(\mathbf{V}[i,\cdot] \mid \mathbf{0},\ \sigma_V^2\mathbf{I}_{r\times r}\right), \quad (2)$$

where $\mathbf{I}_{r\times r}$ is an $r \times r$ identity matrix and $\sigma_U^2, \sigma_V^2 > 0$.
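The PMF generative model in Equations 1 and 2 can be sketched in a few lines of numpy. The dimensions and variances below are arbitrary stand-ins for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, r = 50, 30, 3                  # features, samples, latent rank r < rank(X)
sigma2_U, sigma2_V, sigma2 = 1.0, 1.0, 0.25

U = rng.normal(0.0, np.sqrt(sigma2_U), size=(p, r))  # loadings, rows ~ MVN(0, sigma2_U I)
V = rng.normal(0.0, np.sqrt(sigma2_V), size=(n, r))  # scores, rows ~ MVN(0, sigma2_V I)
E = rng.normal(0.0, np.sqrt(sigma2), size=(p, n))    # entrywise Gaussian noise

X = U @ V.T + E   # observed data: rank-r structure plus noise

# Log-likelihood of X given U, V, sigma2, matching the product of normals in Eq. 1.
loglik = (-0.5 * np.sum((X - U @ V.T) ** 2) / sigma2
          - 0.5 * p * n * np.log(2 * np.pi * sigma2))
```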

2.3. Review of UNIFAC

Whereas PMF applies to a single matrix, the UNIFAC method was developed as a simultaneous low-rank decomposition of multiple matrices (e.g., multi-omic data) using random matrix theory. Consider $q$ omics sources, $\mathbf{X}_1, \dots, \mathbf{X}_q$, as defined in Section 2.1, row-centered to have mean 0. The UNIFAC decomposition is as follows:

$$\mathbf{X} = \mathbf{S} + \mathbf{E} = \mathbf{J} + \mathbf{A} + \mathbf{E} = \begin{pmatrix} \mathbf{U}_1 \\ \vdots \\ \mathbf{U}_q \end{pmatrix}\mathbf{V}^T + \begin{pmatrix} \mathbf{W}_1 & & \\ & \ddots & \\ & & \mathbf{W}_q \end{pmatrix}\begin{pmatrix} \mathbf{V}_1^T \\ \vdots \\ \mathbf{V}_q^T \end{pmatrix} + \begin{pmatrix} \mathbf{E}_1 \\ \vdots \\ \mathbf{E}_q \end{pmatrix}, \quad (3)$$

where $\mathbf{J}_1, \dots, \mathbf{J}_q$ are matrices of rank $r < R$ and $\mathbf{A}_1, \dots, \mathbf{A}_q$ are matrices of rank $r_s < R_s$. $\mathbf{J}_s$ contains joint structure, latent expression patterns shared by all sources, as it is reflected in source $s$. Individual structure, $\mathbf{A}_s$, contains latent expression patterns unique to source $s$. $\mathbf{E}_s$ reflects Gaussian noise with variance $\sigma_s^2$ not captured in the decomposition. $\mathbf{J}_s$ is decomposed into $\mathbf{V} : n \times r$, the joint scores, which contain latent factors expressed by the samples in all sources, and the joint loadings, $\mathbf{U}_s : p_s \times r$, which map these factors to the observed biomarkers in source $s$. Similarly, $\mathbf{A}_s$ is decomposed into $\mathbf{V}_s : n \times r_s$, the individual scores, which contain the latent factors unique to each source, and the individual loadings, $\mathbf{W}_s : p_s \times r_s$, which map the factors to the feature space spanned by the biomarkers in source $s$. Park and Lock (2020) (and the extension described in Lock et al. (2022)) propose estimating the UNIFAC decomposition in Equation 3 by minimizing the following structured nuclear-norm penalized objective:

$$\{\hat{\mathbf{J}}_s, \hat{\mathbf{A}}_s \mid s = 1, \dots, q\} = \operatorname*{arg\,min}_{\{\mathbf{J}_s, \mathbf{A}_s\}_{s=1}^{q}} \frac{1}{2}\sum_{s=1}^{q}\|\mathbf{X}_s - \mathbf{J}_s - \mathbf{A}_s\|_F^2 + \lambda\|\mathbf{J}\|_* + \sum_{s=1}^{q}\lambda_s\|\mathbf{A}_s\|_*, \quad (4)$$

where $\hat{\mathbf{J}}_s = \hat{\mathbf{U}}_s\hat{\mathbf{V}}^T$ and $\hat{\mathbf{A}}_s = \hat{\mathbf{W}}_s\hat{\mathbf{V}}_s^T$. Equation 4 is convex and minimized using an iterative soft singular value thresholding algorithm on the singular values of the structures. The algorithm alternates between fixing the individual structures at their most up-to-date values to estimate the joint structure and vice versa. At each iteration, each structure is estimated by minimizing the squared Frobenius norm difference between the observed data and the structure plus a nuclear norm penalty on the structure being estimated. The solution to this sub-problem is given by a soft-thresholding operator on the singular values of the observed data, which retains the left- and right-singular vectors, i.e., the factors, but shrinks each singular value by $\lambda$ (for joint structure) or $\lambda_s$ (for individual structure), setting to 0 any that fall below the threshold. Thus, rank selection for the joint and individual structures is a function of the tuning parameters, $\lambda$ and $\lambda_s$. Fixing $\lambda_s = \sqrt{n} + \sqrt{p_s}$ is a reasonable choice because it provides a tight upper bound on the largest singular value of the error, $\mathbf{E}_s$, assuming the sources have unit error variance, $\sigma_s^2 = 1$ (Rudelson and Vershynin, 2010). This effectively retains structure driven by components describing biological or technical variation in the data that cannot be attributed to error or random noise. This choice also meets the requirements established in Park and Lock (2020) for a uniquely identifiable and non-zero decomposition, which we discuss further in Section 2.6. The penalty on the joint structure is fixed to $\lambda = \sqrt{p} + \sqrt{n}$ by an analogous argument, as $\mathbf{E}$ is also a mean-zero Gaussian random matrix. Prior to estimating the decomposition, we scale the sources to have unit error variance by dividing each by its estimated error standard deviation, calculated with the median absolute deviation (MAD) estimator given in Equation 47 of Gavish and Donoho (2017).
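The soft singular value thresholding step can be sketched as follows. This is an illustrative implementation of the sub-problem solution, not the authors' code; `soft_svt` and the toy dimensions are our own choices. Applied to a pure-noise matrix with penalty $\lambda_s = \sqrt{n} + \sqrt{p_s}$, it shrinks (with high probability) all singular values to zero, which is the rank-selection behavior described above:

```python
import numpy as np

def soft_svt(X, lam):
    """Soft-threshold the singular values of X: shrink each by lam and
    zero out any that fall below it. Rank selection is a byproduct."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(2)
p_s, n = 100, 80
E = rng.normal(size=(p_s, n))        # pure noise with unit error variance
lam_s = np.sqrt(n) + np.sqrt(p_s)    # bound on the largest singular value of E

A_hat = soft_svt(E, lam_s)           # with high probability, shrunk to near zero
```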

2.4. Bayesian Simultaneous Factorization (BSF)

The resulting factorization from UNIFAC is also the mode of a Bayesian posterior that naturally extends the PMF model, which we leverage for our Bayesian Simultaneous Factorization (BSF) model. Theorem 1 in Park and Lock (2020) establishes the equivalence between minimizing the nuclear norm objective (4) and minimizing a similar objective with matrix-defined L2 penalties (Frobenius norms) on the scores and loadings:

$$\{\hat{\mathbf{U}}_s, \hat{\mathbf{V}}, \hat{\mathbf{W}}_s, \hat{\mathbf{V}}_s \mid s = 1, \dots, q\} = \operatorname*{arg\,min}_{\{\mathbf{U}_s, \mathbf{V}, \mathbf{W}_s, \mathbf{V}_s\}_{s=1}^{q}} \sum_{s=1}^{q}\|\mathbf{X}_s - \mathbf{U}_s\mathbf{V}^T - \mathbf{W}_s\mathbf{V}_s^T\|_F^2 + \lambda\left(\|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2\right) + \sum_{s=1}^{q}\lambda_s\left(\|\mathbf{W}_s\|_F^2 + \|\mathbf{V}_s\|_F^2\right) \quad (5)$$

Further, Equation 5 is proportional to the log-posterior for a Bayesian model with Gaussian errors and Gaussian priors on the scores and loadings:

$$\mathbf{X} \mid \mathbf{U}, \mathbf{V}, \mathbf{W}, \{\mathbf{V}_s\} \sim \prod_{s=1}^{q}\prod_{i=1}^{n}\prod_{j=1}^{p_s} \text{Normal}\!\left(\mathbf{X}_s[j,i] \mid \mathbf{U}_s[j,\cdot]\,\mathbf{V}[i,\cdot]^T + \mathbf{W}_s[j,\cdot]\,\mathbf{V}_s[i,\cdot]^T,\ 1\right), \quad (6)$$

and

$$\mathbf{U}_s[j,\cdot] \sim \text{Normal}\!\left(\mathbf{0}, \lambda^{-1}\mathbf{I}_{r\times r}\right), \quad \mathbf{V}[i,\cdot] \sim \text{Normal}\!\left(\mathbf{0}, \lambda^{-1}\mathbf{I}_{r\times r}\right), \quad \mathbf{W}_s[j,\cdot] \sim \text{Normal}\!\left(\mathbf{0}, \lambda_s^{-1}\mathbf{I}_{r_s\times r_s}\right), \quad\text{and}\quad \mathbf{V}_s[i,\cdot] \sim \text{Normal}\!\left(\mathbf{0}, \lambda_s^{-1}\mathbf{I}_{r_s\times r_s}\right) \quad (7)$$

Prior to model fitting, we row-center the features and scale the sources to have error variance 1 using the MAD estimator. Then, the prior variances for $\mathbf{U}$, $\mathbf{V}$, $\mathbf{V}_s$, and $\mathbf{W}_s$ are fixed at the penalties described in Section 2.3 so that the posterior mode of the decomposition of $\mathbf{X}$ matches the UNIFAC decomposition. We then apply the iterative soft singular value thresholding algorithm of UNIFAC to identify the posterior mode and initialize a Gibbs sampling algorithm to sample from the full posterior distributions of $\mathbf{U}$, $\mathbf{V}$, $\mathbf{W}_s$, and $\mathbf{V}_s$.

The factorization for each low-rank term in the decomposition (e.g., $\mathbf{U}\mathbf{V}^T$) corresponds to a PMF model with $\sigma_U^2 = \sigma_V^2$, but the connection to UNIFAC provides several advantages. First, note that equality of variances does not restrict the model: the scales of the loadings and scores are not independently identifiable (see Section 2.6), since only their product enters the likelihood. Moreover, these tuning parameters and the ranks are conveniently fixed via random matrix theory as discussed in Section 2.3: the tuning parameters are set to $\lambda = \sqrt{p} + \sqrt{n}$ and $\lambda_s = \sqrt{p_s} + \sqrt{n}$, and the ranks are induced by solving the UNIFAC objective with these penalties. As described in Salakhutdinov and Mnih (2008), the choice of tuning parameters can dramatically impact model results, and using cross-validation with multiple sources of data is not straightforward (Owen and Perry, 2009). Lastly, an efficient singular value thresholding algorithm can be used to find the mode, circumventing issues of convergence due to poor initialization in Gibbs sampling.

2.5. Bayesian Simultaneous Factorization and Prediction (BSFP)

Now, suppose we have a continuous phenotype $\mathbf{y}$ in addition to $\mathbf{X}_1, \dots, \mathbf{X}_q$ for a shared cohort of $n$ samples. We are interested in predicting $\mathbf{y}$ using the $q$ sources. We extend the BSF model to include prediction of $\mathbf{y}$ (referred to as BSFP) by assuming the following relationship between the factors, $\mathbf{V}$ and $\mathbf{V}_s$ for $s = 1, \dots, q$, and $\mathbf{y}$:

$$\mathbf{y} = \mathbf{V}_\star\boldsymbol{\beta} + \mathbf{e}_y = \beta_0 + \mathbf{V}\boldsymbol{\beta}_{\text{joint}} + \sum_{s=1}^{q}\mathbf{V}_s\boldsymbol{\beta}_{\text{indiv},s} + \mathbf{e}_y \quad\text{and}\quad \mathbf{e}_y \sim \text{Normal}\!\left(\mathbf{0}, \tau^2\mathbf{I}_{n\times n}\right), \quad (8)$$

where

$$\mathbf{V}_\star = \left(\mathbf{1}_n \;\; \mathbf{V} \;\; \mathbf{V}_1 \;\cdots\; \mathbf{V}_q\right) \quad\text{and}\quad \boldsymbol{\beta} = \left(\beta_0 \;\; \boldsymbol{\beta}_{\text{joint}}^T \;\; \boldsymbol{\beta}_{\text{indiv},1}^T \;\cdots\; \boldsymbol{\beta}_{\text{indiv},q}^T\right)^T \quad (9)$$

We use the following priors in our model for y:

$$\boldsymbol{\beta} \sim \text{Normal}\!\left(\mathbf{0}, \boldsymbol{\Sigma}_\beta\right) \quad\text{and}\quad \tau^2 \sim \text{Inverse-Gamma}(a, b), \quad (10)$$

where $\boldsymbol{\Sigma}_\beta = \mathrm{diag}\{\alpha_0^2, \alpha^2\mathbf{I}_{r\times r}, \alpha^2\mathbf{I}_{r_1\times r_1}, \dots, \alpha^2\mathbf{I}_{r_q\times r_q}\}$ and the hyperparameters $a$, $b$, $\alpha_0^2$, and $\alpha^2$ are fixed constants. In our study of HIV-associated OLD (Section 4), we fixed $\alpha_0^2 = 1000^2$ and $\alpha^2 = 1$; $\alpha_0^2$ was chosen to be uninformative and $\alpha^2$ was chosen to reflect the anticipated effect size of the factors. This model may be modified to suit the characteristics of a given outcome; for example, we have implemented an analogous model with a probit link for a binary outcome and applied it to gene expression, methylation, and miRNA data from the Cancer Genome Atlas in our Supplementary Materials (see Sections 2 and 5). We initialize as in the BSF model, and infer the full posterior for $\mathbf{U}$, $\mathbf{V}$, $\mathbf{W}_s$, $\mathbf{V}_s$, $\tau^2$, and $\boldsymbol{\beta}$ via Gibbs sampling, which we describe in more detail in the Supplement. Inferring the factorization and prediction model simultaneously confers two advantages. First, it incorporates supervision by $\mathbf{y}$ for the latent structures, which may yield more phenotypically-relevant factors. Second, posterior uncertainty in these underlying factors is propagated through to the predictive model.
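For intuition, the conjugate full conditional for the regression coefficients in a Bayesian linear model of this form can be sketched as below. This is the generic normal-normal update under the stated priors, not the authors' implementation; `beta_full_conditional` and the toy dimensions are hypothetical:

```python
import numpy as np

def beta_full_conditional(V_star, y, tau2, Sigma_beta):
    """Mean and covariance of the Gaussian full conditional for beta in
    y = V_star @ beta + e, with e ~ N(0, tau2 I) and beta ~ N(0, Sigma_beta)."""
    prec = V_star.T @ V_star / tau2 + np.linalg.inv(Sigma_beta)
    cov = np.linalg.inv(prec)
    mean = cov @ (V_star.T @ y) / tau2
    return mean, cov

# Toy usage: n = 6 samples, intercept + 2 joint factors + 1 individual factor.
rng = np.random.default_rng(3)
V = rng.normal(size=(6, 2))
V1 = rng.normal(size=(6, 1))
V_star = np.column_stack([np.ones(6), V, V1])       # (1_n  V  V_1)

alpha0_sq, alpha_sq = 1000.0 ** 2, 1.0
Sigma_beta = np.diag([alpha0_sq] + [alpha_sq] * 3)  # diag{alpha_0^2, alpha^2 I}

y = rng.normal(size=6)
mean, cov = beta_full_conditional(V_star, y, tau2=1.0, Sigma_beta=Sigma_beta)
beta_draw = rng.multivariate_normal(mean, cov)      # one Gibbs draw of beta
```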

We treat $\mathbf{y}$ distinctly from the other $q$ sources because, as a vector, it does not have low-rank structure. We estimate the error variance in $\mathbf{y}$, $\tau^2$, explicitly, while we fix the error variance in $\mathbf{X}$ to 1 after scaling. Simulations showed that initializing with $\mathbf{y}$ and scaling $\mathbf{y}$ by its estimated error standard deviation, as is done with each $\mathbf{X}_s$, yielded no improvement in recovery of the underlying structure or in prediction (see Supplementary Materials Section 3.3 for more information).

2.6. Identifiability

The joint and individual structures $\mathbf{J}$ and $\mathbf{A}$ in the posterior mode of the decomposition of $\mathbf{X}$, which matches the solution to the nuclear-norm penalized objective in Equation 4, are uniquely identified by Theorem 1 of Lock et al. (2022). This theorem provides sufficient conditions for identifiability: the decomposition must minimize Equation 4 with $\lambda$ and $\lambda_s$ fixed as in Section 2.3, and for each source $s = 1, \dots, q$, the columns of the loadings and scores in the joint and individual structures must be linearly independent. In general, the decomposition is not unique, as it is possible that $\mathbf{J}^{(i)} + \mathbf{A}^{(i)} = \mathbf{J}^{(j)} + \mathbf{A}^{(j)}$ with $\mathbf{J}^{(i)} \neq \mathbf{J}^{(j)}$ and $\mathbf{A}^{(i)} \neq \mathbf{A}^{(j)}$; however, posterior samples will favor more efficient decompositions (with respect to the regularization induced by $\lambda$ and $\lambda_s$) that concentrate near the unique posterior mode.

A challenge in Bayesian factor models like ours is that the loadings and scores are not identifiable due to rotation, permutation, and sign invariance. Under rotation invariance, $\mathbf{U}\mathbf{V}^T$, for example, is unchanged if we right-multiply $\mathbf{U}$ and $\mathbf{V}$ by an orthogonal $r \times r$ matrix $\mathbf{P}$, i.e., $\mathbf{U}\mathbf{P}(\mathbf{V}\mathbf{P})^T = \mathbf{U}\mathbf{P}\mathbf{P}^T\mathbf{V}^T = \mathbf{U}\mathbf{V}^T$. Under permutation invariance, the columns of $\mathbf{U}$ and $\mathbf{V}$ can be reordered and yield the same decomposition. Likewise, under sign invariance, the signs of the columns of $\mathbf{U}$ and $\mathbf{V}$ can be flipped. Rotation, permutation, and sign invariance obstruct interpretation of posterior summaries of the Gibbs samples for $\mathbf{U}$, $\mathbf{V}$, $\mathbf{W}_s$, $\mathbf{V}_s$, and $\boldsymbol{\beta}$. To address all three sources of non-identifiability, we use the MatchAlign algorithm (Poworoznek et al., 2021), which first orthogonalizes the loadings so that each observed feature loads onto one or a few factors. Then, using a greedy matching algorithm and a pivot, factors are iteratively matched to the positively- or negatively-signed pivot columns for which the $L_2$-normed difference is minimized.

We now describe our adaptation of the MatchAlign algorithm to accommodate multiple structures and our predictive model. The algorithm proceeds as follows: at each Gibbs sampling iteration after burn-in, $t = T_{\text{burn-in}}, \dots, T$, define $\mathbf{U}_\beta^{(t)} = (\mathbf{U}^{(t)T} \;\; \boldsymbol{\beta}_{\text{joint}}^{(t)})^T$ and $\mathbf{W}_{s,\beta}^{(t)} = (\mathbf{W}_s^{(t)T} \;\; \boldsymbol{\beta}_{\text{indiv},s}^{(t)})^T$ for $s = 1, \dots, q$, appending the regression coefficients to the loadings. For each $t$, we apply a Varimax rotation, yielding rotated $\mathbf{U}_\beta^{(t)}$ and $\mathbf{W}_{s,\beta}^{(t)}$ for $s = 1, \dots, q$. The resulting rotation is also applied to the scores, yielding rotated $\mathbf{V}^{(t)}$ and $\mathbf{V}_s^{(t)}$ for $s = 1, \dots, q$. As the pivot, we use $\mathbf{U}_\beta^{(0)} = (\mathbf{U}^{(0)T} \;\; \boldsymbol{\beta}_{\text{joint}}^{(0)})^T$ and $\mathbf{W}_{s,\beta}^{(0)} = (\mathbf{W}_s^{(0)T} \;\; \boldsymbol{\beta}_{\text{indiv},s}^{(0)})^T$, where $\mathbf{U}^{(0)}$, $\boldsymbol{\beta}_{\text{joint}}^{(0)}$, $\mathbf{W}_s^{(0)}$, and $\boldsymbol{\beta}_{\text{indiv},s}^{(0)}$ are taken from the post-burn-in posterior sample with the median condition number, defined here as the sample with the median largest singular value (Poworoznek et al., 2021). Taking $\mathbf{U}_\beta^{(t)}$ as an example, we start with the column with the largest norm and calculate the normed difference between it and each pivot column under positive and negative signage, $\mathbf{U}_\beta^{(0)}[\cdot,k]$ and $-\mathbf{U}_\beta^{(0)}[\cdot,k]$, $k = 1, \dots, r$. We match this column in $\mathbf{U}_\beta^{(t)}$ to the signed pivot column $k$ that yields the smallest normed difference, and repeat for the remaining columns. We proceed similarly with $\mathbf{W}_{s,\beta}^{(t)}$ for $s = 1, \dots, q$. The scores are permuted and signed appropriately to match. For more information on the performance of the MatchAlign algorithm, we refer readers to Poworoznek et al. (2021).
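The greedy matching step can be illustrated with a simplified sketch that permutes and sign-flips the columns of a loadings draw to match a pivot. This omits the Varimax rotation and is our own illustration, not the MatchAlign implementation; `greedy_match_columns` is a hypothetical name:

```python
import numpy as np

def greedy_match_columns(L, pivot):
    """Greedily assign each column of L (largest norm first) to the
    positively- or negatively-signed pivot column minimizing the L2
    difference. Returns the matched pivot indices and the signs."""
    r = L.shape[1]
    remaining = list(range(r))
    perm = np.empty(r, dtype=int)
    signs = np.empty(r)
    order = np.argsort(-np.linalg.norm(L, axis=0))  # largest-norm column first
    for j in order:
        best = min(
            ((k, s) for k in remaining for s in (1.0, -1.0)),
            key=lambda ks: np.linalg.norm(L[:, j] - ks[1] * pivot[:, ks[0]]),
        )
        perm[j], signs[j] = best
        remaining.remove(best[0])
    return perm, signs

# Usage: a sign-flipped, permuted copy of the pivot is recovered exactly.
rng = np.random.default_rng(4)
pivot = rng.normal(size=(20, 3))
L = pivot[:, [2, 0, 1]] * np.array([-1.0, 1.0, -1.0])
perm, signs = greedy_match_columns(L, pivot)
aligned = L * signs   # column j of aligned equals pivot column perm[j]
```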

2.7. Multiple Imputation

The Gibbs sampling algorithm for estimating the posterior of the latent variation structure and regression coefficients in BSFP naturally accommodates multiple imputation of missing values in X and y. We use the most up-to-date posterior samples to impute missing values at each iteration of the sampler. This can be done even when an entire sample is unobserved in a source (referred to as “blockwise” missingness). Imputations are generated from the posterior predictive distribution and can be studied using standard posterior summaries.

Let $\mathcal{I}_s^{(m)} = \{(j,i) : \mathbf{X}_s[j,i]\ \text{missing}\}$ denote the set of bivariate indices for which entries in source $\mathbf{X}_s$ are not observed. At the $t$th iteration of the Gibbs sampler, we impute these entries by randomly generating, for $(j,i) \in \mathcal{I}_s^{(m)}$:

$$\mathbf{X}_s[j,i] = \mathbf{U}_s^{(t)}[j,\cdot]\,\mathbf{V}^{(t)}[i,\cdot]^T + \mathbf{W}_s^{(t)}[j,\cdot]\,\mathbf{V}_s^{(t)}[i,\cdot]^T + \mathbf{E}_s[j,i], \quad (11)$$

where $\mathbf{E}_s[j,i] \sim \text{Normal}(0, 1)$. Missing values in $\mathbf{y}$ may be imputed in a similar manner. Let $\mathcal{I}_y^{(m)} = \{i : y_i\ \text{missing}\}$. For $i \in \mathcal{I}_y^{(m)}$, we may impute these values at iteration $t$ using:

$$y_i = \mathbf{V}_\star^{(t)}[i,\cdot]\,\boldsymbol{\beta}^{(t)} + e_i, \quad (12)$$

where $e_i \sim \text{Normal}(0, \tau^{2(t)})$.
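The imputation step in Equation 11 amounts to filling each missing entry with the current draw of the structure plus fresh unit-variance noise. A minimal numpy sketch, with stand-in values playing the role of the iteration-$t$ Gibbs draws rather than output from the actual sampler:

```python
import numpy as np

rng = np.random.default_rng(5)
p_s, n, r, r_s = 8, 6, 2, 1

# Stand-ins for the Gibbs draws of loadings and scores at iteration t.
U_s = rng.normal(size=(p_s, r)); V = rng.normal(size=(n, r))
W_s = rng.normal(size=(p_s, r_s)); V_s = rng.normal(size=(n, r_s))

X_s = U_s @ V.T + W_s @ V_s.T + rng.normal(size=(p_s, n))
missing = rng.random(size=(p_s, n)) < 0.1   # entrywise missingness indicator
X_s[missing] = np.nan                       # pretend these entries were unobserved

# Impute from the posterior predictive: current structure plus fresh N(0, 1) noise.
structure = U_s @ V.T + W_s @ V_s.T
X_s[missing] = structure[missing] + rng.normal(size=missing.sum())
```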

3. Simulations

We consider two simulation studies to characterize the performance of BSF and BSFP. In Section 3.1, we compare BSFP to existing one- and two-step approaches for factorization and prediction. In Section 3.2, we compare BSF to existing single-imputation approaches using simulated multi-source datasets.

3.1. Model Comparison

We examine the ability of BSFP to recover latent variation structure and predict a continuous outcome under varying levels of signal-to-noise (s2n). We define s2n as the ratio of the variance in the latent structure ($\mathbf{J} + \mathbf{A}$) to the variance in the error ($\mathbf{E}$), i.e., $\text{s2n} = \mathrm{Var}(\mathrm{vec}(\mathbf{J}) + \mathrm{vec}(\mathbf{A}))/\mathrm{Var}(\mathrm{vec}(\mathbf{E}))$. We generated $q = 2$ sources of data with 100 features each on $n = 200$ samples, which were then split into a training and test set of 100 samples apiece, denoted $\mathbf{X}_{\text{train}}$, $\mathbf{y}_{\text{train}}$, $\mathbf{X}_{\text{test}}$, and $\mathbf{y}_{\text{test}}$. The true overall rank of the latent structure, the sum of the ranks of the joint and individual structures, was 3, where $r = 1$ and $r_s = 1$ for $s = 1, 2$. We generated $\mathbf{X}$ according to the decomposition in Equation 6, where the entries of $\mathbf{U}$, $\mathbf{V}$, $\mathbf{V}_s$, and $\mathbf{W}_s$ were generated iid from a Normal(0,1) distribution. We generated $\mathbf{y}$ from the model in Equation 8, where the intercept was generated from a Normal(0,10) distribution and $\boldsymbol{\beta}_{\text{joint}}$, $\boldsymbol{\beta}_{\text{indiv},s}$ were generated iid Normal(0,1). Random noise in $\mathbf{X}$ and $\mathbf{y}$ was generated iid Normal(0,1). We scaled $\mathbf{X}$ and $\mathbf{y}$ to have s2n ratios 9, 3, 1, and 1/3, and considered all 16 combinations of s2n. These s2n levels were chosen to reflect realistic signal levels in omics data. In practice, we expect an average s2n of approximately 1, and the levels 9, 3, and 1/3 help illustrate how the methods perform with especially-high and especially-low signal in the data.
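The generative design for a single replication (before the s2n rescaling step, which we omit here) can be sketched as follows. The seed and variable names are our own, and we treat the intercept's Normal(0,10) as having variance 10:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p_s, q = 200, 100, 2
r, r_s = 1, 1                          # joint and individual ranks

V = rng.normal(size=(n, r))            # joint scores shared by all sources
sources, indiv_scores = [], []
for s in range(q):
    U_s = rng.normal(size=(p_s, r))    # joint loadings for source s
    W_s = rng.normal(size=(p_s, r_s))  # individual loadings
    V_s = rng.normal(size=(n, r_s))    # individual scores
    E_s = rng.normal(size=(p_s, n))    # iid Normal(0, 1) noise
    sources.append(U_s @ V.T + W_s @ V_s.T + E_s)
    indiv_scores.append(V_s)

# Outcome from the Eq. 8 form: intercept ~ Normal(0, 10), coefficients iid Normal(0, 1).
beta0 = rng.normal(0.0, np.sqrt(10.0))
b_joint = rng.normal(size=r)
b_indiv = [rng.normal(size=r_s) for _ in range(q)]
y = (beta0 + V @ b_joint
     + sum(Vs @ b for Vs, b in zip(indiv_scores, b_indiv))
     + rng.normal(size=n))

X = np.vstack(sources)                 # column-concatenated full data, 200 x 200
```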

We considered UNIFAC, JIVE (Lock et al., 2013), MOFA (Argelaguet et al., 2018), sJIVE (Palzer et al., 2022), BIP (Chekouo and Safo, 2023), multiview (Ding et al., 2022), and IntegratedLearner (Mallick et al., 2023) as alternative methods. BIP and sJIVE perform simultaneous factorization and prediction and are the most natural comparisons to BSFP. For UNIFAC, JIVE, and MOFA, which do not directly accommodate prediction, we treated the estimated factors as fixed covariates in a Bayesian linear model for y as described in Section 2.5, following the two-step approach described in Section 1. UNIFAC was also an important comparison, as it matches the posterior mode of the data-decomposition in BSFP but error is not propagated from the estimated factorization to prediction. The multiview method does not estimate an underlying factorization to use within its predictive model. Instead, it uses the observed features in a predictive model for each source and encourages “agreement” between the predictions across datasets. The IntegratedLearner framework also does not estimate an underlying factorization, but constructs a weighted average of predicted outcomes based on each source using a user-selected learner. We considered Bayesian additive regression trees (BARTs) and the lasso as the per-source learners and aggregated the predictions using stacking and weights estimated by non-negative least squares.

We compared how well the models recovered the underlying structure in $\mathbf{X}$ and how well each predicted a held-out $\mathbf{y}$ on the test data. The proposed model was fit on the full training and test $\mathbf{X}$ with access to $\mathbf{y}_{\text{train}}$ only. We used UNIFAC, JIVE, and MOFA to estimate the underlying structure on the full $\mathbf{X}$. The scores corresponding to the test set were then used as covariates to predict $\mathbf{y}_{\text{test}}$. sJIVE and BIP were trained on $\mathbf{X}_{\text{train}}$ and $\mathbf{y}_{\text{train}}$, as these methods do not accommodate prediction of unobserved outcomes. We then used the estimated loadings to predict $\mathbf{y}_{\text{test}}$ using $\mathbf{X}_{\text{test}}$. We ran each model under each s2n combination for 100 replications and compared their recovery of the underlying structure using the relative squared error (RSE):

$$\text{RSE}(\mathbf{S}, \hat{\mathbf{S}}_{\text{mod}}) = \frac{\|\mathbf{S} - \hat{\mathbf{S}}_{\text{mod}}\|_F^2}{\|\mathbf{S}\|_F^2}, \quad (13)$$

where $\mathbf{S} \in \{\mathbf{J}, \mathbf{A}\}$ reflects the true joint or individual structure, respectively, and $\hat{\mathbf{S}}_{\text{mod}} \in \{\hat{\mathbf{J}}_{\text{mod}}, \hat{\mathbf{A}}_{\text{mod}}\}$ reflects the estimated joint or individual structure from model mod. An RSE close to 0 suggests better performance. With the exception of UNIFAC, we also consider each method with ranks fixed at the truth. Since the multiview and IntegratedLearner methods do not estimate latent structure, we do not include them in our discussion of how well the compared methods recover underlying structure. We calculate coverage of the truth for BSFP, UNIFAC, JIVE, MOFA, and IntegratedLearner with BARTs using 95% credible intervals. We do not calculate coverage for sJIVE, BIP, multiview, and IntegratedLearner with lasso because these methods do not offer full posterior inference for the predictive model. In the Supplementary Materials, we include rank selection results for BSFP, UNIFAC, sJIVE, MOFA, JIVE, and BIP. Additional discussion on rank selection in integrative factorization is given in Palzer et al. (2022) and Gaynanova and Li (2019).
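Equation 13 is straightforward to compute; a minimal sketch:

```python
import numpy as np

def rse(S_true, S_hat):
    """Relative squared error, ||S - S_hat||_F^2 / ||S||_F^2 (Eq. 13).
    Values near 0 indicate better recovery of the structure."""
    return np.sum((S_true - S_hat) ** 2) / np.sum(S_true ** 2)

# A perfect estimate has RSE 0; estimating everything as zero gives RSE 1.
S = np.arange(6.0).reshape(2, 3)
```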

The RSEs averaged across simulation replications for recovery of the underlying joint and individual structures are shown in Figure 1 on a log scale. BSFP, UNIFAC, MOFA, BIP, and sJIVE all performed similarly well in recovering the joint and individual structure, even compared to their performances given the true ranks. With high signal, JIVE overestimated the ranks, leading to poor estimation of the structure. The average RSE for recovery of $E(\mathbf{y}_{\text{test}} \mid \mathbf{X})$ is shown in Figure 2 on a log scale for s2n ratios of 9 and 1/3. All models performed comparably well, with BSFP, BIP, and sJIVE showing the smallest median RSE and lowest variability under high signal in $\mathbf{X}$ and $\mathbf{y}$. This suggests a unique benefit to one-step approaches under certain conditions, though BSFP and the comparison approaches provided good prediction accuracy across a range of signal levels in the data and response. The multiview method was likely less competitive across these conditions because it does not explicitly estimate an underlying factorization. While the one-step approaches to factorization and prediction yielded prediction accuracy comparable to existing two-step approaches, the greatest benefit of a one-step approach is illustrated most clearly below.

Figure 1:

Comparing BSFP to existing methods on how well each recovers underlying joint and individual structure based on relative squared error (RSE). RSE values closer to 0 reflect better performance. Each panel reflects a different signal-to-noise (s2n) ratio in X. We do not differentiate according to s2n in y as results did not vary under differing levels of signal in the response. For each method and s2n, a boxplot shows the distribution of RSE values across replications; the middle line of the boxplot is the median RSE, and the upper and lower edges give the interquartile range.

Figure 2:

Comparing BSFP to existing methods on how well each recovers $E(\mathbf{y}_{\text{test}} \mid \mathbf{X})$ based on relative squared error (RSE). RSE values closer to 0 reflect better performance. Each panel reflects a different signal-to-noise (s2n) ratio in $\mathbf{X}$ and $\mathbf{y}_{\text{test}}$. We select only the highest and lowest s2n ratios for space considerations. For each method and s2n, a boxplot shows the distribution of RSE values across replications; the middle line of the boxplot is the median RSE, and the upper and lower edges give the interquartile range.

The most important benefit of BSFP is apparent when studying posterior coverage of $E(\mathbf{y}_{\text{test}} \mid \mathbf{X})$, shown in Figure 3 for s2n ratios of 9 and 1/3. We expect to see coverage around 95%, which BSFP achieves across signal-to-noise levels while UNIFAC, JIVE, and MOFA fall short, even under high signal in the data and outcome. This is because BSFP propagates error from estimation of the underlying structure to the predictive model, yielding nominal coverage rates. This advantage is especially apparent when the signal in $\mathbf{X}$ is low, leading to higher uncertainty in estimating the structure. BSFP marginalizes over this uncertainty when estimating the posterior distribution for $E(\mathbf{y}_{\text{test}} \mid \mathbf{X})$. IntegratedLearner with BARTs also has this advantage, though instead of propagating uncertainty from the estimated latent structure, it propagates uncertainty from the underlying regression trees to the ensemble predictions. UNIFAC, JIVE, and MOFA treat the identified factors as fixed, assuming there is no associated uncertainty. This affects our confidence in the predicted values for $\mathbf{y}_{\text{test}}$ and can affect posterior inference for the estimated regression coefficients. Despite similar performance in recovery of the underlying structure and prediction accuracy for $\mathbf{y}_{\text{test}}$, posterior inference on $E(\mathbf{y}_{\text{test}} \mid \mathbf{X})$ is unreliable when uncertainty in the estimated factorization is not properly accounted for. This suggests BSFP provides comprehensive posterior inference across varying levels of signal in the data and response. We show here results for the highest and lowest s2n ratios for space considerations but provide results for all s2n ratios in the Supplementary Materials.

Figure 3:

Comparing BSFP to existing methods on coverage of $E(\mathbf{y}_{\text{test}} \mid \mathbf{X})$ under the posterior. Each panel reflects a different signal-to-noise (s2n) ratio in $\mathbf{X}$ and $\mathbf{y}_{\text{test}}$. Coverage was assessed using 95% credible intervals. We select only the highest and lowest s2n ratios for space considerations. For each method and s2n, a boxplot shows the distribution of coverage values across replications; the middle line of the boxplot is the median coverage, and the upper and lower edges give the interquartile range.

3.2. Missing Data Imputation

We then studied the imputation performance of BSF. Here, we do not consider prediction of an outcome and instead focus on how well each method imputes observations randomly removed from each X_s. We generated q = 2 sources of data, measured on n = 100 samples, each with p_s = 100 features, for s = 1, 2. Data were generated in the same manner as described in Section 3.1. We considered three types of missingness: (1) entrywise, in which 10% of entries in each source were randomly removed; (2) blockwise, in which 10 non-overlapping samples from each source were randomly removed; and (3) missing-not-at-random (MNAR), in which the lowest 10% of values in each source were removed to reflect a global, source-specific limit of detection. We varied the s2n across 9, 3, 1, and 1/3. We considered two settings for the true ranks: in setting 1, the overall rank was 15, with joint rank r = 5 and individual ranks r_s = 5 for s = 1, 2; in setting 2, the overall rank was 3, as in Section 3.1. We focus on results from setting 1 and provide a discussion of setting 2 in the Supplement. For each condition (missingness type, s2n, and rank of the underlying structure), we considered 100 replications and averaged the results across sources and replications.
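A generic version of this simulation design can be sketched as below. This is a hypothetical generation scheme in numpy, not the exact scheme of Section 3.1: joint structure arises from scores shared across sources, individual structure from source-specific factors, noise is scaled to hit a target s2n (ratio of squared Frobenius norms), and two of the missingness patterns are shown.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r_joint, r_indiv, s2n = 100, 100, 5, 5, 9.0

V_joint = rng.normal(size=(r_joint, n))  # scores shared by both sources
sources = []
for _ in range(2):
    joint = rng.normal(size=(p, r_joint)) @ V_joint        # joint structure
    indiv = rng.normal(size=(p, r_indiv)) @ rng.normal(size=(r_indiv, n))
    signal = joint + indiv
    noise = rng.normal(size=(p, n))
    # scale noise so that ||signal||_F^2 / ||noise||_F^2 equals s2n
    noise *= np.linalg.norm(signal) / (np.sqrt(s2n) * np.linalg.norm(noise))
    sources.append(signal + noise)
X1, X2 = sources

# entrywise missingness: remove ~10% of entries at random
entry_mask = rng.random(X1.shape) < 0.10
# blockwise missingness: remove all features for 10 randomly chosen samples
block_cols = rng.choice(n, size=10, replace=False)
```

The MNAR pattern would instead mask the lowest 10% of values in each source, mimicking a limit of detection.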

We compared BSF to five other single-imputation approaches, including mean imputation and UNIFAC. We also compared to the iterative SVD imputation algorithm given by Fuentes et al. (2006) (referred to as the SVD) with a rank of 4, k-nearest neighbors (kNN) (Kowarik and Templ, 2016), and random forest (RF) (Stekhoven and Bühlmann, 2012). We applied the SVD, kNN, and RF to each source separately and to the sources combined. We evaluated performance using the relative squared error (RSE) of the imputed values compared to the true unobserved values, i.e.,

RSE(X_s, X̂_s^mod) = || {X_s[j,i] : (j,i) ∈ I_s^(m)} − {X̂_s^mod[j,i] : (j,i) ∈ I_s^(m)} ||_F^2 / || {X_s[j,i] : (j,i) ∈ I_s^(m)} ||_F^2,

where X̂_s^mod[j,i] denotes the imputation for X_s[j,i] given by model mod and I_s^(m) indexes the missing entries in source s. For BSF, the RSE was calculated using the posterior mean of the imputed values across Gibbs samples. In the Supplement, we discuss uncertainty in multiple imputation when applying BSF under different missingness mechanisms and signal levels.
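The RSE criterion can be written as a short helper, sketched here in numpy under the notation above (the mean-imputation baseline is only a sanity check, not any of the compared methods):

```python
import numpy as np

def rse(X_true, X_imputed, mask):
    """Relative squared error over the held-out entries flagged by mask."""
    diff = X_true[mask] - X_imputed[mask]
    return np.sum(diff ** 2) / np.sum(X_true[mask] ** 2)

# Sanity check: imputing the mean of the observed entries of roughly
# centered data should give an RSE near 1, matching the mean-imputation
# rows of Table 1.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 100))
mask = rng.random(X.shape) < 0.10            # ~10% entrywise missingness
X_imp = np.where(mask, X[~mask].mean(), X)   # crude single imputation
print(rse(X, X_imp, mask))                   # close to 1
```

An RSE below 1 thus indicates the method recovers more of the held-out values than a constant-mean baseline would.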

Imputation accuracy results under an assumed overall rank of 15 (i.e., r=r1=r2=5) are shown in Table 1. Models were compared using a t-test to assess whether performance was significantly different at the 0.05 level within each condition. Under high signal (s2n=9 and 3), both BSF and UNIFAC excel at predicting the unobserved values, with RF closely following in terms of performance.

Table 1:

Average RSEs for imputing missing values under varying levels of signal and missingness mechanisms under an assumed rank of 15. Bolded values correspond to the best-performing method under the given condition. If more than one value is bolded, model performance was not significantly different from the best-performing model.

                       s2n = 9                s2n = 3                s2n = 1                s2n = 0.333
Model             Entry  Block  MNAR     Entry  Block  MNAR     Entry  Block  MNAR     Entry  Block  MNAR

BSF               0.226  0.638  0.572    0.460  0.764  0.871    0.790  0.917  1.361    0.977  0.992  1.262
UNIFAC            0.219  0.663  0.542    0.435  0.779  0.806    0.740  0.914  1.335    0.951  0.986  1.279
Mean Imputation   1.011  1.012  1.426    1.011  1.012  1.418    1.011  1.011  1.654    1.011  1.011  1.491
SVD (Combined)    0.634  0.992  0.835    0.711  1.023  0.983    0.840  1.077  1.394    0.980  1.166  1.403
SVD (Separate)    0.559    -    0.610    0.653    -    0.816    0.815    -    1.354    0.993    -    1.405
kNN (Combined)    0.729  1.022  0.889    0.854  1.095  0.986    1.050  1.200  1.388    1.213  1.274  1.410
kNN (Separate)    0.685  1.529  0.840    0.830  1.461  0.957    1.045  1.379  1.388    1.216  1.334  1.411
RF (Combined)     0.448  0.773  0.950    0.602  0.840  1.095    0.816  0.936  1.564    0.968  1.004  1.491
RF (Separate)     0.396  1.070  0.872    0.559  1.054  1.036    0.790  1.038  1.539    0.965  1.031  1.487

Entry = entrywise missingness; Block = blockwise missingness. A dash (-) indicates the SVD algorithm would not run for that condition.

Under blockwise missingness, methods that do not estimate an underlying factorization and methods applied to each source separately are not expected to perform well, as there is no information available to predict the missing samples. In fact, the SVD algorithm would not run under this condition, denoted by blanks in Table 1. Under high signal, BSF, UNIFAC, and RF applied to the sources combined were the only methods yielding RSEs below 1 as they were able to impute values other than 0. As in the entrywise missingness case, these gains in performance dissipated as the signal in the data decreased.

Under MNAR and high signal, BSF, UNIFAC, and SVD applied to each source separately provided the best predictive accuracy. This is reasonable given recent results supporting the use of low-rank factorization methods to impute missing values under the MNAR assumption (Wang et al., 2021), and underscores the non-viability of methods like kNN and RF in this case.

4. Application to HIV-OLD Study

We applied BSFP to metabolomic and proteomic data collected from the Vancouver and Pittsburgh lung cohorts as part of a matched case-control study on HIV-associated obstructive lung disease (OLD) (Cribbs et al., 2016; Akata et al., 2022). We are interested in predicting lung function based on the metabolome and proteome in bronchoalveolar lavage fluid (BALF). In this study, 26 cases (those with OLD) were matched to 26 non-OLD controls based on age, antiretroviral treatment status, and smoking status. Lung function was measured as percent-predicted forced expiratory volume in one second (FEV1pp). Our dataset contained 252 BALF metabolites and 4253 BALF proteins, which we used to predict FEV1pp. Prior to model fitting, we log-transformed the features in both datasets to normalize their distributions.

We first compared BSFP to sJIVE, JIVE, UNIFAC, MOFA, multiview, and IntegratedLearner under cross-validation in which we iteratively held out FEV1pp for each case-control pair. Due to computational barriers, we were unable to fit BIP. We fit BSFP using 5000 posterior samples with a 2500-sample burn-in, which was deemed sufficient based on monitoring of trace plots and the log-joint likelihood of the model. Each model was fit in the same manner as in Section 3. We also considered the lasso regression model (Tibshirani, 1996) on the combined metabolomic and proteomic data and on each source separately. To fit the lasso model, we held out the FEV1pp values and the metabolomic and/or proteomic observations for each case-control pair and trained the model on the remaining samples. We compared methods using the correlation between the predicted and true held-out FEV1pp values. For BSFP, we used the posterior mean of the predicted FEV1pp values. The correlations and their p-values given by a Pearson correlation test were as follows: BSFP (0.4847, p = 0.0003), UNIFAC (0.4787, p = 0.0003), sJIVE (0.4425, p = 0.0010), IntegratedLearner (stacked BART) (0.4299, p = 0.0015), lasso (combined sources) (0.4029, p = 0.0031), multiview (0.4084, p = 0.0027), lasso (metabolite only) (0.3972, p = 0.0035), IntegratedLearner (stacked lasso) (0.3808, p = 0.0054), MOFA (0.3719, p = 0.0066), lasso (protein only) (0.3157, p = 0.0226), and JIVE (0.3131, p = 0.0238). BSFP and UNIFAC had comparable predictive performance, though BSFP has the advantage of providing a framework for uncertainty in the estimated factorization. In addition, models that distinguished between the joint and individual factors generally performed better, and considering both sources yielded better prediction than using only the metabolomic or proteomic data.
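The leave-pair-out cross-validation scheme above can be sketched generically as follows. This is a hypothetical harness in numpy with a plain least-squares stand-in for the models compared; the `fit`/`predict` callbacks are ours, not any package's API.

```python
import numpy as np

def leave_pair_out_cv(X, y, pairs, fit, predict):
    """Hold out each case-control pair, train on the rest, and return the
    Pearson correlation between held-out predictions and observed y."""
    preds = np.empty_like(y, dtype=float)
    for i, j in pairs:
        train = np.setdiff1d(np.arange(len(y)), [i, j])
        model = fit(X[train], y[train])
        preds[[i, j]] = predict(model, X[[i, j]])
    return np.corrcoef(preds, y)[0, 1]

# Toy run with 26 pairs and an ordinary-least-squares stand-in model
rng = np.random.default_rng(3)
X = rng.normal(size=(52, 10))
y = X @ rng.normal(size=10) + rng.normal(size=52)
pairs = [(2 * k, 2 * k + 1) for k in range(26)]
ols_fit = lambda Xtr, ytr: np.linalg.lstsq(Xtr, ytr, rcond=None)[0]
ols_pred = lambda beta, Xte: Xte @ beta
r = leave_pair_out_cv(X, y, pairs, ols_fit, ols_pred)
```

Holding out both members of a matched pair at once prevents the case-control matching from leaking information into the held-out prediction.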

We now focus on the results from BSFP, which identified 14 joint, 11 metabolite-specific, and 16 protein-specific factors, for a total rank of 41. It took approximately 2.3 hours to fit BSFP to this dataset on a single core of a Linux desktop with a 4.0 GHz i7-6900K processor. We visualize the estimated structures via heatmap (Figure 4), where the samples (columns) are ordered by FEV1pp. The joint structure, which explained 17.7% of variation in the metabolome (95% CI: 16.3%, 18.8%) and 21.3% in the proteome (95% CI: 19.9%, 22.3%), reveals a sample cluster driven by shared metabolomic and proteomic expression, highlighted in orange. The individual structures explained 54.6% (53.6%, 55.5%) of variation in the metabolome and 61.8% (60.9%, 62.6%) in the proteome. The proportion of variance in FEV1pp explained by the joint factors was 2.3% (0.4%, 6.5%), while the metabolomic factors explained 1.2% (0.3%, 3.0%) and the proteomic factors explained 6.3% (1.5%, 14.5%). We visualize the fitted vs. observed FEV1pp and the associated uncertainty in the fitted values in Figure 5. Observed FEV1pp was heterogeneous, ranging from 21 to 128% of predicted normal. The fitted FEV1pp ranged from 73 to 93% of predicted normal, reflecting the challenge of capturing this heterogeneity in a small sample; the posterior 95% credible intervals reflect this uncertainty. The seven samples colored in orange in Figure 5 correspond to those which clustered together based on shared proteomic and metabolomic patterns. Clustering was done using k-means with k = 2 on the joint structure across the posterior sampling iterations; our choice of k was motivated by previous analyses of data from this study, in which k-means clustering with k = 2 separated this unusual cluster from the rest of the cohort (Samorodnitsky et al., 2023). These seven samples clustered together in over 90% of posterior sampling iterations and had lower fitted FEV1pp than the other 45 samples. This cluster reflects a multi-omic molecular subtype in the lung that is associated with poor lung function; further molecular associations with FEV1pp remain uncertain.
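The cluster-stability check (how often samples co-cluster across posterior draws of the joint structure) can be sketched as below. This is a minimal Lloyd's-algorithm k-means in numpy applied to each draw of the joint scores; function and variable names are ours, not the BSFP package's.

```python
import numpy as np

def cluster_cooccurrence(joint_scores_draws, k=2, iters=50, seed=0):
    """For each posterior draw of joint scores, run a small k-means and
    count how often each pair of samples lands in the same cluster."""
    rng = np.random.default_rng(seed)
    n = joint_scores_draws[0].shape[0]
    co = np.zeros((n, n))
    for S in joint_scores_draws:  # S : (n_samples, rank) score matrix
        centers = S[rng.choice(n, k, replace=False)]
        for _ in range(iters):
            d = ((S[:, None, :] - centers[None]) ** 2).sum(-1)
            labels = d.argmin(1)
            for c in range(k):
                if (labels == c).any():
                    centers[c] = S[labels == c].mean(0)
        co += labels[:, None] == labels[None, :]
    return co / len(joint_scores_draws)

# Toy draws: 7 samples near 0 and 45 near 5, perturbed per posterior draw
rng = np.random.default_rng(4)
base = np.vstack([np.zeros((7, 3)), 5 + np.zeros((45, 3))])
draws = [base + 0.1 * rng.normal(size=base.shape) for _ in range(20)]
co = cluster_cooccurrence(draws)
```

Pairs with co-occurrence above a threshold (e.g., 90% of draws, as in the analysis above) form a stable cluster; in the toy example, `co[0, 1]` is near 1 and `co[0, 10]` is near 0.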

Figure 4:

Heatmap of the posterior mean of estimated joint and individual structures. Columns represent samples and rows represent proteins or metabolites. Samples are ordered by FEV1pp. Blue values reflect lower expression and red values reflect higher expression relative to the rowwise mean. The orange rectangle highlights a cluster of samples with low FEV1pp driven by joint metabolomic and proteomic expression.

Figure 5:

Plot of fitted vs. observed FEV1pp for each of 52 samples with 95% credible intervals. Those colored in orange correspond to the samples which clustered together based on joint metabolomic and proteomic expression.

We applied the alignment algorithm described in Section 2.6 to study the estimated factors using posterior summaries. Alignment results were similar when we considered matching to posterior samples around the chosen pivot. We visualize the posterior summaries of all factors in a Shiny app at https://sarahsamorodnitsky.shinyapps.io/BSFP_HIV_OLD/. We discuss here the joint factor that explained the most variation within the joint structure, which we refer to as "Joint Factor 1". This factor is associated with the previously identified cluster, based on the distinct scores for these samples. The loadings for 61/252 metabolites and 666/4253 proteins had 95% credible intervals which did not contain zero, suggesting they contribute "significantly" to this factor. We used these biomarkers in a pathway analysis using IMPaLA (Kamburov et al., 2011). The top pathways were neutrophil degranulation (FDR-adjusted p-value < 2×10^-6) and innate immunity (FDR-adjusted p-value < 2×10^-6), both of which are pertinent to OLD. Neutrophil activation and subsequent inflammation are hallmarks of the disease (Herrero-Cervera et al., 2022). In addition, there is evidence that multiple host defense mechanisms, including innate immunity, play a role in OLD pathogenesis (Agustí and Hogg, 2019).
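Flagging loadings whose credible intervals exclude zero can be sketched as follows (a generic numpy sketch; the posterior draws here are synthetic, not from the fitted model):

```python
import numpy as np

def significant_loadings(loading_draws, level=0.95):
    """Flag features whose pointwise credible interval for the loading on a
    factor excludes zero.

    loading_draws : (n_draws, n_features) posterior samples for one factor
    """
    alpha = 1.0 - level
    lo = np.quantile(loading_draws, alpha / 2, axis=0)
    hi = np.quantile(loading_draws, 1 - alpha / 2, axis=0)
    return (lo > 0) | (hi < 0)

# Synthetic draws: 50 null loadings centered at zero, 10 clearly non-zero
rng = np.random.default_rng(5)
null = rng.normal(0, 1, size=(4000, 50))
real = rng.normal(3, 1, size=(4000, 10))
flags = significant_loadings(np.hstack([null, real]))
```

Features flagged this way are the ones that would be carried forward into a pathway analysis, as done above with IMPaLA.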

The metabolomics and proteomics datasets did not contain missing values. However, in the Supplementary Materials we consider artificially inducing entrywise and blockwise missingness to examine how well each dataset can be used to impute missing values in the other.

5. Discussion

In this article, we proposed a Bayesian framework for exploratory integrative factorization (BSF) and simultaneous factorization and prediction (BSFP) of multi-omic data. BSF and BSFP provide a complete framework for uncertainty in the estimated factorization, predictive model (for BSFP), and imputed values.

We showed via simulation the importance of accounting for uncertainty in the estimated factors during prediction. Otherwise, posterior inference in the predictive model may be anti-conservative, especially when signal in the omics sources is low (but even when signal is high). We did not find that incorporating an outcome into the estimation of the factorization yielded improved accuracy in recovering the underlying structure. This is consistent with previous work (Palzer et al., 2022), which showed that incorporating an outcome into the latent factorization estimation yielded only modest benefits. We found the greatest benefit of simultaneous factorization and prediction to be uncertainty quantification in the latent factorization and predictive model. Additionally, our simulations suggested BSF can be used under various missingness patterns. When observations were randomly removed from each source, BSF was competitive against existing methods for imputing unobserved values. Under blockwise missingness and MNAR, BSF yielded gains in imputation accuracy. Further, BSF offers full posterior inference for the imputed values, which can be studied using posterior summaries, and posterior inference for the unknown parameters marginalizes over the uncertainty in the imputed values. In choosing an imputation approach, researchers may value the ability to perform multiple imputation over single imputation and to access the posterior distributions for the imputed values.

Our data application revealed a cluster of participants with HIV-associated OLD driven by joint metabolomic and proteomic abundance patterns. While this sheds light on some possibly disrupted disease pathways, more research is needed to validate whether such an OLD subtype exists in other cohorts. While BSFP improved upon existing methods in terms of prediction, it was challenging to capture the heterogeneity in FEV1pp in this dataset. One challenge may have been the low sample size (52) relative to the rank of the estimated factorization (41). In the future, considering a steeper penalty in the initialization of the model (i.e., a smaller prior variance on the factorization components) could yield a lower-rank factorization with more distinguished factors.

Our approach directly extends the UNIFAC method to a fully Bayesian framework, and can also be viewed as an extension of PMF. There is a dynamic literature on alternative approaches to Bayesian matrix factorization, particularly with respect to infinite factorization models with structured shrinkage on the loadings (Bhattacharya and Dunson, 2011; Legramanti et al., 2020). Our model choice was chiefly motivated by computational efficiency and simplicity, with efficiency achieved via fast initialization at the posterior mode and conjugate priors for direct Gibbs sampling, and simplicity via the lack of hyperparameters to specify and a fixed finite model dimension. However, there are advantages of Bayesian infinite factorization approaches that are worth exploring in our context (e.g., to accommodate uncertainty in the shared and individual ranks).

There are many avenues for further development. Expanding this framework to accommodate zero-inflated or non-Gaussian data, such as SNP or imaging data, would be valuable; however, this would require an alternative strategy for rank selection, and implementing a fully Bayesian approach to rank selection would be a worthwhile solution. Another avenue is exploring alternatives for choosing the prior variances on the factorization structures, λ and λ_s; one could impose priors on them or consider a cross-validation approach. Accommodating "bidimensional" structure (multiple omics sources measured on multiple sample cohorts) would be worthwhile, as would accommodating longitudinal data structures. It would also be straightforward to extend this framework to accommodate other parametric models for the outcome, such as a parametric survival model or a Poisson model. Finally, future work could impose sparsity on the loadings or on the regression coefficients, though this would require careful consideration of model identifiability.

Supplementary Material


Figure 6:

Scores for each sample for Joint Factor 1, the joint factor that explains the largest variation within the joint structure. Each point reflects the posterior mean score and the interval reflects the 95% credible interval. Intervals colored in orange correspond to samples that belong to the stand-out cluster.

Figure 7:

Loadings of each observed metabolite for Joint Factor 1. Each point reflects the posterior mean loading and the interval reflects the 95% credible interval. Intervals colored in orange correspond to those that do not contain 0.

Figure 8:

Loadings of each observed protein for Joint Factor 1. Each point reflects the posterior mean loading and the interval reflects the 95% credible interval. Intervals colored in orange correspond to those that do not contain 0.

Table 2:

Top pathways based on the metabolites and proteins with non-zero loadings on “Joint Factor 1”, which explained the most variation across the metabolomic and proteomic datasets.

Pathway P-Value (Protein) Q-Value (Protein) P-Value (Metabolite) Q-Value (Metabolite) P-Value (Joint) Q-Value (Joint)

Neutrophil degranulation 0.00000 0.00000 1.00000 1 0.00000 0.00000
Innate Immune System 0.00000 0.00000 0.70600 1 0.00000 0.00000
Immune System 0.00000 0.00000 0.92000 1 0.00000 0.00000
Neutrophil extracellular trap formation - Homo sapiens 0.00000 0.00019 1.00000 1 0.00000 0.00019
HDMs demethylate histones 0.00000 0.00019 1.00000 1 0.00000 0.00019
Transcriptional misregulation in cancer - Homo sapiens 0.00000 0.00114 1.00000 1 0.00000 0.00114
Complement and coagulation cascades - Homo sapiens 0.00000 0.00114 1.00000 1 0.00000 0.00114
RHO GTPases activate PKNs 0.00000 0.00114 1.00000 1 0.00000 0.00114
Regulation of Insulin-like Growth Factor (IGF) transport and uptake by Insulin-like Growth Factor Binding Proteins (IGFBPs) 0.00000 0.00160 1.00000 1 0.00000 0.00160
Signaling by Interleukins 0.00000 0.00201 1.00000 1 0.00000 0.00201

Acknowledgements

This work was supported by NIH grants R01-GM130622 and R01-HL140971. The views expressed in this article are those of the authors and do not reflect the views of the United States Government, the Department of Veterans Affairs, the funders, the sponsors, or any of the authors’ affiliated academic institutions.

Footnotes

*

An R package to perform BSFP is available at https://github.com/sarahsamorodnitsky/BSFP and full analysis code is available at https://github.com/sarahsamorodnitsky/BSFP_Analysis.


References

1. Agustí A, Hogg JC, 2019. Update on the pathogenesis of chronic obstructive pulmonary disease. New England Journal of Medicine 381, 1248–1256.
2. Akata K, Leung JM, Yamasaki K, Leitao Filho FS, Yang J, Xi Yang C, Takiguchi H, Shaipanich T, Sahin B, Whalen BA, Yang CWT, Sin DD, van Eeden SF, 2022. Altered polarization and impaired phagocytic activity of lung macrophages in people with human immunodeficiency virus and chronic obstructive pulmonary disease. The Journal of Infectious Diseases 225, 862–867.
3. Argelaguet R, Velten B, Arnol D, Dietrich S, Zenz T, Marioni JC, Buettner F, Huber W, Stegle O, 2018. Multi-omics factor analysis—a framework for unsupervised integration of multi-omics data sets. Molecular Systems Biology 14, e8124.
4. Bhattacharya A, Dunson DB, 2011. Sparse Bayesian infinite factor models. Biometrika, 291–306.
5. Chekouo T, Safo SE, 2023. Bayesian integrative analysis and prediction with application to atherosclerosis cardiovascular disease. Biostatistics 24, 124–139.
6. Chekouo T, Stingo FC, Doecke JD, Do KA, 2017. A Bayesian integrative approach for multi-platform genomic data: A kidney cancer case study. Biometrics 73, 615–624.
7. Cribbs SK, Uppal K, Li S, Jones DP, Huang L, Tipton L, Fitch A, Greenblatt RM, Kingsley L, Guidot DM, et al., 2016. Correlation of the lung microbiota with metabolic profiles in bronchoalveolar lavage fluid in HIV infection. Microbiome 4, 1–11.
8. Ding DY, Li S, Narasimhan B, Tibshirani R, 2022. Cooperative learning for multiview analysis. Proceedings of the National Academy of Sciences 119, e2202113119.
9. Fuentes M, Guttorp P, Sampson PD, 2006. Using transforms to analyze space-time processes. Monographs on Statistics and Applied Probability 107, 77.
10. Gavish M, Donoho DL, 2017. Optimal shrinkage of singular values. IEEE Transactions on Information Theory 63, 2137–2152.
11. Gaynanova I, Li G, 2019. Structural learning and integrative decomposition of multi-view data. Biometrics 75, 1121–1132.
12. Hellton KH, Thoresen M, 2016. Integrative clustering of high-dimensional data with joint and individual clusters. Biostatistics 17, 537–548.
13. Herrero-Cervera A, Soehnlein O, Kenne E, 2022. Neutrophils in chronic inflammatory diseases. Cellular & Molecular Immunology 19, 177–191.
14. Hirani A, Cavallazzi R, Vasu T, Pachinburavan M, Kraft WK, Leiby B, Short W, Desimone J, Squires KE, Weibel S, et al., 2011. Prevalence of obstructive lung disease in HIV population: a cross sectional study. Respiratory Medicine 105, 1655–1661.
15. Kamburov A, Cavill R, Ebbels TM, Herwig R, Keun HC, 2011. Integrated pathway-level analysis of transcriptomics and metabolomics data with IMPaLA. Bioinformatics 27, 2917–2918.
16. Kaplan A, Lock EF, 2017. Prediction with dimension reduction of multiple molecular data sources for patient survival. Cancer Informatics 16, 1176935117718517.
17. Klami A, Virtanen S, Kaski S, 2013. Bayesian canonical correlation analysis. Journal of Machine Learning Research 14.
18. Kowarik A, Templ M, 2016. Imputation with the R package VIM. Journal of Statistical Software 74, 1–16.
19. Legramanti S, Durante D, Dunson DB, 2020. Bayesian cumulative shrinkage for infinite factorizations. Biometrika 107, 745–752.
20. Lock EF, Hoadley KA, Marron JS, Nobel AB, 2013. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. The Annals of Applied Statistics 7, 523.
21. Lock EF, Park JY, Hoadley KA, 2022. Bidimensional linked matrix factorization for pan-omics pan-cancer analysis. The Annals of Applied Statistics 16, 193.
22. Mallick H, Porwal A, Saha S, Basak P, Svetnik V, Paul E, 2023. An integrated Bayesian framework for multi-omics prediction and classification. Statistics in Medicine. doi:10.1002/sim.9953.
23. Massy WF, 1965. Principal components regression in exploratory statistical research. Journal of the American Statistical Association 60, 234–256.
24. Mnih A, Salakhutdinov RR, 2007. Probabilistic matrix factorization. Advances in Neural Information Processing Systems 20.
25. Owen AB, Perry PO, 2009. Bi-cross-validation of the SVD and the nonnegative matrix factorization. The Annals of Applied Statistics 3, 564–594.
26. Palzer EF, Wendt CH, Bowler RP, Hersh CP, Safo SE, Lock EF, 2022. sJIVE: Supervised joint and individual variation explained. Computational Statistics & Data Analysis, 107547.
27. Park JY, Lock EF, 2020. Integrative factorization of bidimensionally linked matrices. Biometrics 76, 61–74.
28. Poworoznek E, Ferrari F, Dunson D, 2021. Efficiently resolving rotational ambiguity in Bayesian matrix sampling with matching. arXiv preprint arXiv:2107.13783v1.
29. Rudelson M, Vershynin R, 2010. Non-asymptotic theory of random matrices: extreme singular values, in: Proceedings of the ICM 2010, pp. 1576–1602.
30. Safo SE, Min EJ, Haine L, 2022. Sparse linear discriminant analysis for multiview structured data. Biometrics 78, 612–623.
31. Salakhutdinov R, Mnih A, 2008. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo, in: International Conference on Machine Learning, pp. 880–887.
32. Samorodnitsky S, Hoadley KA, Lock EF, 2022. A hierarchical spike-and-slab model for pan-cancer survival using pan-omic data. BMC Bioinformatics 23, 1–21.
33. Samorodnitsky S, Lock EF, Kruk M, Morris A, Leung JM, Kunisaki KM, Griffin TJ, Wendt CH, 2023. Lung proteome and metabolome endotype in HIV-associated obstructive lung disease. ERJ Open Research 9.
34. Shen R, Mo Q, Schultz N, Seshan VE, Olshen AB, Huse J, Ladanyi M, Sander C, 2012. Integrative subtype discovery in glioblastoma using iCluster. PLoS ONE 7, e35236.
35. Stekhoven DJ, Bühlmann P, 2012. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118.
36. Tibshirani R, 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267–288.
37. Wang J, Wong RK, Mao X, Chan KCG, 2021. Matrix completion with model-free weighting, in: International Conference on Machine Learning, pp. 10927–10936.
38. White BS, Khan SA, Mason MJ, Ammad-Ud-Din M, Potdar S, Malani D, Kuusanmäki H, Druker BJ, Heckman C, Kallioniemi O, et al., 2021. Bayesian multi-source regression and monocyte-associated gene expression predict BCL-2 inhibitor resistance in acute myeloid leukemia. NPJ Precision Oncology 5, 1–11.
39. Yang Z, Michailidis G, 2016. A non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data. Bioinformatics 32, 1–8.
40. Zhang Y, Gaynanova I, 2021. Joint association and classification analysis of multi-view data. Biometrics.
