Summary
Multi-omics datasets are becoming more common, necessitating better integration methods to realize their revolutionary potential. Here, we introduce multi-set correlation and factor analysis (MCFA), an unsupervised integration method tailored to the unique challenges of high-dimensional genomics data that enables fast inference of shared and private factors. We used MCFA to integrate methylation markers, protein expression, RNA expression, and metabolite levels in 614 diverse samples from the Trans-Omics for Precision Medicine/Multi-Ethnic Study of Atherosclerosis multi-omics pilot. Samples cluster strongly by ancestry in the shared space, even in the absence of genetic information, while private spaces frequently capture dataset-specific technical variation. Finally, we integrated genetic data by conducting a genome-wide association study (GWAS) of our inferred factors, observing that several factors are enriched for GWAS hits and trans-expression quantitative trait loci. Two of these factors appear to be related to metabolic disease. Our study provides a foundation and framework for further integrative analysis of ever larger multi-modal genomic datasets.
Highlights
• Rapid, unsupervised multi-modal data integration with self-inferred tuning parameters
• 614 ancestry-diverse participants from MESA/TOPMed with 5 omics types
• Top shared components capture ancestry, even without genetic information
• Further components are enriched for GWAS hits and related to metabolic disease
Brown, Wang et al. introduce MCFA, an approach to multi-modal dataset integration that generalizes canonical correlation analysis. MCFA is broadly applicable to data integration challenges but has been designed to handle issues in population-scale multi-omics data. A variety of analyses on the TOPMed/MESA multi-omics pilot demonstrate the power of this method.
Introduction
Recent years have seen an explosion in multi-omics data, with studies simultaneously profiling RNA expression, protein levels, chromatin accessibility, and more.1 By providing complementary views into the underlying biology, these datasets promise to illuminate molecular processes and disease states that cannot be gleaned from any lone modality.2 However, existing joint inference methods are limited either in the number or type of modes that they can use or in their flexibility and efficiency.1 Multi-omics data bring substantial challenges: distributions differ between modes, the sample size is typically small relative to the number of features, efficient algorithms are needed, and each mode has contributions from factors that are shared between modes and factors that are unique to itself.3,4 Canonical correlation analysis (CCA) is a statistical technique that infers shared factors between two data modes by finding correlated linear combinations of the features in each.5 CCA has enjoyed substantial attention in genomics6,7,8,9; however, extending CCA to additional modes is fraught: at least 10 different multi-mode formulations are equivalent in the two-mode case,10 and many are challenging to fit.11 Alternatively, CCA can be conceptualized as a probabilistic model (pCCA), revealing a connection to factor analysis.12
We have developed multi-set correlation and factor analysis (MCFA; Figures 1A and S1), an unsupervised integration method that generalizes pCCA and factor analysis, enabling fast inference of shared and private factors in multi-modal data. MCFA is designed to overcome challenges that are common with genomics data such as the large number of features relative to the sample size, the disparate data types, and the unknown contributions of dataset-specific technical factors. MCFA is based on two insights: (1) unlike traditional CCA, pCCA has only one natural extension to multi-modal data, which is both conceptually elegant and efficient to fit, and (2) after fitting pCCA, the residual in a mode represents private structure, which is well modeled by factor analysis. Our method combines these insights to simultaneously fit factors that are shared across modalities and factors that are private to each modality. For efficiency and regularization, MCFA uses the top principal components (PCs) of each mode.6,7 This allows the use of random matrix techniques13 to choose the shared dimensionality and number of PCs, eliminating tuning parameters. Finally, MCFA is a natural approach to integration: as detailed in Methods S1, there is a theoretical connection between our model and multi-set CCA.
Figure 1.
Overview of MCFA integration results
(A) The MCFA model. Each observed data mode (Ym) has contributions from two latent factors, one private to it (Xm) and one shared with other modes (Z).
(B) Breakdown of the variance in four omics types captured by the inferred space, as well as the per-mode contribution to each shared factor.
(C) UMAP embedding of the shared and private spaces, annotated with the most relevant feature set. Broadly, the top shared factors capture demographics, while the top private factors capture technical variation.
(D) Variance in sample metadata explained by each learned space. This shows that the shared space also captures inferred cell-type composition estimates as well as clinical biomarkers.
We have applied MCFA to 614 ancestry-diverse individuals from the Multi-Ethnic Study of Atherosclerosis (MESA).14 The Trans-Omics for Precision Medicine (TOPMed)15 program instituted a multi-omics pilot study to evaluate the utility of long-term stored samples for discovery related to heart, lung, blood, and sleep disorders. MESA provided samples for five omics types: (1) whole-genome sequencing (WGS), (2) RNA sequencing of peripheral blood mononuclear cells (PBMCs), (3) DNA methylation array profiling from whole blood, (4) protein mass spectrometry of blood plasma, and (5) metabolite mass spectrometry of blood plasma. In addition, MESA has collected comprehensive phenotypic metadata. These data include demographic markers such as self-reported ancestry (SRA), sex, age, and education level; morphological features including height, weight, and hip circumference; clinical measures including those related to atherosclerosis, lipid levels, kidney function, and inflammatory biomarkers; and behavioral features regarding smoking, drinking, and exercise frequency.
Results
We integrated RNA sequencing, methylation, protein, and metabolite data using MCFA, which inferred a 14-dimensional shared space. We found that shared structure explained a large proportion of the variance in each mode (Figure 1B, right). Protein levels had the highest sharing with 29.2% of the variance explained (VE) by the shared space, followed by RNA and metabolite levels (16.6% and 17.1%, respectively). Methylation showed the least sharing, with only 8.1% VE by the shared space. Due to the high dimensionality of the data and the limited sample size, about half of the variance in each dataset is unmodeled to reduce overfitting. Using MCFA, it is possible to further infer the variance in each modality explained by the individual factors, thus determining which modalities contribute to each (Figure 1B, left). Our top factor has contributions from all modalities, but their respective contributions to the other factors vary substantially.
We used uniform manifold approximation and projection (UMAP)16 to construct a 2D embedding of the shared and private spaces (Figure 1C). We noticed a striking clustering of the individuals by SRA and sex in the shared space, even though the top PCs of individual modes do not cluster by these factors (Figure S2), and the shared space was inferred without genetic or sex chromosome features. Shared factor 1 separates Black and White individuals, with Hispanic individuals in between, while factor 3 separates Chinese individuals, and factor 2 differentiates by sex (Figures S2 and S3). We validated this structure via leave-one-out cross-validation, indicating our PC selection strategy mitigated over-fitting (Figure S4).
Next, we evaluated the total phenotypic VE by each of our inferred spaces (Figures 1D and S2; Tables S1, S2, and S3). The shared space captured 95.3% of the variation in sex, 83.3% in site, 80.0% in SRA, and 60.2% in age. The shared space also captured anthropometric differences such as BMI (51.0% VE) and clinical measures including those related to kidney function (creatinine, 64.8% VE) and inflammation (tumor necrosis factor [TNF]-alpha receptor-1, 69.1% VE). We used CIBERSORT17 and the Houseman method18 to estimate the cell-type composition of our RNA (PBMC) and methylation (whole blood) samples, respectively. Both shared and private spaces contributed to the relative proportions of PBMC-abundant cell types (e.g., T cells and natural killer [NK] cells) estimated from both data modalities, while the proportion of PBMC-depleted types (e.g., neutrophils) estimated from the methylation data was only captured by the methylation private space. Modality-private spaces frequently captured technical factors: 100% of the variance in sequencing center and 71.6% of the variance in 3′ bias are captured by the RNA private space, while 76.8% of the variance in methylation array batch is captured by its private space. Many phenotypes that are themselves measurements of metabolites were captured by the metabolite private space; however, the strongest association was with the month of sample collection (85.8% VE). We noticed no large associations between the protein private space and any of our metadata, despite several of our phenotypes being clinical protein markers; however, several of these factors are partially captured by the shared space.
We compared the results obtained on MESA using MCFA with other multi-modal analysis approaches. We focused on two alternative methods: (1) MOFA2 (ref. 4) and (2) a multi-modal auto-encoder (MMAE; see STAR Methods and Figure S5). In the MOFA2 analysis, the methylation batch and cell-type proportions dominated the inferred shared space, likely owing to the very large number of features in that modality compared with the other modalities (Figure 2). The MMAE mitigated this over-focus on methylation somewhat and additionally captured RNA sequencing center and RNA cell-type proportions (Figure 2). Thus, neither MOFA2 nor the MMAE was able to infer shared variation while discarding dataset-specific technical artifacts. Moreover, using up to 8 cores of an Intel Xeon E5-2697v3 CPU on our cluster, MOFA2 took approximately 56 min to run when set to “medium” tolerance, while our MMAE took approximately 109 min to converge. In contrast, MCFA processed the same dataset in around 2 min.
Figure 2.
Comparison of MCFA with other methods
(A) UMAP embeddings of the MOFA2 (left) and MMAE (right) shared spaces show that these methods fail to separate meaningful information from technical variation.
(B) Variance in sample metadata explained by the MOFA2 (top) and MMAE (bottom) shared spaces. MOFA2 primarily learns factors related to the methylation dataset, while the MMAE additionally incorporates some factors related to RNA sequencing.
(C) Correlation of each inferred factor with each sample metadata feature for MOFA2 (top) and the MMAE (bottom).
Finally, we integrated WGS data by conducting a genome-wide association study (GWAS) of the inferred factors while controlling for site, age, sex, and 11 genotype PCs. We hypothesized that genetic associations with our inferred factors, which represent major axes of molecular variation, may be enriched for known GWAS hits or trans-expression quantitative trait loci (eQTLs). We obtained a list of 10,174 such associations from the eQTLgen consortium,19 of which 3,854 are trans-eQTLs, and further defined a more limited set of 1,107 “influential” trans-eQTLs that affect at least 10 genes. We tested the GWAS of each factor for enrichment of these three categories and found 9 significant enrichments (mean $\chi^2 > 1$, false discovery rate [FDR] 5%; Figures 3A and S6).
Figure 3.
Factor interpretation and integration with GWAS data
(A) QQ-plot of a GWAS for factors 1, 2, 6, and 7. Genetic associations with these factors are enriched for known GWAS loci (1, 6, and 7), trans-eQTLs (1 and 7), or highly influential trans-eQTLs (2 and 7).
(B and C) Correlation of factors 6 (B) and 7 (C) with morphological, immune-composition, and clinical metadata reveals that factor 6 is related to body composition and lipid profile, while factor 7 is related to body composition, inferred blood cell-type composition, and inflammatory biomarkers.
(D) Z-transformed correlation of individual protein and metabolite data with factor 6 reveals genes and metabolites related to insulin resistance and metabolic syndrome.
(E) Z-transformed correlation of individual methylation values with factor 7. Many genes colocated to these CpGs are involved in lipid metabolism.
Factor 7 showed the strongest enrichment for reported GWAS hits and trans-eQTLs. The top SNPs associated with factor 7 are from blood lipid studies and are located primarily around the FADS1 and FADS2 genes, which are known to regulate lipid metabolism.20 These include rs174541 ( for factor 7 association), which is also reported in GWASs of type 2 diabetes21; rs174549 (), which is also reported in GWASs of white blood cell count22; and rs1535 (), which is also reported in a GWAS of inflammatory bowel disease23 (Table S4). Factor 7 explains 6.7% of the modeled variation in methylation, the largest of any factor, and is anti-correlated with sample proportion of CD8+ T cells and NK cells estimated from methylation data ( and ), and correlated with BMI () and measures of inflammation including TNF-R1 () and interleukin-6 () (Figure 3B).
To assess the contribution of individual CpGs, we calculated the Z-transformed correlation of individual CpG values with factor 7 (Figure 3C). As epigenome-wide association studies remain small, generally little is known about the effects of individual CpGs and their associations with traits. Instead, we linked each gene to the CpGs falling in a window from 1.5 kb upstream of the transcription start site to the transcription termination site. Many of the genes colocated to CpGs with high weights for factor 7 have been implicated in lipid metabolism GWASs including IQCG and TMEM178A (cg01328500 and cg02571055; phosphatidylcholine levels24), DSCAML1 (cg02571055; triglyceride levels25), PTK2 (cg02153245; ApoB and low-density lipoprotein [LDL] levels26); TULP4 (cg02571055; lipoprotein A levels27), and C7orf50 (cg20054412; LDL, high-density lipoprotein [HDL], and total cholesterol levels28). Interestingly, our second strongest hit, cg00697440, is colocated with CD86. Recent work has suggested that B7 molecules including CD86 play an important role in regulating CD8+ T cell population dynamics.29 While further research is needed to establish causal relationships of these genetic effects and methylation patterns in cis and trans on gene regulation and diverse traits, DNA methylation patterns have been previously associated with lipid metabolism and metabolic disease.30,31 Further research is required to determine whether the immune-cell component of this factor is related to the lipid metabolism component or whether these are simply independent biological functions captured by the same factor.
We used the same strategy to interpret factor 6. Factor 6 is correlated with fasting glucose, waist circumference, and triglycerides (, and , respectively) and anti-correlated with HDL cholesterol (; Figure 3D). Factor 6 explains 6% of the variance in protein levels and 4.1% of the variance in metabolite levels. Many of the top-weighted metabolites are uncharacterized products from untargeted metabolomics, but the two top characterized targets are 2-hydroxybutyric acid, a known marker of insulin resistance and glucose intolerance,32,33 and glucose itself (Figure 3E). Several of the top-weighted proteins in this factor have known roles in growth and development including BMP1, GHR, IGFBP2, and FGFR1. GWASs have implicated BMP1 in coronary artery disease,34,35 IGFBP2 in type 2 diabetes and BMI,36 and FGFR1 in triglyceride levels28 and waist-hip ratio.37 Other notable highly weighted proteins include TFPI, which is involved in blood coagulation and is associated with BMI-adjusted waist-hip ratio,38 and ADIPOQ, which is involved in regulating glucose levels39 (Figure 3E). Many of the top GWAS hits associated with this factor corroborate these observations, including rs4805885, which is associated with adiponectin (ADIPOQ) levels40; rs9787485, which is associated with insulin-carbohydrate interaction41; and rs7679, which is associated with HDL, LDL, and triglyceride levels42 (Table S5).
Interestingly, the strongest genetic association with this factor comes from GWASs of schizophrenia (rs112973353; for factor 6 association), and we find 5 independent schizophrenia risk loci with factor 6 association p values below 0.01 (Table S5). Insulin resistance and schizophrenia have been consistently associated for nearly 100 years,43 and while the association signal of each locus with factor 6 is relatively weak, the probability of finding 5 independent loci with these p values under the null is approximately . While further research is needed, our results suggest that these particular loci may confer schizophrenia risk via insulin resistance. Another notable signal in our GWAS associations is related to erythrocyte and platelet traits. These hits include rs12451471 (; mean corpuscular hemoglobin concentration44; platelet count45) and rs13224082 (; platelet distribution width, platelet count, plateletcrit44), among others (Table S5). Again, further research is required to establish causality and direction of effect between genetics, metabolite and protein levels, and traits, but we note that there is an established link between insulin resistance and platelet dysfunction.46
Discussion
MCFA has several advantages compared with other multi-omics integration approaches. Compared with group factor analysis methods,4 MCFA separates modality-specific from dataset-shared factors. Compared with non-negative matrix factorization-based methods that share a feature weight set across modalities,3 MCFA is able to use all data types. As we have shown, MCFA is also substantially faster and is able to handle datasets with unbalanced numbers of features across the modes.
While our top factors captured ancestry and sex, these factors are usually observed and considered confounding in clinical applications. In that context, one could fit the model conditional on known confounding factors. Since we see exploratory data analysis as a primary application of MCFA, our goal instead was to map the primary axes of biological variation contained within these population-scale multi-omics data. It is important that these factors are a primary driver of variation within such data, as it implies that sampling across race and sex is critical for equitable discovery in medical genomics. Still, because these factors are captured by the top components, and the components themselves are orthogonal, further components can still capture disease-relevant information.
Integration with GWAS is biased toward well-powered studies that will typically have more hits, some of which may be acting indirectly through another phenotype.47 Interpretability of factors is also biased toward the metadata collected in the study. In MESA, the goal was evaluation of risk factors for heart disease, and thus MESA focused metadata collection on lipid phenotypes, inflammatory biomarkers, and body morphology. It is therefore unsurprising that we are most easily able to interpret factors related to metabolic syndrome, lipid metabolism, and immune function in this study. Still, the ability of MCFA to produce results that are correlated with these factors demonstrates the utility of broad-scale sample metadata when interpreting results from multi-omics studies.
Careful consideration is required when analyzing multi-omics datasets that include WGS or genotype data. There are two primary ways that one can think about integrating these data: (1) include genetic information as a mode in the fit model, interpretable as inferring a latent state that affects genotype as well as molecular factors, or (2) look for genetic associations with inferred molecular factors, interpretable as mapping QTLs for inferred molecular phenotypes. In this study, we chose the latter due to the improved causal interpretation and to demonstrate the utility of surrogate molecular phenotypes. In other cases, for example the analysis of genetic copy-number variation data in tumor samples, the former analysis approach may be preferred. Future work with larger sample sizes may allow for network inference and Mendelian randomization methods to generate directed hypotheses.47,48 Genetic associations are particularly valuable in this, with the inferred axes of molecular variation providing promising future traits for GWAS and phenome-wide association studies. TOPMed is among the most ambitious current efforts to collect multi-omics population-level data; thus, given the results of this pilot analysis, we expect future integration studies in this cohort to be fruitful.
Limitations of the study
Due to the use of observational data and unsupervised methods, all analyses should be considered exploratory; they can find structure in the data while generating hypotheses but cannot be used to make causal claims and may reflect technical properties of the underlying data. For example, in MESA, the sample collection site is strongly correlated with SRA. We repeated our analysis of the VE by the learned space while additionally controlling for site (Table S3) and noticed a small decrease in the proportion of VE in SRA (from 80.0% to 71.6%).
We observed that estimated cell-type composition had a strong association with both shared and private spaces. Since cell-type composition was inferred from the data, there may be circularity in composition estimation itself. In addition, complex interactions exist between cell-type composition in tissue samples and clinical, environmental factors as well as technical factors related to biospecimen collection. Thus, caution is necessary for biological interpretation in this aspect of the analysis.
STAR★Methods
Key resources table
REAGENT or RESOURCE | SOURCE | IDENTIFIER |
---|---|---|
Deposited data | | |
MESA TOPMed multi-omics pilot data | dbGaP | dbGaP: phs001416.v3.p1 |
Software and algorithms | | |
Multiset Correlation and Factor Analysis | Zenodo | https://doi.org/10.5281/zenodo.7951370 |
MOFA2 | GitHub | https://github.com/bioFAM/MOFA2 |
eQTLgen trans-eQTL summary statistics | eQTLgen | https://www.eqtlgen.org/trans-eqtls.html |
Resource availability
Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Dr. Brielin Brown (bbrown@nygenome.org).
Materials availability
This study did not generate new unique reagents.
Method details
Multiset correlation and factor analysis
Let $\{Y_m\}_{m=1}^{M}$ be a set of observed data matrices: $N$ individuals measured in $M$ data modalities consisting of $P_m$ features each. We model each observed mode as having contributions from two low-dimensional hidden factors (Figures 1A and S8), written here for a single individual:
$$Z \sim \mathcal{N}(0, I_d), \qquad X_m \sim \mathcal{N}(0, I_{k_m}), \qquad Y_m \mid Z, X_m \sim \mathcal{N}(W_m Z + L_m X_m, \Phi_m),$$
where $d$ is the shared hidden dimensionality, $k_m$ are the dataset-private hidden dimensionalities, $W_m$ are shared space loading matrices, $L_m$ are private space loading matrices, and $\Phi_m$ are the diagonal residual covariance matrices. Given $\{Y_m\}$, $d$, and $\{k_m\}$, our goal is to infer the hidden factors $Z$ and $X_m$ and loading matrices $W_m$ and $L_m$. This can be accomplished using a straightforward application of expectation maximization (EM).49 For a derivation of the EM update equations, as well as a more detailed exposition including the relationship to pCCA, factor analysis, and other multiset CCA (MCCA) methods, see Methods S1. In practice, we center and scale all data variables. This is not strictly required; however, it enables simple estimation of the number of PCs to include and simplifies explained variance calculations (see below).
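To make the generative structure concrete, the following is a minimal simulation sketch of a linear-Gaussian model with one shared factor block and one private factor block per mode. It is an illustration under stated assumptions and not the published implementation: the function name `simulate_mcfa`, the loading scales, and the isotropic noise level are all hypothetical choices.

```python
import numpy as np

def simulate_mcfa(n, p_list, d, k_list, noise_sd=0.5, seed=0):
    """Draw one dataset from a shared/private linear-Gaussian factor model.

    n: individuals; p_list: features per mode; d: shared dimensionality;
    k_list: private dimensionality per mode. All names are hypothetical.
    """
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, d))                      # shared factors, one row per individual
    Y, X, W, L = [], [], [], []
    for p, k in zip(p_list, k_list):
        W_m = rng.standard_normal((p, d)) / np.sqrt(d)   # shared loadings (assumed scale)
        L_m = rng.standard_normal((p, k)) / np.sqrt(k)   # private loadings (assumed scale)
        X_m = rng.standard_normal((n, k))                # private factors for this mode
        eps = noise_sd * rng.standard_normal((n, p))     # isotropic here; any diagonal covariance fits the model
        Y.append(Z @ W_m.T + X_m @ L_m.T + eps)
        X.append(X_m)
        W.append(W_m)
        L.append(L_m)
    return Y, Z, X, W, L
```

Simulated data of this kind is mainly useful for checking that an inference routine recovers known shared and private structure before applying it to real omics data.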
Model initialization
An important aspect of EM optimization is choosing a good initialization. We benchmarked three approaches to initializing the shared loading matrices: random initialization and two versions of MCCA that correspond to maximizing the sum of pairwise correlations under the average variance and average norm constraints. These MCCA formulations can be solved via simple eigendecompositions. We found that the sum of pairwise correlations with the average variance constraint produced the best initial estimates (Figure S7). This can be solved with a simple two-step procedure: (1) whiten each data matrix using the singular value decomposition (SVD) and (2) perform a second SVD on the concatenated whitened data matrices50:
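Input: the data matrices and the shared dimensionality
Result: the initial shared loading matrices
1. Whiten each data matrix using its SVD and concatenate the whitened matrices;
2. Perform a second SVD on the concatenated matrix;
3. Take the leading components of the second SVD;
return the initial shared loading matrices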
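The two-step procedure described above (whiten each mode, then decompose the concatenation) can be sketched as follows. This is a minimal illustration, assuming centered data and returning only the shared variates and their singular values; how these are converted into initial loading matrices is not reproduced here, and the function name `mcca_two_svd` is hypothetical.

```python
import numpy as np

def mcca_two_svd(Y_list, d):
    """Whiten each mode with an SVD, then run a second SVD on the concatenation."""
    whitened = []
    for Y in Y_list:
        Yc = Y - Y.mean(axis=0)                            # center each feature
        U, s, _ = np.linalg.svd(Yc, full_matrices=False)
        keep = s > 1e-10 * s[0]                            # drop numerically null directions
        whitened.append(U[:, keep])                        # orthonormal columns = whitened data
    U_all = np.concatenate(whitened, axis=1)               # individuals x (sum of ranks)
    U2, s2, _ = np.linalg.svd(U_all, full_matrices=False)  # second SVD on the concatenation
    return U2[:, :d], s2[:d]                               # shared variates, top singular values
```

With $M$ modes, the squared singular values of the concatenated matrix lie between 0 and $M$, and values close to $M$ indicate a direction present in every mode.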
We initialize the private loading matrices and the residual variances using probabilistic PCA on the residual data matrices after fitting MCCA. Specifically:
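Input: a data matrix, its fitted shared component, and the private dimensionality
Result: the initial private loading matrix and residual variance
1. Compute the residual of the data matrix after removing the fitted shared component;
2. Compute the eigendecomposition (eigh) of the residual covariance matrix;
3. Set the residual variance to the mean of the trailing eigenvalues;
4. Form the private loadings from the leading eigenvectors and eigenvalues;
return the private loading matrix and residual variance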
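For reference, the standard closed-form probabilistic PCA estimates (Tipping and Bishop) can be computed from an eigendecomposition as sketched below. This is an assumption about what the initialization amounts to, not a copy of the published code; `ppca_init` and its arguments are hypothetical names, and `k` must be smaller than the number of features.

```python
import numpy as np

def ppca_init(R, k):
    """Closed-form probabilistic PCA estimates for a residual matrix R (individuals x features)."""
    n, _ = R.shape
    Rc = R - R.mean(axis=0)
    cov = Rc.T @ Rc / n                                   # sample covariance of the residual
    evals, evecs = np.linalg.eigh(cov)                    # eigh returns ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]            # reorder to descending
    sigma2 = evals[k:].mean()                             # noise variance = mean discarded eigenvalue
    scale = np.sqrt(np.maximum(evals[:k] - sigma2, 0.0))  # guard against small negative values
    loadings = evecs[:, :k] * scale                       # features x k loading matrix
    return loadings, sigma2
```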
High dimensionality and selection of hyperparameters
There are two primary approaches to control for over-fitting in applications of CCA-type methods to high-dimensional ($P \gg N$) problems. The first is to use penalized optimization techniques, where the objective function additionally contains an $\ell_1$ constraint on the weight matrices.51 The second is to project each dataset onto its informative principal components.6,7,11 In this application, we choose the latter approach in order to find components with broad effects on the structure of the data, rather than specific effects on small numbers of molecular features.11 We choose the number of principal components of each dataset using the Marchenko-Pastur law,13 which states that for mean $0$, variance $1$ data with $N$ individuals and $P_m$ features, principal components with corresponding eigenvalues above $(1 + \sqrt{P_m/N})^2$ should be considered non-noise. We are not aware of a corresponding law for the cross-covariance matrices used in CCA; however, the empirical spectral distribution of the cross-covariance of matrices of random noise can be easily estimated in practice:
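Input: the dimensions of each data matrix and the number of iterations
Result: the expected top value under random noise
for each iteration do
    for each mode do
        draw a random noise matrix with the same dimensions as that mode
    end
    record max(InitializeMCFA(noise matrices))
end
return mean(recorded maxima)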
We then keep all shared components whose corresponding value exceeds this estimated noise level.
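Both selection steps can be illustrated with a short sketch. The first function counts PCs above the Marchenko-Pastur upper edge for standardized data; the second estimates, by simulation, the top singular value obtained when whitened pure-noise matrices are concatenated. The second function is a simplified stand-in for the noise-based procedure described above, and both function names and the default number of iterations are assumptions.

```python
import numpy as np

def mp_num_pcs(Y):
    """Count PCs whose covariance eigenvalue exceeds the Marchenko-Pastur upper edge."""
    n, p = Y.shape
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)             # mean 0, variance 1 features
    s = np.linalg.svd(Ys, compute_uv=False)
    evals = s**2 / n                                      # eigenvalues of the sample covariance
    edge = (1.0 + np.sqrt(p / n)) ** 2                    # upper edge for unit-variance noise
    return int(np.sum(evals > edge)), edge

def noise_top_singular_value(n, ranks, n_iter=20, seed=0):
    """Average top singular value of concatenated, whitened pure-noise matrices."""
    rng = np.random.default_rng(seed)
    tops = []
    for _ in range(n_iter):
        blocks = []
        for r in ranks:                                   # r = number of PCs kept for a mode
            G = rng.standard_normal((n, r))
            U = np.linalg.svd(G - G.mean(axis=0), full_matrices=False)[0]
            blocks.append(U)                              # whitened noise block
        s = np.linalg.svd(np.concatenate(blocks, axis=1), compute_uv=False)
        tops.append(s[0])
    return float(np.mean(tops))
```

Under these assumptions, a shared component would be retained only if its singular value in the real data exceeds the simulated noise average.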
Calculating the variance explained
The linear-Gaussian nature of the model simplifies estimation of the variance explained. That is, if the features of each mode are normalized to variance $1$, the model implies that the variance in feature $j$ of mode $m$ explained by shared factor $k$ is $W_{m,jk}^2$, where $W_m$ is the shared loading matrix of mode $m$. Likewise, the variance explained by the $k$-th private factor of mode $m$ is $L_{m,jk}^2$, where $L_m$ is the private loading matrix of mode $m$. With $P_m$ features in mode $m$, the total variance in mode $m$ explained by a given shared factor $k$ (respectively, private factor $k$) is thus given by $\frac{1}{P_m}\sum_j W_{m,jk}^2$ (respectively, $\frac{1}{P_m}\sum_j L_{m,jk}^2$), and the total variance in the mode explained by the shared and private factors is $\frac{1}{P_m}\sum_{j,k} W_{m,jk}^2$ and $\frac{1}{P_m}\sum_{j,k} L_{m,jk}^2$, respectively. Note that when working in PC-space, the raw $W_m$ and $L_m$ features correspond to variance in PCs explained, rather than modality features. Thus, we calculate the variance explained after projecting back into the original feature space using $V_m$, where $V_m$ are the right singular vectors of mode $m$.
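As a small worked illustration of this calculation, the sketch below computes per-factor and total variance explained from a loading matrix. It assumes unit-variance features, independent unit-variance factors, and loadings already expressed in the original feature space; the function name is hypothetical.

```python
import numpy as np

def variance_explained(loadings):
    """Fraction of variance explained per factor and in total (loadings: features x factors)."""
    per_feature = loadings**2                 # variance in feature j explained by factor k
    per_factor = per_feature.mean(axis=0)     # average over the unit-variance features
    return per_factor, float(per_factor.sum())

# Two features, two factors: each factor loads on a single feature.
example = np.array([[0.6, 0.0],
                    [0.0, 0.8]])
per_factor, total = variance_explained(example)
```

In this toy example the first factor explains 36% of the first feature and the second factor 64% of the second, so the two factors explain 18% and 32% of the mode, respectively.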
To calculate the variance in a metadata feature explained by a particular space, we regressed the trait value on the shared or private space, $Z$ or $X_m$. For continuous-valued traits, we used linear regression as implemented in scikit-learn v1.0 linear_model.LinearRegression and report the coefficient of determination.52 For discrete-valued traits, we used multinomial logistic regression as implemented in scikit-learn v1.0 linear_model.LogisticRegression.52 We fit two models: a null model including only an intercept, or an intercept and site, and an alternative model that additionally includes the factor variables. We report the variance explained as the McFadden pseudo-$R^2$, $1 - \ell_1/\ell_0$, with $\ell_0$ and $\ell_1$ being the model negative log likelihood for the null and alternative model, respectively.53
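The two measures can be sketched with scikit-learn as follows. This is a simplified illustration rather than the analysis code: the null model here contains only an intercept (no site covariate), a large `C` is used to approximate an unpenalized logistic fit, and the function names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import log_loss

def r2_continuous(F, y):
    """Coefficient of determination of a continuous trait y regressed on factors F."""
    return LinearRegression().fit(F, y).score(F, y)

def mcfadden_r2(F, y):
    """McFadden pseudo-R^2: 1 - NLL(alternative) / NLL(intercept-only null)."""
    y = np.asarray(y)
    _, counts = np.unique(y, return_counts=True)
    freqs = counts / counts.sum()
    nll_null = -np.sum(counts * np.log(freqs))             # intercept-only fit predicts class frequencies
    alt = LogisticRegression(C=1e6, max_iter=2000).fit(F, y)  # large C: near-unpenalized fit (assumption)
    nll_alt = log_loss(y, alt.predict_proba(F), labels=alt.classes_, normalize=False)
    return 1.0 - nll_alt / nll_null
```

Adding a covariate such as site to the null model would require fitting a second logistic regression on that covariate alone and using its negative log likelihood in place of `nll_null`.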
Calculating relative feature importance
Feature importance in traditional CCA is defined by the correlation of the variables in the reduced space. Unfortunately, this notion breaks down in higher dimensions. As we discuss further in Methods S1, the degree of sharing in MCCA is defined by functions of the cross-correlation matrix of the reduced variables across modes.
We seek to define an analogous quantity for our graphical model. In MCFA, the data in the reduced (shared) space is given by the posterior mean of $Z$, $E[Z \mid Y_1, \ldots, Y_M]$. We can also calculate the posterior mean of $Z$ conditional on observing a single mode, $E[Z \mid Y_m]$. This latter quantity is analogous to the reduced variables in MCCA. Thus we can summarize the importance of each dimension $k$ of the shared space by calculating functions of the cross-correlation of the $k$-th columns of $E[Z \mid Y_m]$ across modes,
$$\rho^{(k)}_{ij} = \mathrm{corr}\left(E[Z \mid Y_i]_{\cdot k},\; E[Z \mid Y_j]_{\cdot k}\right).$$
The relevant function in our model is the generalized variance $\det(\rho^{(k)})$, see Methods S1. The determinant of a correlation matrix is bounded between $0$ and $1$, with lower values indicating more correlation, and higher values less. Thus, to aid interpretability, we report $1 - \det(\rho^{(k)})$ and reorder columns of $W_m$ and $Z$ with decreasing $1 - \det(\rho^{(k)})$.
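A determinant-based sharing score of this kind can be sketched as follows, taking as input one estimate of the shared factors per mode. Reporting one minus the determinant, so that larger values mean more sharing, is an assumption made for this illustration, as are the function and argument names.

```python
import numpy as np

def factor_sharing(Z_by_mode):
    """Score each shared dimension by 1 - det of its cross-mode correlation matrix.

    Z_by_mode: list of (individuals x factors) arrays, one per mode, e.g. the
    posterior mean of the shared factors given that mode alone.
    """
    stacked = np.stack(Z_by_mode, axis=0)                 # modes x individuals x factors
    scores = []
    for k in range(stacked.shape[2]):
        rho = np.corrcoef(stacked[:, :, k])               # modes x modes correlation matrix
        scores.append(1.0 - np.linalg.det(rho))           # assumed convention: larger = more shared
    scores = np.array(scores)
    order = np.argsort(-scores)                           # most shared dimension first
    return scores, order
```

The returned `order` can be used to permute the columns of the loading matrices and the shared factors consistently.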
SNP set enrichment analysis
For SNP set enrichment analysis, we broadly follow the approach of CAMERA.54 In brief, enrichment statistics can be inflated due to correlations in the sample; in this case, linkage disequilibrium between GWAS SNPs. This results in an under-estimate of the standard error of the enrichment test statistic and an increase in false positives. We calculate the variance inflation factor (VIF) by using plink (v1.9)55 to estimate linkage disequilibrium between annotation SNPs in unrelated individuals from the UK Biobank.56 The variance inflation factor for a set $S$ of $n_S$ SNPs is $\mathrm{VIF} = 1 + (n_S - 1)\bar{\rho}_S$, with $\bar{\rho}_S$ the average Pearson correlation between features in set $S$. We test the null hypothesis that the mean $\chi^2$ statistic of the known GWAS SNPs is $1$ against the alternative that it is greater than $1$. The standard error of the test statistic is $s\sqrt{\mathrm{VIF}/n_S}$, with $s$ the pooled empirical standard deviation of the test statistics.
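The effect of the variance inflation factor on the test can be seen in a short sketch. It uses the CAMERA-style factor $1 + (m - 1)\bar{\rho}$ for a set of $m$ statistics and, as a simplification, the standard deviation of the set itself in place of a pooled estimate; the function name and the example numbers are hypothetical.

```python
import numpy as np

def enrichment_z(stats, mean_corr, null_mean=1.0):
    """Z-score for the mean statistic of a SNP set with a correlation-inflated standard error."""
    stats = np.asarray(stats, dtype=float)
    m = stats.size
    vif = 1.0 + (m - 1) * mean_corr                 # variance inflation from average correlation
    se = stats.std(ddof=1) * np.sqrt(vif / m)       # inflated standard error of the mean
    return (stats.mean() - null_mean) / se, vif

# Five made-up statistics: ignoring correlation overstates the evidence.
z_indep, _ = enrichment_z([1, 2, 3, 4, 5], mean_corr=0.0)
z_corr, vif = enrichment_z([1, 2, 3, 4, 5], mean_corr=0.25)
```

In this toy example an average correlation of 0.25 doubles the variance of the mean, which shrinks the z-score by a factor of $\sqrt{2}$.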
The MESA multi-omics pilot
The Multi-Ethnic Study of Atherosclerosis (MESA) is a prospective cohort study with the goal to identify progression of subclinical atherosclerosis.14 MESA recruited 6,814 participants, ages 45–84 years and free of clinical cardiovascular disease, during 2000–2002. The participants are 53% female, 38% non-Hispanic white, 28% Black, 22% Hispanic and 12% Asian-American. The Multi-Omics pilot dataset includes 30x whole genome sequencing (WGS) through the Trans-Omics for Precision Medicine (TOPMed) Project.15
Blood samples for multi-omic analysis of participants were collected at two time points (exam 1 and exam 5). RNA expression was profiled using poly-A RNA sequencing of PBMCs, and methylation was quantified by the Illumina 750K EPIC array in whole blood. The levels of 1,305 proteins were measured from plasma samples using the standard SOMAscan DNA aptamer–based platform, and metabolite levels were determined from targeted and untargeted mass spectrometry of blood plasma. The MESA Multi-Omics pilot biospecimen collection, molecular phenotype data production and quality control (QC) are described in detail in Kasela et al.57
Cross-validation
We used leave-one-out cross-validation (CV) to evaluate our model. The primary reason we chose leave-one-out CV over $k$-fold CV is that our hyperparameter selection method depends on the sample size. With $N - 1$ individuals, the same parameters used for the full inference procedure are likely to be valid. For small $k$, fitting with $N(k-1)/k$ individuals while using the same number of PCs may result in over-fitting in the training set, and using a smaller number of PCs may not capture the same variation as the full model.
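To perform cross-validation, we hold out a set of individuals, fit the MCFA model, then project the held-out individuals into the learned space. If $W_m$ and $L_m$ are the model parameters learned from the training set, the projections of the held-out data $Y^{\mathrm{test}}$ into the learned spaces are given by the posterior means of the hidden factors under those parameters,
$$\hat{Z} = E[Z \mid Y^{\mathrm{test}}], \qquad \hat{X}_m = E[X_m \mid Y^{\mathrm{test}}].$$
The full data reconstruction is
$$\hat{Y}_m = W_m \hat{Z} + L_m \hat{X}_m.$$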
We evaluate model fit by calculating the normalized root mean squared error (NRMSE). In order to provide a fair evaluation across modes with a highly variable number of features, we calculate NRMSE on a per mode basis
and potential over-fitting can be assessed by comparing the median training set NRMSE against the median test set NRMSE over many cross-validation iterations.
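For a linear-Gaussian factor model, projecting held-out individuals reduces to a posterior mean that has a closed form, which the sketch below uses to reconstruct the data and score each mode. It is an illustration under assumptions: the data are taken to be centered, the residual covariances are passed as vectors of variances, and the NRMSE is normalized by the standard deviation of the held-out mode, which may differ from the normalization used in the study.

```python
import numpy as np

def project_and_score(Y_test, W, L, phi):
    """Posterior-mean projection of held-out data and per-mode NRMSE of the reconstruction.

    Y_test: list of (n x p_m) arrays; W: list of (p_m x d) shared loadings;
    L: list of (p_m x k_m) private loadings; phi: list of length-p_m residual variances.
    """
    d = W[0].shape[1]
    p = [w.shape[0] for w in W]
    k = [l.shape[1] for l in L]
    A = np.zeros((sum(p), d + sum(k)))                    # joint loadings: [shared | private blocks]
    row, col = 0, d
    for W_m, L_m in zip(W, L):
        A[row:row + W_m.shape[0], :d] = W_m
        A[row:row + W_m.shape[0], col:col + L_m.shape[1]] = L_m
        row += W_m.shape[0]
        col += L_m.shape[1]
    Y = np.concatenate(Y_test, axis=1)
    Sigma = A @ A.T + np.diag(np.concatenate(phi))        # marginal covariance of the stacked data
    H = np.linalg.solve(Sigma, Y.T).T @ A                 # posterior means of all hidden factors
    Y_hat = H @ A.T                                       # reconstruction from shared + private parts
    nrmse, row = [], 0
    for p_m in p:
        block = Y[:, row:row + p_m]
        err = block - Y_hat[:, row:row + p_m]
        nrmse.append(float(np.sqrt(np.mean(err**2)) / block.std()))  # assumed normalization
        row += p_m
    return H[:, :d], nrmse
```

Comparing the median of these scores between training and held-out individuals over many iterations gives the over-fitting check described above.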
Comparison to MOFA2 and MMAE
We installed MOFA2 version 0.6.7 using pip install mofapy2. We used the options scale_groups = False, scale_views = False, ard_weights = True and spikeslab_weights = True. We set the convergence criterion with convergence_mode = 'medium'. For comparison purposes, we set the number of factors equal to the hidden dimensionality inferred by MCFA (factors = 14).
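For readers who wish to reproduce this configuration, the options above can be collected as follows. This is a sketch under stated assumptions: the method names follow the mofapy2 entry_point interface as we understand it and should be checked against the installed version, and the function fit_mofa and the layout of its data argument are hypothetical.

```python
# Options reported in the text, gathered in one place.
MOFA_DATA_OPTIONS = {"scale_groups": False, "scale_views": False}
MOFA_MODEL_OPTIONS = {
    "factors": 14,  # hidden dimensionality inferred by MCFA
    "ard_weights": True,
    "spikeslab_weights": True,
}
MOFA_TRAIN_OPTIONS = {"convergence_mode": "medium"}


def fit_mofa(data):
    """Fit MOFA2 with the options above (hypothetical wrapper).

    `data` is assumed to be a nested list indexed as data[view][group],
    each element a samples-by-features array.
    """
    # Imported lazily so the option dictionaries can be inspected
    # without mofapy2 being installed.
    from mofapy2.run.entry_point import entry_point

    ent = entry_point()
    ent.set_data_options(**MOFA_DATA_OPTIONS)
    ent.set_data_matrix(data)
    ent.set_model_options(**MOFA_MODEL_OPTIONS)
    ent.set_train_options(**MOFA_TRAIN_OPTIONS)
    ent.build()
    ent.run()
    return ent
```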
Our multi-modal auto-encoder (MMAE) architecture is visualized in Figure S4. We used two hidden layers per dataset, with the first layer having dimensionality equal to 8 times that modality's MCFA-inferred number of PCs, and the second layer having dimensionality equal to that modality's MCFA-inferred number of PCs. These layers are then concatenated and passed through an additional hidden layer with 8 times the MCFA-inferred number of shared dimensions to the final 14-dimensional encoded representation. All layers except the final encoder layer consist of a linear transform followed by ReLU activation, while the final encoder layer omits the ReLU activation. The decoder had the same architecture as the encoder, but reversed. The network was implemented in PyTorch v1.11.0 and optimized with Adagrad using 10 batches per epoch until the NRMSE change relative to the total loss fell below a preset tolerance.
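The layer sizes of the encoder can be traced with a small forward pass. The sketch below is not the PyTorch network used here: it uses NumPy with random, untrained weights purely to show how the dimensions compose, and the modality names, feature counts, and PC counts in the usage example are hypothetical.

```python
import numpy as np


def linear(rng, n_in, n_out):
    """Random weights and zero bias for one linear layer (untrained)."""
    weights = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_out))
    return weights, np.zeros(n_out)


def relu(x):
    return np.maximum(x, 0.0)


def build_encoder(rng, n_features, n_pcs, n_shared=14, width=8):
    """Per-modality branches (features -> 8 * PCs -> PCs), followed by a
    shared head (concatenation -> 8 * n_shared -> n_shared)."""
    branches = {
        m: [
            linear(rng, n_features[m], width * n_pcs[m]),
            linear(rng, width * n_pcs[m], n_pcs[m]),
        ]
        for m in n_features
    }
    concat_dim = sum(n_pcs[m] for m in n_features)
    head = [
        linear(rng, concat_dim, width * n_shared),
        linear(rng, width * n_shared, n_shared),
    ]
    return branches, head


def encode(encoder, data):
    """Map a dict of samples-by-features arrays to the shared encoding."""
    branches, head = encoder
    hidden = []
    for m, layers in branches.items():
        h = data[m]
        for weights, bias in layers:
            h = relu(h @ weights + bias)
        hidden.append(h)
    h = np.concatenate(hidden, axis=1)
    (w1, b1), (w2, b2) = head
    h = relu(h @ w1 + b1)
    return h @ w2 + b2  # the final encoder layer omits the ReLU


# Usage with made-up sizes: 4 samples, two modalities.
rng = np.random.default_rng(0)
n_features = {"rna": 50, "protein": 30}
n_pcs = {"rna": 5, "protein": 3}
encoder = build_encoder(rng, n_features, n_pcs)
data = {m: rng.normal(size=(4, n_features[m])) for m in n_features}
encoding = encode(encoder, data)
print(encoding.shape)
```

The decoder described in the text would mirror these layers in reverse order; it is omitted here to keep the sketch short.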
Quantification and statistical analysis
We analyzed individuals from Exam 1 for whom all five data types were collected and passed QC. All data modalities were inverse-rank normalized prior to sample filtering based on the availability of other data types. There were 614 individuals with observations of WGS, RNA-seq, methylation, metabolomics and proteomics that all passed QC. We further removed all features (CpGs, genes, proteins) located on sex chromosomes, zero-variance features, CpGs with missing data, and CpGs where the probe was within 5 bases of a SNP, leaving us with metabolites, proteins, genes, and CpGs. We analyzed PCs of RNA expression, PCs of methylation, PCs of protein expression and PCs of metabolite levels, as determined using the aforementioned method. For sample metadata, we leveraged the rich phenotype data available in MESA that were harmonized by the TOPMed Data Coordinating Center.58 For details on the estimation of sample cell-type proportions from methylation and RNA-seq data, see Kasela et al.57 Genetic association analyses were conducted using plink v1.9 (ref. 55) while controlling for site, age, sex and 11 genotype PCs; reported p-values are uncorrected and tested against a null of no effect. SNP set enrichment significance was defined as having an FDR -value below when corrected for 3 tested sets across 14 factors tested against the null hypothesis that the mean test statistic is 1.
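Inverse-rank normalization, applied to every modality above, replaces each value by the standard normal quantile of its rank. The following is a minimal sketch for a single feature, not the code used in this study. It assumes the Blom offset of 3/8 and breaks ties by order of appearance; both choices are assumptions, since the text does not specify them.

```python
from statistics import NormalDist

import numpy as np


def inverse_rank_normalize(values, offset=0.375):
    """Map one feature's values to standard normal quantiles by rank.

    The offset (Blom, 3/8) and the tie handling (order of appearance)
    are assumptions made for illustration.
    """
    values = np.asarray(values, dtype=float)
    n = values.size
    # Assign ranks 1..n; a stable sort breaks ties by position.
    ranks = np.empty(n)
    ranks[np.argsort(values, kind="stable")] = np.arange(1, n + 1)
    # Convert ranks to quantiles strictly inside (0, 1).
    quantiles = (ranks - offset) / (n - 2 * offset + 1)
    standard_normal = NormalDist()
    return np.array([standard_normal.inv_cdf(q) for q in quantiles])


normalized = inverse_rank_normalize([3.0, 1.0, 2.0])
print(normalized)
```

The transform preserves the ordering of samples within a feature while giving every feature the same approximately Gaussian marginal distribution, which puts modes with very different raw distributions on a common scale.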
Acknowledgments
B.C.B. would like to thank Lior Pachter and Nicholas Bray for numerous insightful conversations about CCA over the years. B.C.B. would also like to thank Andrew Stirn for reviewing the auto-encoder code. Funding for D.A.K. and B.C.B. is provided by NIA U01AG068880. Funding for B.C.B. is provided by NHGRI K99HG012373 and the Columbia Data Science Institute. Funding for T.L. and S.K. is provided by National Heart, Lung, and Blood Institute (NHLBI) R01HL142028. Funding for T.L. is provided by NIH R01AG057422 and NIMH R01MH106842. Funding for MESA lung measures is provided by NHLBI R01HL077612 and R01HL093081. WGS for the TOPMed program was supported by the NHLBI. WGS for “NHLBI TOPMed: Multi-Ethnic Study of Atherosclerosis (MESA)” (phs001416.v1.p1) was performed at the Broad Institute of MIT and Harvard (3U54HG003067-13S1). Centralized read mapping and genotype calling, along with variant quality metrics and filtering, were provided by the TOPMed Informatics Research Center (3R01HL-117626-02S1). Phenotype harmonization, data management, sample-identity quality control (QC), and general study coordination were provided by the TOPMed Data Coordinating Center (3R01HL-120393-02S1) and TOPMed MESA Multi-Omics (HHSN2682015000031/HSN26800004). The MESA projects are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators (accession number phs000209.v13.p3). Support for MESA is provided by contracts 75N92020D00001, HHSN268201500003I, N01-HC-95159, 75N92020D00005, N01-HC-95160, 75N92020D00002, N01-HC-95161, 75N92020D00003, N01-HC-95162, 75N92020D00006, N01-HC-95163, 75N92020D00004, N01-HC-95164, 75N92020D00007, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168, N01-HC-95169, UL1-TR-000040, UL1-TR-001079, UL1-TR-001420, UL1TR001881, DK063491, and R01HL105756.
The authors thank the other investigators, the staff, and the participants of the MESA study for their valuable contributions. A full list of participating MESA investigators and institutes can be found at http://www.mesa-nhlbi.org.
Author contributions
Conceptualization, B.C.B., D.A.K., and T.L.; methodology, B.C.B. and C.W.; software, B.C.B. and C.W.; validation, B.C.B.; formal analysis, B.C.B., C.W., and D.A.K.; investigation, B.C.B. and C.W.; resources, S.K., F.A., D.C.N., K.D.T., R.P.T., P.D., Y.L., W.C.J., D.V.D.B., N.G., S.G., J.D.S., R.G., C.C., Q.W., G.P., T.W.B., J.I.R., S.S.R., R.G.B., K.G.A., D.A.K., and T.L.; data curation, B.C.B., S.K., F.A., D.C.N., K.D.T., R.P.T., P.D., Y.L., W.C.J., D.V.D.B., N.G., S.G., J.D.S., R.G., C.C., Q.W., G.P., T.W.B., J.I.R., S.S.R., R.G.B., K.G.A., D.A.K., and T.L.; writing – original draft, B.C.B.; writing – review & editing, B.C.B., C.W., S.K., J.I.R., S.S.R., D.A.K., and T.L.; supervision, J.I.R., S.S.R., D.A.K., and T.L.; funding acquisition, B.C.B., R.G.B., D.A.K., and T.L.
Declaration of interests
T.L. is a paid adviser or consultant of GSK, Pfizer, and Goldfinch Bio and has equity in Variant Bio. F.A. is an employee and shareholder of Illumina, Inc.
Published: July 10, 2023
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.xgen.2023.100359.
Data and code availability
The MESA TOPMed multi-omics pilot data have been deposited in dbGaP and are publicly available as of the date of publication. The accession number is listed in the key resources table. All original code has been deposited on Zenodo and is publicly available as of the date of publication. The DOI is listed in the key resources table. The code is also available on GitHub at https://github.com/collinwa/MCFA. Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
References
- 1. Krassowski M., Das V., Sahu S.K., Misra B.B. State of the Field in Multi-Omics Research: From Computational Needs to Data Mining and Sharing. Front. Genet. 2020;11:610798. doi: 10.3389/FGENE.2020.610798.
- 2. Hasin Y., Seldin M., Lusis A. Multi-omics approaches to disease. Genome Biol. 2017;18:83. doi: 10.1186/S13059-017-1215-1.
- 3. Welch J.D., Kozareva V., Ferreira A., Vanderburg C., Martin C., Macosko E.Z. Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity. Cell. 2019;177:1873–1887.e17. doi: 10.1016/j.cell.2019.05.006.
- 4. Argelaguet R., Velten B., Arnol D., Dietrich S., Zenz T., Marioni J.C., Buettner F., Huber W., Stegle O. Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets. Mol. Syst. Biol. 2018;14:e8124. doi: 10.15252/msb.20178124.
- 5. Hotelling H. Relations Between Two Sets of Variates. Biometrika. 1936;28:321–377. doi: 10.2307/2333955.
- 6. Brown B.C., Bray N.L., Pachter L. Expression reflects population structure. PLoS Genet. 2018;14 doi: 10.1371/journal.pgen.1007841.
- 7. Soneson C., Lilljebjörn H., Fioretos T., Fontes M. Integrative analysis of gene expression and copy number alterations using canonical correlation analysis. BMC Bioinformatics. 2010;11:1–20. doi: 10.1186/1471-2105-11-191.
- 8. Naylor M.G., Lin X., Weiss S.T., Raby B.A., Lange C. Using Canonical Correlation Analysis to Discover Genetic Regulatory Variants. PLoS One. 2010;5:e10395. doi: 10.1371/JOURNAL.PONE.0010395.
- 9. Butler A., Hoffman P., Smibert P., Papalexi E., Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 2018;36:411–420. doi: 10.1038/nbt.4096.
- 10. Kettenring J.R. Canonical analysis of several sets of variables. Biometrika. 1971;58:433–451.
- 11. Asendorf N.A. Informative Data Fusion: Beyond Canonical Correlation Analysis. 2015. https://deepblue.lib.umich.edu/handle/2027.42/113419
- 12. Bach F.R., Jordan M.I. A Probabilistic Interpretation of Canonical Correlation Analysis. 2005. https://www.di.ens.fr/~fbach/probacca.pdf
- 13. Marčenko V.A., Pastur L.A. Distribution of Eigenvalues for Some Sets of Random Matrices. Math. USSR. Sb. 1967;1:457–483. doi: 10.1070/sm1967v001n04abeh001994.
- 14. Bild D.E., Bluemke D.A., Burke G.L., Detrano R., Diez Roux A.V., Folsom A.R., Greenland P., Jacob D.R., Jr., Kronmal R., Liu K., et al. Multi-Ethnic Study of Atherosclerosis: objectives and design. Am. J. Epidemiol. 2002;156:871–881. doi: 10.1093/AJE/KWF113.
- 15. Taliun D., Harris D.N., Kessler M.D., Carlson J., Szpiech Z.A., Torres R., Taliun S.A.G., Corvelo A., Gogarten S.M., Kang H.M., et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590:290–299. doi: 10.1038/s41586-021-03205-y.
- 16. Mcinnes L., Healy J., Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Preprint at arXiv. 2020. doi: 10.48550/arXiv.1802.03426.
- 17. Newman A.M., Liu C.L., Green M.R., Gentles A.J., Feng W., Xu Y., Hoang C.D., Diehn M., Alizadeh A.A. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods. 2015;12:453–457. doi: 10.1038/nmeth.3337.
- 18. Houseman E.A., Accomando W.P., Koestler D.C., Christensen B.C., Marsit C.J., Nelson H.H., Wiencke J.K., Kelsey K.T. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012;13:86. doi: 10.1186/1471-2105-13-86.
- 19. Võsa U., Claringbould A., Westra H.J., Bonder M.J., Deelen P., Zeng B., Kirsten H., Saha A., Kreuzhuber R., Yazar S., et al. Large-scale cis- and trans-eQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression. Nat. Genet. 2021;53:1300–1310. doi: 10.1038/s41588-021-00913-z.
- 20. Schaeffer L., Gohlke H., Müller M., Heid I.M., Palmer L.J., Kompauer I., Demmelmair H., Illig T., Koletzko B., Heinrich J. Common genetic variants of the FADS1 FADS2 gene cluster and their reconstructed haplotypes are associated with the fatty acid composition in phospholipids. Hum. Mol. Genet. 2006;15:1745–1756. doi: 10.1093/HMG/DDL117.
- 21. Dupuis J., Langenberg C., Prokopenko I., Saxena R., Soranzo N., Jackson A.U., Wheeler E., Glazer N.L., Bouatia-Naji N., Gloyn A.L., et al. New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nat. Genet. 2010;42:105–116. doi: 10.1038/NG.520.
- 22. Astle W.J., Elding H., Jiang T., Allen D., Ruklisa D., Mann A.L., Mead D., Bouman H., Riveros-Mckay F., Kostadima M.A., et al. The Allelic Landscape of Human Blood Cell Trait Variation and Links to Common Complex Disease. Cell. 2016;167:1415–1429.e19. doi: 10.1016/J.CELL.2016.10.042.
- 23. Liu J.Z., Van Sommeren S., Huang H., Ng S.C., Alberts R., Takahashi A., Ripke S., Lee J.C., Jostins L., Shah T., et al. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat. Genet. 2015;47:979–986. doi: 10.1038/NG.3359.
- 24. Rhee E.P., Ho J.E., Chen M.H., Shen D., Cheng S., Larson M.G., Ghorbani A., Shi X., Helenius I.T., O’Donnell C.J., et al. A genome-wide association study of the human metabolome in a community-based cohort. Cell Metab. 2013;18:130–143. doi: 10.1016/J.CMET.2013.06.013.
- 25. Pollin T.I., Damcott C.M., Shen H., Ott S.H., Shelton J., Horenstein R.B., Post W., McLenithan J.C., Bielak L.F., Peyser P.A., et al. A null mutation in human APOC3 confers a favorable plasma lipid profile and apparent cardioprotection. Science. 2008;322:1702–1705. doi: 10.1126/SCIENCE.1161524.
- 26. Richardson T.G., Sanderson E., Palmer T.M., Ala-Korpela M., Ference B.A., Davey Smith G., Holmes M.V. Evaluating the relationship between circulating lipoprotein lipids and apolipoproteins with risk of coronary heart disease: A multivariable Mendelian randomisation analysis. PLoS Med. 2020;17:e1003062. doi: 10.1371/JOURNAL.PMED.1003062.
- 27. Sinnott-Armstrong N., Tanigawa Y., Amar D., Mars N., Benner C., Aguirre M., Venkataraman G.R., Wainberg M., Ollila H.M., Kiiskinen T., et al. Genetics of 35 blood and urine biomarkers in the UK Biobank. Nat. Genet. 2021;53:185–194. doi: 10.1038/s41588-020-00757-z.
- 28. Graham S.E., Clarke S.L., Wu K.H.H., Kanoni S., Zajac G.J.M., Ramdas S., Surakka I., Ntalla I., Vedantam S., Winkler T.W., et al. The power of genetic diversity in genome-wide association studies of lipids. Nature. 2021;600:675–679. doi: 10.1038/S41586-021-04064-3.
- 29. Zenke S., Palm M.M., Braun J., Gavrilov A., Meiser P., Böttcher J.P., Beyersdorf N., Ehl S., Gerard A., Lämmermann T., et al. Quorum Regulation via Nested Antagonistic Feedback Circuits Mediated by the Receptors CD28 and CTLA-4 Confers Robustness to T Cell Population Dynamics. Immunity. 2020;52:313–327.e7. doi: 10.1016/J.IMMUNI.2020.01.018.
- 30. Mittelstraß K., Waldenberger M. DNA methylation in human lipid metabolism and related diseases. Curr. Opin. Lipidol. 2018;29:116–124. doi: 10.1097/MOL.0000000000000491.
- 31. Gomez-Alonso M.D.C., Kretschmer A., Wilson R., Pfeiffer L., Karhunen V., Seppälä I., Zhang W., Mittelstraß K., Wahl S., Matias-Garcia P.R., et al. DNA methylation and lipid metabolism: an EWAS of 226 metabolic measures. Clin. Epigenetics. 2021;13:7. doi: 10.1186/s13148-020-00957-8.
- 32. Gall W.E., Beebe K., Lawton K.A., Adam K.P., Mitchell M.W., Nakhle P.J., Ryals J.A., Milburn M.V., Nannipieri M., Camastra S., et al. α-Hydroxybutyrate Is an Early Biomarker of Insulin Resistance and Glucose Intolerance in a Nondiabetic Population. PLoS One. 2010;5:e10883. doi: 10.1371/JOURNAL.PONE.0010883.
- 33. Ferrannini E., Natali A., Camastra S., Nannipieri M., Mari A., Adam K.P., Milburn M.V., Kastenmüller G., Adamski J., Tuomi T., et al. Early Metabolic Markers of the Development of Dysglycemia and Type 2 Diabetes and Their Physiological Significance. Diabetes. 2013;62:1730–1737. doi: 10.2337/DB12-0707.
- 34. Van Der Harst P., Verweij N. Identification of 64 Novel Genetic Loci Provides an Expanded View on the Genetic Architecture of Coronary Artery Disease. Circ. Res. 2018;122:433–443. doi: 10.1161/CIRCRESAHA.117.312086.
- 35. Aragam K.G., Jiang T., Goel A., Kanoni S., Wolford B.N., Atri D.S., Weeks E.M., Wang M., Hindy G., Zhou W., et al. Discovery and systematic characterization of risk variants and genes for coronary artery disease in over a million participants. Nat. Genet. 2022;54:1803–1815. doi: 10.1038/S41588-022-01233-6.
- 36. Zhao W., Rasheed A., Tikkanen E., Lee J.J., Butterworth A.S., Howson J.M.M., Assimes T.L., Chowdhury R., Orho-Melander M., Damrauer S., et al. Identification of new susceptibility loci for type 2 diabetes and shared etiological pathways with coronary heart disease. Nat. Genet. 2017;49:1450–1457. doi: 10.1038/NG.3943.
- 37. Pulit S.L., Stoneman C., Morris A.P., Wood A.R., Glastonbury C.A., Tyrrell J., Yengo L., Ferreira T., Marouli E., Ji Y., et al. Meta-analysis of genome-wide association studies for body fat distribution in 694 649 individuals of European ancestry. Hum. Mol. Genet. 2019;28:166–174. doi: 10.1093/HMG/DDY327.
- 38. Justice A.E., Karaderi T., Highland H.M., Young K.L., Graff M., Lu Y., Turcot V., Auer P.L., Fine R.S., Guo X., et al. Protein-coding variants implicate novel genes related to lipid homeostasis contributing to body-fat distribution. Nat. Genet. 2019;51:452–469. doi: 10.1038/s41588-018-0334-2.
- 39. Martinez-Huenchullan S.F., Tam C.S., Ban L.A., Ehrenfeld-Slater P., Mclennan S.V., Twigg S.M. Skeletal muscle adiponectin induction in obesity and exercise. Metabolism. 2020;102:154008. doi: 10.1016/j.metabol.2019.154008.
- 40. Dastani Z., Hivert M.F., Timpson N., Perry J.R.B., Yuan X., Scott R.A., Henneman P., Heid I.M., Kizer J.R., Lyytikäinen L.P., et al. Novel loci for adiponectin levels and their influence on type 2 diabetes and metabolic traits: a multi-ethnic meta-analysis of 45,891 individuals. PLoS Genet. 2012;8:e1002607. doi: 10.1371/JOURNAL.PGEN.1002607.
- 41. Zheng J.S., Arnett D.K., Lee Y.C., Shen J., Parnell L.D., Smith C.E., Richardson K., Li D., Borecki I.B., Ordovás J.M., et al. Genome-wide contribution of genotype by environment interaction to variation of diabetes-related traits. PLoS One. 2013;8:e77442. doi: 10.1371/JOURNAL.PONE.0077442.
- 42. Kathiresan S., Willer C.J., Peloso G.M., Demissie S., Musunuru K., Schadt E.E., Kaplan L., Bennett D., Li Y., Tanaka T., et al. Common variants at 30 loci contribute to polygenic dyslipidemia. Nat. Genet. 2009;41:56–65. doi: 10.1038/NG.291.
- 43. Henkel N.D., Wu X., O’Donovan S.M., Devine E.A., Jiron J.M., Rowland L.M., Sarnyai Z., Ramsey A.J., Wen Z., Hahn M.K., et al. Schizophrenia: a disorder of broken brain bioenergetics. Mol. Psychiatry. 2022;27:2393–2404. doi: 10.1038/s41380-022-01494-x.
- 44. Vuckovic D., Bao E.L., Akbari P., Lareau C.A., Mousas A., Jiang T., Chen M.H., Raffield L.M., Tardaguila M., Huffman J.E., et al. The Polygenic and Monogenic Basis of Blood Traits and Diseases. Cell. 2020;182:1214–1231.e11. doi: 10.1016/j.cell.2020.08.008.
- 45. Chen M.H., Raffield L.M., Mousas A., Sakaue S., Huffman J.E., Moscati A., Trivedi B., Jiang T., Akbari P., Vuckovic D., et al. Trans-ethnic and Ancestry-Specific Blood-Cell Genetics in 746,667 Individuals from 5 Global Populations. Cell. 2020;182:1198–1213.e14. doi: 10.1016/j.cell.2020.06.045.
- 46. Vinik A.I., Erbas T., Park T.S., Nolan R., Pittenger G.L. Platelet Dysfunction in Type 2 Diabetes. Diabetes Care. 2001;24:1476–1485. doi: 10.2337/DIACARE.24.8.1476.
- 47. Brown B.C., Knowles D.A. Phenome-scale causal network discovery with bidirectional mediated Mendelian randomization. Preprint at bioRxiv. 2020. doi: 10.1101/2020.06.18.160176.
- 48. Brown B.C., Knowles D.A. Welch-weighted Egger regression reduces false positives due to correlated pleiotropy in Mendelian randomization. Am. J. Hum. Genet. 2021;108:2319–2335. doi: 10.1016/J.AJHG.2021.10.006.
- 49. Dempster A.P., Laird N.M., Rubin D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Methodol. 1977;39:1–22.
- 50. Parra L.C. Multiset Canonical Correlation Analysis simply explained. Preprint at arXiv. 2018. doi: 10.48550/arXiv.1802.03759.
- 51. Witten D.M., Tibshirani R.J. Extensions of Sparse Canonical Correlation Analysis with Applications to Genomic Data. Stat. Appl. Genet. Mol. Biol. 2009;8:Article28. doi: 10.2202/1544-6115.1470.
- 52. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., et al. Scikit-learn: Machine Learning in Python. Preprint at arXiv. 2011. doi: 10.48550/arXiv.1201.0490.
- 53. McFadden D. Frontiers in Econometrics. 1973. Conditional logit analysis of qualitative choice behavior; pp. 105–142.
- 54. Wu D., Smyth G.K. Camera: A competitive gene set test accounting for inter-gene correlation. Nucleic Acids Res. 2012;40:e133. doi: 10.1093/nar/gks461.
- 55. Chang C.C., Chow C.C., Tellier L.C., Vattikuti S., Purcell S.M., Lee J.J. Second-generation PLINK: Rising to the challenge of larger and richer datasets. GigaScience. 2015;4:7–16. doi: 10.1186/s13742-015-0047-8.
- 56. Bycroft C., Freeman C., Petkova D., Band G., Elliott L.T., Sharp K., Motyer A., Vukcevic D., Delaneau O., O’Connell J., et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z.
- 57. Kasela S., Aguet F., Kim-Hellmuth S., Brown B.C., Nachun D.C., Tracy R.P., Durda P., Liu Y., Taylor K.D., Johnson W.C., et al. Interaction molecular QTL mapping discovers cellular and environmental modifiers of genetic regulatory effects. Preprint at bioRxiv. 2022. doi: 10.1101/2023.06.26.546528. https://www.biorxiv.org/content/10.1101/2023.06.26.546528v1
- 58. Stilp A.M., Emery L.S., Broome J.G., Buth E.J., Khan A.T., Laurie C.A., Wang F.F., Wong Q., Chen D., D’Augustine C.M., et al. A System for Phenotype Harmonization in the National Heart, Lung, and Blood Institute Trans-Omics for Precision Medicine (TOPMed) Program. Am. J. Epidemiol. 2021;190:1977–1992. doi: 10.1093/aje/kwab115.