Abstract
Motivation
Integrative analysis of multi-omics data from different high-throughput experimental platforms provides valuable insight into regulatory mechanisms associated with complex diseases, and gains statistical power to detect markers that are otherwise overlooked by single-platform omics analysis. In practice, a significant portion of samples may not be measured completely due to insufficient tissues or restricted budget (e.g. gene expression profile are measured but not methylation). Current multi-omics integrative methods require complete data. A common practice is to ignore samples with any missing platform and perform complete case analysis, which leads to substantial loss of statistical power.
Methods
In this article, inspired by the popular Integrative Bayesian Analysis of Genomics data (iBAG), we propose a full Bayesian model that allows incorporation of samples with missing omics data.
Results
Simulation results show improvement of the new full Bayesian approach in terms of outcome prediction accuracy and feature selection performance when sample size is limited and proportion of missingness is large. When sample size is large or the proportion of missingness is low, incorporating samples with missingness may introduce extra inference uncertainty and generate worse prediction and feature selection performance. To determine whether and how to incorporate samples with missingness, we propose a self-learning cross-validation (CV) decision scheme. Simulations and a real application on child asthma dataset demonstrate superior performance of the CV decision scheme when various types of missing mechanisms are evaluated.
Availability and implementation
Freely available on the GitHub at https://github.com/CHPGenetics/FBM
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
Multi-level omics data refer to the combined molecular data in various types, for example, genome, transcriptome, methylome and proteome data, measured on a common cohort of patients. The availability of multi-level omics data in both magnitude and varieties poses challenges as well as opportunities to understand fundamental mechanisms of diseases and pathologies. Compared with separately discovering the association patterns between each omics data and phenotypes, an integrative framework that simultaneously integrates multiple omics data types will uncover more insightful regulatory machineries between different omics data and will hence deepen our understanding to hereditary and environmental causes in pathology (Richardson et al., 2016; Tseng et al., 2012, 2015).
In the literature, many strategies have emerged for integration of multi-omics data. To cluster samples for identifying unknown disease subtypes, integrative clustering (iCluster; Shen et al., 2009), Bayesian consensus clustering (Lock and Dunson, 2013), group structured integrative clustering (GS-iCluster; Kim et al., 2017) and integrative sparse K-means (IS-Kmeans; Huo et al., 2017) have been developed for integrative clustering of multi-omics data. For association and prediction modelling, Integrative Bayesian Analysis of Genomics (iBAG) (Wang et al., 2013) investigates association patterns of mRNA expressions and methylation with clinical outcome via a mechanistic model for associating methylation with gene expression and then a clinical model for direct association between expression and clinical outcome or via methylation. A Bayesian hierarchical model is then established for the inference. Such Bayesian hierarchical modelling has gained popularity in multi-omics integrative analysis due to its flexibility in model construction for complex regulatory structure, convenience to incorporate prior biological knowledge and advances in modern computing. In the association and prediction modelling, we usually focus on two key goals in the inference: firstly, to select predictive biomarkers for the phenotype and secondly, to predict clinical outcome from the selected biomarkers. As tens of thousands of features are available in omics data, the first goal of feature selection is essential in interrogating the biological and pathological mechanism of a targeted disease. The second goal could be of immediate clinical use, for example, assisting diagnosis of a disease, directing the best treatment decision and predicting drug response.
One obstacle for applying multi-omics integrative methods in real applications is missing data. Due to various reasons (e.g. limited budget, bad tissue quality or insufficient tissue amount), it is common that only partial samples have all omics data types (Voillet et al., 2016). For example, TCGA breast cancer study had 922 samples measured in methylation array, yet only 781 were measured in miRNA expression and 587 were measured for gene expression. Almost 40% of the samples are missing at least one type of omics data. To circumvent this pitfall, a naïve and convenient solution is by complete case (CC) analysis, where samples with any missing measurement are ignored. This approach results in dramatic decrease of sample size and thus decreases of statistical power, especially when more omics data types are combined. The shortcoming of CC in omics studies is recently noticed by statisticians, for example, Voillet et al. (2016) who developed a multiple imputation approach focusing on multiple factor analysis for multiple omics data. However, a unified framework serving the aforementioned purpose of feature selection, prediction as well as missingness handling is still lacking. In Bayesian methods for analysis of data with some predictors that are missing at random, marginal distribution of those predictors are usually modelled (Ibrahim et al., 2002; Little and Rubin, 2002). The data augmentation method is used to obtain the posterior distribution of parameters of interest (Tanner and Wong, 1987). In each iteration of the data augmentation procedure, missing values are imputed from the conditional distribution of those covariates given observed data under current parameters and model parameters are subsequently drawn from their posterior distribution calculated from the imputed dataset. In this article, we are motivated by the iBAG model combining mRNA expression methylation, clinical variables to predict a targeted continuous outcome. We propose a full Bayesian model with missingness (FBM) that allows iBAG to handle situations when partial samples are missing with mRNA expression or methylation. Extensive simulations and real applications demonstrated superior performance in feature selection and prediction accuracy of the new approach compared with naïve CC approach. The model can be extended to other omics data types or other targeted outcomes (e.g. binary or survival).
The article is structured as the following. In Section 2.1, we introduce the motivation, the original iBAG model and the proposed FBM. Section 2.2 discusses the inference of prediction and feature selection of FBM (Section 2.2.1) and evaluation benchmarks (Section 2.2.2). Section 3 contains extensive simulations to evaluate performance of FBM and Section 4 proposes a cross-validation (CV) decision scheme to determine whether and how to incorporate samples with missingness in FBM. Section 5 includes an application to a childhood asthma dataset with 460 individuals. Final conclusion and discussion are presented in Section 6.
2 Materials and methods
2.1 Motivation and the full Bayesian model with missingness
2.1.1 Motivation
IBAG is a two-layer Bayesian hierarchical model for vertical integrative analysis of multi-level omics data, assuming data are complete. However, in reality, a large proportion of missing data is commonly seen due to budget or limitation in tissue collection. Figure 1a gives an example of data structure with missingness. Suppose there are a total of N samples, indicates the clinical outcome of interest; C indicates clinical factors; indicates the gene expression with missingness, where is the missing indicator; likewise, indicates methylation data with missingness, where is the missing indicator. U = 1 indicates missing and U = 0 indicates observed. For example, number of samples with methylation level missing is . We assume for . The original iBag model takes only the complete data and is subject to loss of statistical power. To handle missingness, we propose a full Bayesian model with missingness inspired by iBAG model, to achieve three goals simultaneously: feature selection, prediction and missing data imputation. In Section 2.1.2, we briefly introduce the iBAG model. In Section 2.1.3, we propose our full Bayesian model for multi-omics integration with missingness imputation.
2.1.2 iBAG
In iBAG, the mechanistic model in the first layer assesses gene-methylation effect and divides gene expressions into two parts, the part regulated by methylation () and the part regulated by other mechanisms ().
where denotes the ‘gene-methylation’ effect that estimates the (conditional) effect of jth methylation feature on the kth gene.
The second layer called clinical model assesses the association between gene expression ( and ) and the phenotype:
where denotes the effect of the lth clinical factor on the clinical outcome Y. denotes the gene expression effect of on clinical outcome Y, called a type M effect. denotes the gene expression effect of on clinical outcome Y, called a type effect.
2.1.3 Full Bayesian model with missingness
Figure 1b gives a full representation of our model. Our model contains three parts: mechanistic model, clinical model and an imputation model derived from the previous two models. The mechanistic model and clinical model stem from the iBAG model introduced in Section 2.1.2. Here, we will focus on two novel parts added to the original iBAG model: the sparsity-induced spike-and-slab priors (Ishwaran and Rao, 2005) for feature selection in both mechanistic and clinical models, and an imputation model to deal with missing omics data.
There are a total of eight groups of parameters that need to be estimated: γM, , γC, Ω, ’s, , γM and are our parameters of interests. Investigators may also find Ωs important if they are interested in further inference on methylation-gene regulation. The corresponding Monte Carlo Markov Chain (MCMC) Gibbs sampler will be further discussed in Supplementary Appendix Section 2.
The original iBAG model placed a Laplace prior on γM and for shrinkage purpose, however, it does not set the effects to exact zeros. To conduct natural variable selection, we instead use a spike-and-slab prior to induce sparsity. In addition, we also perform feature selection in ω using the same spike-and-slab prior, considering that not all the methylation sites are regulating the gene expression. So we will have:
where and are binary indicators and represents a narrow spike and represents the wide slab. A Jeffery’s prior is put on , i.e.:
For and (Fig. 1b), we impose the following imputation model:
where MVN denotes the multivariate normal distribution and if the methylation value is already standardized with mean 0 and standard deviation 1. Note that here we assume the methylation data are in according to Du et al. (2010). If is to be used, we may need to replace the above prior with a truncated normal distribution bounded between 0 and 1. All other variables are given non-informative priors. Details can be found in the Supplementary Appendix Section 2.
Remarks:
Following Wang et al. (2013), we assume gene-gene independence and methylation-methylation independence. However, we are aware that this assumption deviates from real data, so we also propose a model with gene-gene and methylation-methylation correlations considered (see Supplementary Appendix Section 3). Albeit having the options, we will mainly discuss our model with independence assumptions in this article, for two reasons: first, the computational burden for estimating covariance matrices are high. Second, with moderate correlations in simulated data, considering dependency has limited gain in performance compared with model with independence assumption.
We assume that the methylations are many-to-one mapped to genes, i.e. only the methylation within the promoter region of the gene is mapped to the gene, and each methylation will only be mapped to one gene. The mapping is also assumed to be known from biology background knowledge in our model. The methylation level is centred around 0 if we are using m-value.
It is reasonable and necessary to assume that if for all the ωj where j are within the promoter region of gene k, we let , then we automatically have . That is, only when at least one methylation is selected.
Two strategy could be adopted to deal with collinearity issue between methylation sites. First, we proposed a random selection strategy based on pairwise Pearson correlation (see Supplementary Appendix Section 4). Second, one may conduct principal component analysis (PCA) on methylation sites, and use the first few PCs explaining large proportion of variation.
2.2 Inference and evaluation
The full Bayesian hierarchical model in the last section allows fast Gibbs sampler. Full conditional formula for iterative sampling are shown in Supplementary Appendix Section 2. The final parameter estimates are calculated by averaging stabilized MCMC iterations (i.e. removing the first Br burn-in period in MCMC iterations). The burn-in period Br is determined using Geweke’s convergence diagnostics (Geweke, 1992). Geweke diagnostics aims to test whether the first and last of the MCMC iterations have equal mean, and thus decide whether the samples are drawn from a stationary distribution. As suggested by Geweke, the first 10% of MCMC total iterations are taken as burn-ins once the MCMC chain pass the diagnosis.
2.2.1 Prediction and feature selection
The Bayesian integrative model generates two major inference outcomes: prediction and feature selection. For prediction, denote , and as the simulated parameter estimates from the b-th iteration. For a new sample with omics data (), we average prediction of y from the () stable MCMCs by , where . And RMSE will be calculated from those estimates:
(1) |
where the choice of N is discussed in Section 2.2.2.
Next, we summarize feature selection indicators and for gene k and the b-th MCMC to determine the set of genes predictive to outcome y. Given the b-th iteration, we define the selection indicator for gene k as so that gene k is selected if either the impact on outcome is through methylation () or not (). In other words, for and . To control FDR at gene level, we apply Bayesian false discovery rate (BFDR) proposed by Newton (2004):
where is the posterior probability of gene k belonging to null hypothesis H0 (i.e. gene k is not selected given observed data ), , and t is a tuning threshold. Given the definition of BFDR, the q-value of gene k can be defined as . This q-value will later be used for feature selection decision, which is comparable to frequentist approaches. Among selected genes, one may perform post hoc analysis to further investigate and and determine whether the impact of gene k to outcome is through methylation, non-methylation or both.
2.2.2 Benchmarks for evaluation
To evaluate the performance, the basic approach we compare with is the CC analysis. For full Bayesian model with missingness, we will also choose to impute gene expression only (FBMG), methylation only (FBMM) or both (FBMGM) when applicable. In simulation studies, we also perform analysis of the complete data (full) to examine the reduction of accuracy caused by missingness.
To benchmark performance of outcome prediction, we consider RMSE (Equation 1). In simulation studies, after parameter estimates are obtained we generate a new testing dataset with large sample size (N = 2000) and compute RMSE. In real data analysis, we perform 50-fold cross-validation (CV) on complete cases for RMSE evaluation. In each iteration, of complete-case samples are set aside as test dataset, all remaining data are used to perform CC and FBMs analyses. The parameter estimates are then applied to the test data for outcome prediction. After ten iterations, cross-validated RMSE can be evaluated on all complete-case samples with in Equation 1.
To evaluate feature selection performance, we order genes by q-value, plot receiver operating characteristic (ROC) curves and calculate area under curve (AUC) in simulations since the underlying true predictive genes are known. For real data, since we do not know the true features, we treat the gene selection result from full data analysis (full) as a surrogate of gold standard and compare gene selection from CC (or FBMs) to the full data analysis by tracing the top number of selected genes on the x-axis (e.g x = 100 top selected genes by CC and full) and the overlapped number from CC and full on the y-axis (Fig. 4b). Comparing the curves of CC and FBMs, a higher curve closer to the diagonal line shows more similar gene selection to ‘full’ and thus an indication of better performance.
3 Simulation studies
3.1 Simulation schemes
To evaluate performance of our full Bayesian model with missingness, we perform simulation based on data structure described in Section 2. Specifically, the clinical and methylation data matrices are simulated from N(0, 10) and N(0, 1) with , L = 2 and J = 2000. Each methylation site is randomly assigned to a gene, with the constraint that each gene contains at least one methylation site. In the mechanistic model, and for and , where the total number of genes K = 1000. We then simulate gene expression matrices from . In the clinical model, 10 genes are randomly selected to impact clinical outcome through methylation and 10 randomly impact not through methylation (i.e. The IM vector has 10 out of K genes equal one and the remaining are zero, and similarly for . The selected genes in IM and can possibly overlap). For selected genes in IM and , the corresponding and are set to 10. The coefficients for clinical data () are also set at 10 and to simulate clinical outcome Y.
After full multi-omics datasets are simulated, data with missingness are generated with of samples with missing gene expression data and another non-overlapping of samples with missing methylation data. We simulate three scenarios of missingness: (I) Missing only gene expression data with (α, β)=(0.1, 0), (0.2, 0) and (0.5, 0); (II) Missing only methylation data with (α, β)=(0, 0.1), (0, 0.2), (0, 0.5); (III) Non-overlapping samples missing either gene expression or methylation data with (α, β)=(0.1, 0.1), (0.2, 0.2), (0.3, 0.3). For Scenario I, we evaluate CC and FBMG approaches and compare with full. Similarly for Scenario II, we compare CC, FBMM and full. Finally for Scenario III, we compare CC, FBMG, FBMM, FBMGM and full. In this case, FBMG imputes gene expression but ignores samples with missing methylation and similarly, FBMM imputes methylation but ignores samples with missing gene expression. FBMGM utilizes all samples and imputes both gene expression and methylation.
3.2 Results
Figure 2 shows the outcome prediction performance by RMSE for all three scenarios. We first focus on small sample size situations . In Scenario I, CC, FBMG and full have similar performance when missing but FBMG clearly outperforms CC when missingness increases to 20% and 50%, showing the benefit of imputation as expected. In contrast, FBMM performs much worse than CC in Scenario II with methylation missingness and FBMM only slightly outperform CC when missingness increases to 50%. Results of Scenario III are consistent with results of Scenarios I and II. FBMM and FBMGM perform worse than CC at . FBMM, FBMG and FBMGM outperform CC at . It is worth noting that when sample size increases to N = 500, the data information is strong enough such that CC has performance similar to full. Imputation almost always create more data uncertainty and have worse performance than CC, especially since the majority of the imputed data are irrelevant to the clinical outcome.
We next examine feature selection performance by AUC values in Table 1 (ROC curves shown in Figure 3 and Supplementary Figure S1a–c). In Scenario I, FBMG always has higher AUC values than CC especially when missingness α increases to 25% or 30%. On the contrary, FBMM has lower AUC than CC except for N = 50, 100 in . The message becomes mixed in Scenario III as expected.
Table 1.
Scenario I |
Scenario II |
Scenario III |
||||||||
---|---|---|---|---|---|---|---|---|---|---|
Missing | N | Full | CC | FBMG | CC | FBMM | CC | FBMG | FBMGM | FBMM |
Lowa | 50 | 0.676 | 0.649 | 0.672 | 0.631 | 0.613 | 0.616 | 0.628 | 0.652 | 0.583 |
100 | 0.899 | 0.867 | 0.878 | 0.883 | 0.777 | 0.826 | 0.753 | 0.826 | 0.703 | |
200 | 0.967 | 0.976 | 0.983 | 0.961 | 0.863 | 0.946 | 0.859 | 0.913 | 0.789 | |
Meda | 50 | 0.676 | 0.619 | 0.665 | 0.626 | 0.597 | 0.574 | 0.61 | 0.646 | 0.528 |
100 | 0.899 | 0.813 | 0.862 | 0.807 | 0.774 | 0.762 | 0.745 | 0.826 | 0.699 | |
200 | 0.967 | 0.941 | 0.961 | 0.939 | 0.861 | 0.903 | 0.858 | 0.917 | 0.795 | |
Higha | 50 | 0.676 | 0.53 | 0.637 | 0.557 | 0.588 | 0.534 | 0.583 | 0.636 | 0.512 |
100 | 0.899 | 0.699 | 0.835 | 0.681 | 0.742 | 0.638 | 0.75 | 0.833 | 0.7 | |
200 | 0.967 | 0.892 | 0.964 | 0.88 | 0.814 | 0.828 | 0.782 | 0.883 | 0.765 |
Low, medium and high missing proportion for Scenario I and II are 10%, 25% and 50% for either α or β, respectively; and for Scenario III is 10%, 20% and 30% for both α and β.
We noticed that since methylation is the up-regulator of gene expression and indirectly impact the clinical outcome, imputing methylation is generally less effective than imputing gene expression. With increasing missing proportion, the benefit of imputation escalates. But when the missing proportion is small or sample size large, the uncertainty brought by imputation will overshadow the contribution from partially observed data (see Supplementary Appendix Section 5).
4 Impute or not: a decision scheme by cross-validation
From the mechanistic model in Figure 1b, gene expression and methylation data are not symmetric. Methylation data can be predictive to gene expression data and further predictive to outcomes. On the other hand, gene expression data are less predictive to methylation to help outcome prediction. As we have shown in simulations, imputations do not always improve prediction performance compared with CC analysis. Whether imputation would improve outcome prediction depends on the types of missingness (missing the upstream methylation data or missing the downstream gene expression data or both), the proportion of missingness and sample size. To guide the decision, we propose a self-learning CV scheme. Specifically, we apply 10-fold CV by leaving 10% of CC samples as the test set (for Scenario III N = 500, we apply 50-fold CV). We apply CC, FBMG, FBMM and FBMGM to the remaining training data, calculate parameter estimates and apply to test set. The procedure is repeated across all 10-folds and RMSE can be calculated on all test sets. The method with the smallest RMSE is selected to determine whether and how to impute. We conducted sensitivity analysis for this CV scheme (see Supplementary Appendix Section 1.2). We note that to evaluate performance of CV scheme in RMSE evaluation of outcome prediction in the Section 5, nested CV will be used (i.e. An outer loop of CV is used to evaluate RMSE and an inner loop of CV scheme for method selection is performed in each training set of the outer loop).
Figure 4 shows scatter plot of RMSE performance comparing FBMG and CC in Scenario I (Fig. 4a; missing gene expression) and FBMM and CC in Scenario II (Fig. 4b; missing methylation) in 50 independent simulations. In Scenario I, FBMG generally has smaller RMSE than CC in small sample sizes (N = 50, 100, 200). But for N = 500, RMSE of FBMG becomes slightly larger than CC on average. For missing methylation in Scenario II, FBMM performs better than CC at N = 50 but gradually becomes worse than CC at N = 100, 200 and 500. We applied the CV scheme in each simulation to determine whether imputation should be performed or not. Simulations shown by circles represent correct decisions (i.e. CV scheme decides to impute and the RMSE of imputation is indeed smaller than RMSE of CC or vice versa) and cross represents incorrect decision. The result shows universally high accuracy of CV scheme decision. When the decision is wrong, imputation and CC RMSEs are close to each other (near the diagonal line) and the incorrect decision only minimally impacts the outcome prediction. Figure 4c shows box plots of RMSE generated from different approaches (CC, FBMG, FBMM, FBMGM, CV) for Scenario III (). At N = 50, FBMG and FBMGM both perform well and CV generates similar small RMSE. When N increases to 500, FBMGM becomes much worse and CC performs slightly better than FBMG. Again, the CV scheme makes mostly correct decision and thus generates small RMSE close to the lowest. In conclusion, the simulation results indicate effectiveness of the CV scheme in determining the best strategy of whether and how to impute when encountering missingness in multi-omics data.
5 Real application
5.1 Data and approach
We apply the proposed full Bayesian model with missingness to 460 children asthma nasal epithelium samples obtained from asthma study at Children’s Hospital of Pittsburgh, with complete DNA methylation data from Illumina 450k chips and RNA-Seq gene expression data. All data were preprocessed with standard procedures and bioinformatics tools. We used M-value for methylation level for better model fitting. The RNA-seq gene expression counts were transformed to TPM (transcripts per million), a continuous value also more valid for the model assumptions. We filter out genes with small mean expression level (<130.41) or small standard deviation (<23.83) to obtain K = 1000 genes for the analysis. We then select methylation sites matched to these 1000 genes. Since some genes have many corresponding methylation sites, we perform PCA to identify the top eigen-methylation sites as input to the model. We take no more than three PCs per gene or less than three PCs that explain at least 75% variation. This generates J = 2619 eigen-methylation features for the analysis. The PCA analysis reduces redundant (highly correlated) information in methylation sites and independence of eigen-methylation features fits the model assumptions well. Similar to simulation, we generate three scenarios of missingness: (I) , (II) , (III) . For each scenario, we repeat 50 times.
Serum Immunoglobulin E (IgE) level is a primary clinical outcome in children asthma studies. We take log-transformed IgE level as our clinical outcome, and age and gender as clinical variables. Together with the gene expression and methylation PCs, we run our model with full data and three scenarios of missingness to compare CC approach and full Bayesian model with missingness (FBMG, FBMM and FBMGM).
5.2 Outcome prediction and feature selection
Table 2 shows the RMSE of outcome prediction from complete case analysis and FBMs. When 50% samples have missing gene expression (Scenario I), FBMG had reduced RMSE compared with CC (dropped from 43.34.19 to 36.04). In contrast, FBMM had inflated RMSE compared with CC when 50% of missing methylation (51.59 compared with 39.54 in Scenario II). When gene expression and methylation both missed 20% of samples (i.e. 40% of samples had missing values in Scenario III), FBMG had the smallest RMSE (38.71) compared with FBMGM (46.15), FBMM (54.23) and CC (39.09). Our CV scheme performed the automatic selection of whether and how to impute, and the RMSE was always close to the best (I: 36.61, II: 41.09 and III: 40.02). Figure 5a shows scatter plot or box plot of RMSEs of different methods in all three Scenarios. Similar to the simulation result, CV scheme mostly selected the best method and the mistakes were near the diagonal line with little predictive impact.
Table 2.
Methods | Scenario I | Scenario II | Scenario III |
---|---|---|---|
CC | 43.34 (5.33) | 39.54 (6.85) | 39.09 (5.43) |
FBMG | 36.04 (4.73) | 38.71 (5.73) | |
FBMGM | 46.15 (4.74) | ||
FBMM | 51.59 (6.71) | 54.23 (4.50) | |
CV | 36.61 (3.78) | 41.09 (6.17) | 40.62 (7.36) |
Unlike in simulation, no underlying truth is available for real data and thus calculation of AUC is not possible. Figure 5b treats the predicted outcome from full data as the surrogate of underlying truth and compare feature selection from each method with the surrogate (i.e. x-axis shows the same number of features selected by the designated method and full and y-axis demonstrates the overlap between the two). A curve with higher overlap shows better similarity of feature selection with the surrogate, an indication of better performance. Similar to the RMSE result, FBMG performed better than CC in Scenario I, CC better than FBMM in Scenario II, and FBMG and CC performed better than FBMM and FBMGM in Scenario III. CV scheme performs close to the best. All results from this real application are largely aligned with our observations in simulation studies.
We next investigate whether feature selection from the imputation methods or method selected by CV scheme represent better functional annotation in a biological sense. We obtained the top 200 genes from feature selection of each method by posterior probability order and then performed pathway enrichment analysis by one-sided Fisher’s exact test. We collected 2467 pathways from four pathway databases (KEGG, Reactome, Biocarta and GO) with the restriction of pathway size between 10 and 500. The enrichment p-values were then adjusted for multiple comparisons by Benjamini–Hochberg procedure. We found 27 enriched pathways from the full data analysis under FDR = 1%. We used these 27 pathways as a surrogate of gold standard to benchmark functional annotation performance of each method. Figure 6 shows box-plots of the minus log-transformed p-values from pathway enrichment of the 27 pathways in each method. As expected, FBMG had better p-value significance distribution than CC in Scenario I. In Scenario II, CC performed better than FBMM. FBMGM and FBMG outperformed CC in Scenario III. The CV scheme automatically determined whether and how to impute, and it always perform close to the best in each scenario.
6 Discussion
Integrative analysis of multi-level omics data brings unique insights to the modulating relationship between different types of omics data. Feature selection and model prediction are two important goals in multi-omics integration, which empowers discovery of disease-associated biomarkers, survival prediction and risk assessment. Several methods have been developed to fulfil these goals, including iBAG using a two-layer Bayesian hierarchical model to discover both association between genes and clinical outcome, and that between gene and upstream regulators. However, none of these methods are able to handle the potential large proportion of missing data in the data integration. In this article, we propose a full Bayesian model with missingness (FBM) inspired by iBAG model, to jointly perform feature selection, model prediction and missing data incorporation. In addition to the mechanistic model and clinical model originally proposed by iBAG, FBM includes a third layer of missingness model to incorporate samples with missingness. The flexibility of Bayesian hierarchical modelling and Gibbs sampler technique enables us to jointly model association among data and infer parameters in all three layers of models together. The Laplace (double exponential) prior initially used in iBAG could not realize exact feature selection. FBM applied spike-and-slab prior for more effective feature selection and allows Bayesian FDR control. We demonstrated outcome prediction and feature selection performance of FBM using extensive simulations.
From the extensive simulations, we then further realized that imputation is not always favoured over complete case analysis. For example, in high dimensional genomic data analysis, when missingness exists in upstream regulators (i.e. methylation), uncertainty will be increased to impair final inference. When the missing proportion is relatively small and the sample size is large (i.e. signal is strong), CC analysis also often outperforms FBM. To decide the best handling of data with missingness, we proposed a self-learning cross-validation decision scheme. Previously, we have developed a similar self-training selection scheme to select the best microarray missing value imputation method and its downstream biological impact (Brock et al., 2008; Oh et al., 2011). In both simulation and the childhood asthma application, we showed superior performance of the CV scheme in prediction outcome and feature selection.
While Bayesian hierarchical model allows complex parameter structures, it also comes with a computational cost when using conventional inference approaches such as Metropolis–Hastings or its special case, Gibbs sampling. In FBM, fast Gibbs sampling was applicable using conjugate priors and the convergence was generally fast (B = 2000). We have optimized the R code using C++ with Rcpp package. The computing takes 90 min for a reasonably large dataset with N = 500 samples, K = 1000 genes, J = 2000 methylations and B = 2000 MCMC iterations using a regular computer with 1 Intel Xeon CPU (2.40 GHz). Since the computation burden grows linearly with sample size, feature size and number of iterations, applying FBM with parallel computing could be a solution to substantially speed up computing and allow routine omics applications. An R package, data and source code to replicate all results in this article are available on GitHub (https://github.com/CHPGenetics/FBM).
Funding
This work was supported by the National Institutes of Health [RO1 CA190766 to Z.F., T.M., L.Z., and G.C.T.’s, RO1 MD011764 and RO1 HL117191 to W.C. and J.C.C.’s].
Conflict of Interest: none declared.
Supplementary Material
References
- Brock G.N. et al. (2008) Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes. BMC Biostatistics, 9, 12. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Geweke J. (1992) Evaluating the accuracy of sampling-based approaches to calculating posterior moments In: Bernado J.M.et al. (ed.) Bayesian Statistics 4. Oxford University Press, New York, pp. 169–193. [Google Scholar]
- Huo Z., Tseng G.C. (2017) Integrative sparse K-means with overlapping group lasso in genomic applications for disease subtype discovery. Ann. Appl. Stat., 11, 1011–1039. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ibrahim J.G. et al. (2002) Bayesian methods for generalized linear models with covariates missing at random. Can. J. Stat., 30, 55–78. [Google Scholar]
- Ishwaran H., Rao J.S. (2005) Spike and slab variable selection: frequentist and Bayesian strategies. Ann. Stat., 33, 730–773. [Google Scholar]
- Kim S. et al. (2017) Integrative clustering of multi-level omics data for disease subtype discovery using sequential double regularization. Biostatistics, 18, 165–179. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Little R.J.A., Rubin D.B. (2002) Statistical Analysis with Missing Data. 2nd edn. Wiley, New York. [Google Scholar]
- Lock E.F., Dunson D.B. (2013) Bayesian consensus clustering. Bioinformatics, 29, 2610–2616. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Newton M.A. (2004) Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, 5, 155–176. [DOI] [PubMed] [Google Scholar]
- Oh S. et al. (2011) Biological impact of missing-value imputation on downstream analyses of gene expression profiles. Biostatistics, 27, 78–86. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Du P. et al. (2010) Comparison of beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics, 11, 587.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Richardson S. et al. (2016) Statistical methods in integrative genomics. Annu. Rev. Stat. Appl., 3, 181–209. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shen R. et al. (2009) Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics, 25, 2906–2912. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tanner M.A., Wong W.H. (1987) The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc., 82, 528–540. [Google Scholar]
- Tseng G.C. et al. (2012) Comprehensive literature review and statistical considerations for microarray meta-analysis. Nucleic Acids Res., 40, 3785–3799. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tseng G.C. et al. (2015) Integrating Omics Data. Cambridge University Press, New York. [Google Scholar]
- Voillet V. et al. (2016) Handling missing rows in multi-omics data integration: multiple imputation in multiple factor analysis framework. BMC Bioinformatics, 17, 402. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang W. et al. (2013) iBAG: integrative Bayesian analysis of high-dimensional multiplatform genomics data. Bioinformatics, 29, 149–159. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.