Abstract
Background
Block-wise missingness is a common challenge in multi-omics data, hindering the development of robust and generalizable machine learning models, as real-world cohorts rarely contain complete omic profiles. Many current methods either discard incomplete samples, use available-case models that need retraining when faced with new missingness patterns, or depend on full-dataset imputation, which can risk biological integrity and model stability.
Methods
Using a complete four-omics breast cancer dataset (705 patients, 1,937 features), up to 60% block-wise missingness was simulated across five clinically relevant scenarios and used to compare four strategies for handling missing data: an Imputation-Based model, Dynamic and Exhaustive Available-Case approaches, and the proposed Hybrid Approach that combines profile-guided modeling with selective, test-time imputation. Performance was evaluated using accuracy, F1 score, balanced accuracy, inference time, and variability across 15 random seeds, with significance assessed using the Wilcoxon signed-rank test.
Results
The Hybrid Approach consistently achieved the strongest and most stable performance. Relative to the complete-data baseline, it reached an average accuracy of 103.7%, F1 score of 123.3%, and balanced accuracy of 104.8%, outperforming the Imputation-Based method and matching or exceeding both Dynamic and Exhaustive Available-Case strategies. Statistical testing confirmed that these improvements were significant. The method also demonstrated fast and predictable inference (~ 2 s) and an average total runtime of ~ 49 s per configuration—nearly three times faster than the Exhaustive approach (~ 124 s)—while maintaining high reproducibility and low variance across seeds, a key indicator of computational stability.
Conclusion
By selectively combining lightweight imputation with profile-specific modeling, the Hybrid Approach provides a computationally efficient and statistically robust solution for block-wise missing data. This framework offers a generalizable strategy for multi-omics data mining, and lays the foundation for future systems incorporating cross-profile learning and advanced imputation.
Keywords: Block-wise missingness, Multi-omics integration, Imputation, Available-case modeling, Deep learning, Hybrid Approach, Multi-omics data mining
Background
The integration of multi-omics data has emerged as a powerful approach for uncovering complex biological mechanisms and improving disease prediction, diagnosis, and treatment strategies. Given the inherently complex and hierarchical nature of multi-omics data, advanced computational methods, including network-based analyses, are increasingly necessary to encode and interpret the relationships between omics layers [1]. Furthermore, recent reviews highlight the growing reliance on deep learning and machine learning to effectively integrate these high-dimensional datasets while emphasizing that handling incomplete or missing modalities remains an unresolved challenge for multi-omics modeling [2]. However, although these models show great promise, they cannot be reliably deployed unless they remain robust when certain omics are incomplete.
Despite the growing availability of multi-omics datasets, missing data remains a critical challenge, since patients are usually represented in some but not all omics layers due to cost constraints, technical limitations, or experimental considerations [3]. A very common pattern of missingness in multi-omics datasets is block-wise missingness, which refers to a structured pattern of missing data where an entire block (omic) of related features is missing together, not just random individual values [4]. For example, a multi-omics dataset might include genomics, transcriptomics, and proteomics data, but some samples or subjects may be missing one or more of these data types. In real-world scenarios, it is common for certain omics layers to be entirely missing for subsets of samples, leading to difficulties in integration and downstream analysis.
According to a recent comprehensive review on missing multi-omics data [5], existing methods for handling block-wise missing multi-omics data can be broadly classified into three main categories. First, naive approaches discard incomplete samples or train models on a single omics block shared across datasets. These strategies are impractical when most patients lack one or more omics, since they lead to heavy information loss. Second, Imputation-Based approaches aim to reconstruct missing values using general techniques. These approaches are effective for randomly scattered missing values; however, under block-wise missingness, where large portions of entire feature groups are absent, they can produce inaccurate estimates that distort the data distribution and harm model performance [6]. Third, available-case approaches avoid imputation altogether by training separate models based on the missingness structure of the datasets. Although these approaches preserve biological integrity by not imputing missing data, they often must be retrained for each new missingness pattern in the test data, which can be computationally expensive in practice.
In this paper, we propose a Hybrid Approach for patient-level outcome prediction and prognostic modeling under block-wise missing multi-omics data, combining elements of available-case modeling and imputation-based strategies. Our approach constructs specialized prediction models based on the missingness patterns observed in the training dataset, allowing maximal information to be extracted while preserving the biological integrity of the data by minimizing reliance on imputation. Imputation is applied only at the testing stage, and solely to align a given test sample with one or more of the pre-trained models. This design eliminates the need to retrain models when encountering new missingness patterns in the test data, offering a flexible solution for block-wise missingness in multi-omics datasets and improving the robustness of downstream predictive modeling.
To evaluate our approach, we designed an experiment to compare it against three representative approaches for handling missing data. These include a full imputation approach and two available-case approaches: one that trains models only for the missing patterns present in the dataset, and another that builds models for all possible missingness patterns. The proposed Hybrid Approach is then assessed alongside these approaches to demonstrate its effectiveness.
This study introduces a novel Hybrid Framework, demonstrating the feasibility and significant computational advantages of combining available-case approaches with selective imputation. Since the current implementation prioritizes conceptual clarity over optimization, relatively simple models and a basic approach to available-case data categorization are used. However, we acknowledge that incorporating more advanced modeling techniques, refined imputation methods, and shared learning across profiles could further enhance predictive performance.
The remainder of this paper is organized as follows: Sect. “Related work” reviews related work on multi-omics missing data handling. Section “Materials and Methods” describes the datasets, experimental design, baseline methods, and our proposed approach. Section “Results” presents the comparative results across various missingness scenarios. Section “Discussion” discusses the findings, limitations, and future directions. Finally, Sect. “Conclusion” concludes the paper.
Related work
Block-wise missing data remains a major challenge in multi-omics integration. Although many predictive modeling strategies have been developed, only a few offer explicit support for this type of missingness [7]. In this section, we summarize recent studies that address block-wise missing data through either Imputation-Based techniques or available-case approaches.
Many conventional Imputation-Based approaches are not optimized for block-wise missing data. One popular exception is the TOBMI kNN method [8], which imputes missing RNA-seq data using information from DNA methylation data: it identifies k "donor" samples via Mahalanobis distance over the methylation features, then imputes missing gene expression values as a weighted average of these donors. Although effective for cross-omics reconstruction, TOBMI does not explicitly address cases where entire omics blocks are missing.
Xue and Qu [9] handle block-wise missing data by grouping samples with the same missing patterns and performing multiple imputations within each group. The models are trained on each imputed dataset and the results are combined for better predictions. On the other hand, the Priority-LASSO-impute method [10] handles missing data by giving priority to different omics blocks based on their importance. It builds prediction models step-by-step, starting with the most important block. Instead of directly filling in missing data, it estimates missing parts based on information from higher-priority blocks.
A clear limitation of imputation approaches is that increased sophistication, especially when performing multiple imputation, often leads to significantly higher computational demands [5]. Moreover, because imputation techniques rely on model-based or random predictions that could vary between analyses, reproducibility might decrease and the biological integrity in the data may also be compromised. Studies have shown that the choice of imputation strategy can significantly influence downstream results and model stability [11], highlighting the need for cautious application of imputation or the use of alternative strategies [12].
To avoid imputation, the concept of profiles, in which datasets are partitioned according to the availability of omics, was introduced. For example, if three omics A, B, and C are available, one group could have complete data for A and B, another complete data for B and C, and so on. An early study [13] proposed an ensemble-based classification framework to handle block-wise missing features by training classifiers on subsets of available modalities and combining their outputs via weighted aggregation.
Similarly, the iMSF model [14] splits the data into separate, non-overlapping groups based on which sources are available and trains a different model for each group. It does not allow sharing between groups, in contrast to the improved iSFS model [15], which allows overlapping sources across groups and learns shared feature weights. The method assigns weights to each omic according to its relevance to the prediction task, prioritizing the most informative sources with higher weights, while sources with little or no contribution receive zero weight and are effectively excluded from the model.
To classify asthma outcomes in children, authors in [16] train individual models for each omics block independently, using all samples containing that block. Predictions for a test sample are obtained by combining outputs from the models corresponding to the blocks present, weighted according to cross-validated AUC values.
Another approach using the profile concept is presented in [17] where samples are categorized into profiles and a unified global model is trained by learning shared source-specific parameters and profile-specific weights, allowing efficient use of all available data without imputation. This framework was extended in [18] for multi-class classification support in addition to the previously supported binary and continuous response types.
By directly addressing missing data without imputation, the previously mentioned methods preserve the biological integrity of the data. However, they often face a choice between retraining whenever an unseen missingness pattern appears, pretraining every possible block combination, which is computationally expensive [5], or simply ignoring the missing omic at test time, thereby discarding information that the already-trained models could still exploit.
A separate but related class of methods leverages advanced machine learning for either sophisticated imputation or latent-factor-based integration. For instance, deep learning–based imputers such as GAIN and Deep Matrix Factorization (DeepMF) aim to reconstruct missing values with high fidelity by modeling complex data distributions [19, 20]. Similarly, integration frameworks like Multi-Omics Factor Analysis (MOFA2) and CustOmics identify shared and distinct sources of variation across omics layers, often by projecting data into a lower-dimensional latent space [21, 22]. While these methods are powerful for data completion or representation learning, they rely on a single unified prediction or reconstruction model applied to fully imputed or latent data representations. They are not designed to address test-time block-wise missingness or to leverage profile-specific ensembles for prediction.
In contrast, our Hybrid Approach maintains biological data integrity by combining the profile concept with selective, lightweight imputation. Separate models are trained for each observed missingness pattern, preserving the authenticity of available data without unnecessary completion. At test time, when a novel missing pattern appears, minimal imputation is applied solely to align the incomplete sample with one or more pre-trained available-case models. This routing mechanism enables accurate predictions across unseen configurations while avoiding full retraining, ensuring both computational efficiency and biological validity.
Materials and methods
Dataset
Most existing studies on block-wise missing data focus on datasets containing three omics. To evaluate the applicability of our approach to more complex multi-omics scenarios, a dataset containing four omics layers was selected: somatic mutation (mu), copy number variation (cn), RNA sequencing (rs), and phospho-protein expression (pp). The dataset includes 705 breast cancer patients (611 alive, 94 deceased) and 1,937 features distributed across four omics layers: RNA sequencing (604 features), copy number variation (860 features), somatic mutation (249 features), and phospho-protein expression (223 features), with a target variable indicating vital status (survived or died). Survival is inherently a time-to-event outcome; however, for this methodological comparison, we follow standard practice in bioinformatics by using the binary outcome provided in the dataset to enable consistent evaluation across missing-data handling strategies. The original dataset is complete with no missing values, allowing us to simulate controlled missingness scenarios. The dataset used in this study was obtained from a publicly available source [23].
The dataset was split into training and testing subsets using a stratified sampling method that preserved the original class distribution. 5% of the total patient group was set aside for external validation, ensuring proportional representation of outcome classes and enabling fair evaluation of predictive performance under clinically relevant class imbalance. Controlled patterns of missing data were then introduced separately into the training and testing subsets. Additionally, unchanged versions of the training and testing sets were kept as complete reference datasets for control comparison.
Because missing-data handling methods may interact differently with high-dimensional omics features, using a fully complete dataset allowed methodological effects to be isolated without introducing confounding sources of missingness.
Missing data scenarios
For the experiments, diverse block-wise missing-data patterns were introduced, and each approach was evaluated under five distinct missingness scenarios, illustrated in Fig. 1.
Fig. 1.
Missingness heatmaps for training and testing sets across five scenarios. Blue = missing; white = observed. Rows: samples; columns: features
Same Patterns: Identical missing data patterns were applied to both the training and external testing datasets. This illustrates an ideal scenario in which the model faces identical missing data patterns during both training and testing. Success in this context suggests good learning under predictable data missingness; however, this might not capture the variability found in real applications.
Light Generalization: Most missing-data patterns present in the training dataset were also found in the external testing dataset, though not all. This simulates situations where the model encounters a small number of previously unseen patterns during testing. The aim is to assess whether the model can adapt to mildly unfamiliar missingness structures, reflecting real-world settings in which acquisition protocols or cohort characteristics introduce modest differences in modality availability.
High Generalization: Most of the missing data patterns in the training dataset were significantly different from those in the external testing dataset, making this scenario more challenging than the previous one. This reflects real situations where data collection methods vary, causing highly different missing data patterns. Performance under this condition is critical for evaluating an approach's robustness and ability to generalize beyond the patterns observed during development.
Missing Omic Simulation: A specific omics layer was consistently present in the training dataset but absent from a portion of the testing dataset. This scenario simulates situations where certain tests are not performed in follow-up studies or validation cohorts. The goal is to evaluate whether the predictive model can effectively use the remaining omics layers to compensate for the missing data. Furthermore, this scenario also offers insight into the relative predictive contribution of each omics layer: sustained performance suggests redundancy across modalities, whereas substantial degradation may indicate the critical importance of the omitted omic.
Training on Sparse Data, Testing on Rich Data: The training dataset exhibited heavy missingness, whereas the testing dataset contained more available omics layers, flipping the usual problem. The aim is to see whether training with sparse data affects the model's ability to use fully available data during testing. This setup is particularly relevant for longitudinal studies, where early-phase cohorts often have fewer modalities than later ones due to evolving technologies or funding differences.
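The block-wise masking underlying all five scenarios can be sketched with a small helper. This is a hypothetical simulator, not the authors' code: `induce_block_missingness`, `block_slices`, and `drop_plan` are illustrative names, and the actual per-scenario sampling of which samples lose which blocks is not shown.

```python
import numpy as np

def induce_block_missingness(X, block_slices, drop_plan):
    """Blank out whole omics blocks for selected samples.

    X           : 2-D float array (samples x features)
    block_slices: dict mapping omic name -> slice of its feature columns
    drop_plan   : iterable of (row_index, omic_name) pairs to set to NaN
    """
    X = X.astype(float).copy()          # leave the complete reference data intact
    for row, omic in drop_plan:
        X[row, block_slices[omic]] = np.nan
    return X

# Toy example: 3 samples, two 2-feature "omics" blocks.
X = np.arange(12, dtype=float).reshape(3, 4)
slices = {"rs": slice(0, 2), "cn": slice(2, 4)}
X_masked = induce_block_missingness(X, slices, [(0, "cn"), (2, "rs")])
```

Each scenario then differs only in how the `drop_plan` pairs are drawn for the training versus testing subsets.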
Data preprocessing and prediction model
A standardized preprocessing pipeline was used for all experimental scenarios to ensure consistency, reproducibility, and fair comparisons between models. To further enhance reliability, all experiments were repeated under 15 independent random seeds, and results were reported as mean ± standard deviation (SD) across these runs. To control for random variation within each seed, fixed random seeds were set across the operating system, NumPy, Python’s built-in random module, and TensorFlow. This multi-seed setup guaranteed both within-run reproducibility and across-run robustness, ensuring that any performance differences were due solely to the different approaches.
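The seed-pinning described above can be expressed as a small utility; the function name `set_global_seed` is illustrative, but the four RNG sources it fixes are the ones named in the text (OS hash seed, Python's random module, NumPy, and TensorFlow).

```python
import os
import random

import numpy as np
import tensorflow as tf

def set_global_seed(seed: int) -> None:
    """Pin every RNG the pipeline touches so a single run is reproducible."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

# 15 independent runs; results are then reported as mean +/- SD.
for seed in range(15):
    set_global_seed(seed)
    # ... run the full preprocessing/training/evaluation pipeline here ...
```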
The preprocessing stage began with min-max normalization [24] of the feature values, bringing all omics-derived inputs into a comparable scale. This choice avoids inflation of large-valued features and supports stable neural network training across heterogeneous omics layers. To ensure a fair evaluation, the data were first split using a stratified hold-out scheme, with 5% of the samples reserved for testing and the remaining data used for training. The dataset included 705 patients, with 611 labeled as alive and 94 as deceased, a significant class imbalance that risked biasing model learning toward the majority class. To address this, the training set was balanced by oversampling the minority class through random duplication until it matched the number of samples in the majority class. This produced a balanced training subset that improved sensitivity to the minority outcome and allowed the models to learn equally from both classes without sacrificing fairness or inflating test-set performance.
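The random-duplication oversampling step can be sketched as follows; `oversample_minority` is an illustrative helper, not the study's exact implementation.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Balance classes by randomly duplicating minority-class rows
    until every class matches the majority-class count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    keep = []
    for c in classes:
        members = np.flatnonzero(y == c)
        # Draw extra copies (with replacement) to top up to n_max.
        extra = rng.choice(members, size=n_max - members.size, replace=True)
        keep.extend(members)
        keep.extend(extra)
    keep = np.asarray(keep)
    return X[keep], y[keep]
```

Applied only to the training split, this leaves the held-out test set untouched and therefore does not inflate test-set performance.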
Initially, several machine learning methods were tested, such as Random Forest [25], XGBoost [26], and LightGBM [27]. However, a deep learning model using Long Short-Term Memory (LSTM) [28] consistently showed the best performance for this dataset. Although LSTM networks are traditionally used for sequential data, their gated architecture can capture complex nonlinear interactions across high-dimensional omics features, which contributed to superior empirical performance. All experiments were implemented in Python using TensorFlow 2.19 and Scikit-learn. The final architecture included stacked LSTM layers with 64, 32, and 16 units, respectively, with each LSTM layer followed by a Dropout layer set to 0.2 to mitigate overfitting. A sigmoid activation function was used for the output layer. In addition to the earlier oversampling performed during preprocessing, focal loss [29] with parameters gamma = 2.0 and alpha = 0.25 was chosen as the objective function during model training to further emphasize the harder-to-classify samples. The Adam optimizer [30] with a learning rate of 0.001 was used for efficient gradient descent. The model was trained for 100 epochs with a batch size of 16, and early stopping was employed to prevent overfitting.
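A minimal Keras sketch of the architecture described above follows. The 64/32/16 LSTM stack, 0.2 dropout, sigmoid output, focal loss (gamma = 2.0, alpha = 0.25), Adam with learning rate 0.001, batch size 16, and early stopping are taken from the text; reshaping each tabular sample into a length-1 sequence is our assumption about how the inputs were fed to the recurrent layers.

```python
import tensorflow as tf

def build_lstm(n_features: int) -> tf.keras.Model:
    """Stacked 64/32/16-unit LSTM, each followed by 0.2 dropout."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1, n_features)),   # length-1 "sequence"
        tf.keras.layers.LSTM(64, return_sequences=True),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.LSTM(32, return_sequences=True),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.LSTM(16),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        # Keras' built-in focal loss; alpha takes effect only when
        # apply_class_balancing=True.
        loss=tf.keras.losses.BinaryFocalCrossentropy(
            apply_class_balancing=True, alpha=0.25, gamma=2.0),
        metrics=["accuracy"],
    )
    return model

# Training call shape (illustrative):
# model.fit(X_train, y_train, epochs=100, batch_size=16,
#           callbacks=[tf.keras.callbacks.EarlyStopping(patience=10,
#                      restore_best_weights=True)])
```

The early-stopping `patience` value is an assumption; the paper does not report it.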
This carefully controlled and unified pipeline ensured that both intra-seed and inter-seed performance variations reflected true experimental differences rather than random effects or preprocessing inconsistencies. To facilitate full reproducibility, the exact software environment and package versions used in this study are provided in the accompanying repository.
Experimental design and comparison
Four experiments were conducted to compare the performance of different approaches for handling block-wise missingness in multi-omics data. Three of these experiments are based on foundational ideas from existing frameworks in the literature, excluding their more complex enhancements to allow a fair comparison with the proposed approach. The goal is to introduce a new fundamental strategy that, like the others, can later be extended or optimized.
Two of the experiments are available-case approaches, meaning they do not rely on imputation but instead work around the missingness structure. In Experiment 1, models are trained only for the specific missing patterns present in the training data. Although some studies do not address how the system should respond to previously unseen missing patterns in external data, this experiment addresses the issue by training a new model whenever such a pattern is encountered.
Experiment 2 takes a more exhaustive route by training models for all possible missingness patterns, specifically 2^S − 1 models, where S is the number of omics types. This guarantees that every potential missing pattern during testing has a pre-trained model available.
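The exhaustive enumeration is a one-liner over all non-empty availability patterns:

```python
from itertools import product

S = 4  # number of omics layers in the dataset
# Every non-empty availability pattern over S omics blocks.
profiles = [p for p in product((0, 1), repeat=S) if any(p)]
print(len(profiles))  # 2**S - 1 = 15 for four omics
```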
Experiment 3 follows the imputation approach, applying imputation to both the training and testing datasets. To reduce randomness in the comparison, the same imputation technique which is K-Nearest Neighbors (KNN) was used, where applicable, in Experiment 4 as well.
Experiment 4 showcases our proposed approach, blending aspects of both available-case and Imputation-Based strategies. Like Experiment 1, models are trained only on the missing patterns found in the training set. However, if an unfamiliar missingness pattern emerges during testing, we apply imputation to transform the input into a compatible form, aiming to preserve biological integrity while maintaining computational efficiency.
Finally, for benchmarking purposes, the LSTM model was also applied to the uncurated dataset that contains no induced missing patterns. This baseline provides a reference for the model’s performance under ideal (fully observed) conditions.
To ensure that performance differences are attributable solely to the missing-data handling strategy, rather than to variations in the underlying prediction model, all four approaches were implemented using the same base classifier (LSTM). This deliberate design isolates the conceptual contribution of each strategy, allowing for a fair, controlled comparison.
Accordingly, the study focuses on comparing the core logic of missing-data handling rather than benchmarking against existing state-of-the-art integrative frameworks, which typically differ in model architecture, optimization objectives, and feature-level integration. Such differences would confound the interpretation of results by blending algorithmic improvements with the effects of the missing-data strategy itself.
All experiments were conducted using a fixed LSTM architecture and a frozen software environment to ensure reproducibility. Although it is not feasible to exhaustively test every possible combination of training and testing configurations across the vast landscape of block-wise missingness, the selected scenarios were intended to reflect a wide and clinically relevant range of patterns. Although alternative configurations may yield different comparative outcomes, the reported results offer strong empirical support for the effectiveness and generalizability of the proposed Hybrid Approach.
An overview of the key characteristics of all four experimental approaches is summarized in Table 1.
Table 1.
Comparison of four experimental approaches for handling block-wise missingness in multi-omics data
| Aspect | Experiment 1(Dynamic Available-Case approach) | Experiment 2 (Exhaustive Available-Case approach) | Experiment 3 (Full Imputation) | Experiment 4 (Proposed Approach) |
|---|---|---|---|---|
| Training Patterns | Only observed patterns in training data | All 2^S − 1 possible patterns | All training data imputed | Only observed patterns in training data |
| Testing Pattern Handling | Train new model for unseen patterns | Always covered (pre-trained models) | Imputed | Impute unseen patterns |
| Imputation Used | No | No | KNN | KNN (for unseen patterns) |
| Imputation Scope | None | None | Full dataset (train + test) | Test set only (as needed) |
| Adaptivity | Reactive (train-on-demand) | Exhaustive upfront | Static | Balanced (available-case + imputation) |
| Computational Load | Moderate | High | Moderate | Moderate |
Available-case approaches: profile generation and data splitting
As outlined in the previous section, Experiments 1, 2, and 4 share a common initial step, where the dataset is partitioned based on the availability of omics data. Each resulting subset contains only samples with complete data for a specific combination of omics layers. This process follows the general concept of profiles used in [17], where each sample is assigned a binary vector indicating the presence (1) or absence (0) of each omics layer. For instance, a profile vector of [1, 0, 0, 1] corresponds to a sample that contains RNA sequencing (rs) and phospho-protein expression (pp) data but lacks copy number variation (cn) and somatic mutation (mu) data, reflecting the fixed order [rs, cn, mu, pp].
This profiling process is applied row by row to determine the missingness pattern of each sample. To ease data handling and grouping, each binary profile vector is converted to its corresponding decimal value. This enables efficient mapping of samples to subsets representing specific patterns of available omics data. In line with the approach described in [15], overlapping between subsets is permitted, meaning a single sample may belong to multiple subsets if its available omics satisfy the requirements of more than one profile. For example, a sample with the profile [1, 0, 1, 1] (decimal 11) contains data for rs, mu, and pp. It can therefore also be included in the subset corresponding to profile [1, 0, 1, 0] (decimal 10), as the required omics (rs and mu) are present. This strategy, which is described in Fig. 2, ensures maximum data utilization across subsets by leveraging every sample wherever it is applicable.
Fig. 2.
Pseudocode for creating subsets based on the profile concept
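The profile encoding and the overlapping-subset rule can be sketched as follows. The function names and the dict-of-arrays sample representation are illustrative; the fixed omics order [rs, cn, mu, pp] and the decimal coding follow the example in the text.

```python
import numpy as np

OMICS_ORDER = ["rs", "cn", "mu", "pp"]  # fixed order assumed in the paper

def profile_vector(sample: dict) -> list:
    """1 if the omics block is fully observed for this sample, else 0."""
    return [int(not np.isnan(sample[o]).any()) for o in OMICS_ORDER]

def profile_code(vec) -> int:
    """Binary profile vector -> decimal code (leftmost omic = MSB)."""
    return int("".join(map(str, vec)), 2)

def serves_profile(sample_code: int, subset_code: int) -> bool:
    """A sample may join a subset if it contains every omics block the
    subset requires (bitwise superset check)."""
    return (sample_code & subset_code) == subset_code
```

For example, a sample with profile [1, 0, 1, 1] (decimal 11) also serves the subset for [1, 0, 1, 0] (decimal 10), since both required omics are present.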
Imputation approach: KNN imputation method
As discussed in Sect. 3.4, Experiments 3 and 4 both incorporate imputation to varying extents. Experiment 3 relies entirely on imputed data, while Experiment 4 (our proposed approach) applies imputation selectively at test time, only when necessary. To ensure fairness, both use the same imputation strategy.
The imputation technique is based on a two-step K-Nearest Neighbors (KNN) approach [31]. First, the standard KNN imputation technique was applied using the KNNImputer from scikit-learn, with k set to 10 neighbors and distance-based weighting. Before imputation, all features were scaled using min-max normalization to ensure equal weighting during distance calculations. This step estimates missing values by computing a weighted average of the ten most similar samples, where similarity is measured using Euclidean distance across the available features. This approach effectively captures the local structure of the dataset.
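The first step maps directly onto scikit-learn; the toy matrix below is a hypothetical stand-in for the omics feature table (the study used k = 10 neighbors, reduced here only because the toy data has few rows).

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the omics feature matrix (NaN = missing).
X = np.array([[0.10, 0.20, 0.30],
              [0.20, np.nan, 0.40],
              [0.15, 0.25, 0.35],
              [0.90, 0.80, np.nan]])

# Min-max scale first so all features contribute equally to the
# Euclidean distances; scikit-learn scalers ignore NaNs when fitting.
X_scaled = MinMaxScaler().fit_transform(X)

# Weighted-average imputation from the k nearest neighbors.
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_imputed = imputer.fit_transform(X_scaled)
```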
After this initial imputation, a local refinement step was applied at test time to improve biological plausibility. For each incomplete test sample, refinement was performed within the subset of samples associated with a compatible target profile. Similarity was computed using Euclidean distance only over the features originally observed in the test sample, ensuring that distances were based exclusively on shared, non-missing information.
The nearest profile-compatible reference sample was then identified, and only the omics blocks missing in the test sample but required by the target model were replaced using the corresponding values from this reference. This procedure limits imputation to the minimum necessary features while preserving consistency with observed multi-omics patterns.
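The refinement step described above can be sketched as a mask-based helper. This is a hypothetical reconstruction of the logic, with illustrative names: distances use only the test sample's observed features, and only the blocks the target model needs are copied from the nearest profile-compatible donor.

```python
import numpy as np

def refine_sample(test_row, reference_pool, observed, needed):
    """Copy only the features the target model requires but the test
    sample lacks, taking them from the nearest reference sample.

    test_row       : 1-D feature vector with NaNs for missing entries
    reference_pool : rows whose profiles satisfy the target model
    observed       : boolean mask of features observed in test_row
    needed         : boolean mask of features the target model requires
    """
    # Distance computed over shared, non-missing features only.
    d = np.linalg.norm(reference_pool[:, observed] - test_row[observed],
                       axis=1)
    donor = reference_pool[np.argmin(d)]
    out = test_row.copy()
    fill = needed & ~observed           # minimum necessary features
    out[fill] = donor[fill]
    return out
```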
Approaches for handling missing data patterns
Figure 3 presents a high-level overview of the four experimental pipelines, showing the steps that are shared across methods and indicating where the workflows diverge. The figure reflects the overall structure of the experiments, while the subsequent subsections (A–D) outline each approach in full detail.
Fig. 3.
High-level overview of the four experimental workflows, showing shared components and method-specific branching for Dynamic Available-Case, Exhaustive Available-Case, Imputation-Based, and Hybrid Approaches. The Hybrid workflow incorporates majority voting with an optional tie-resolution step
Experiment 1: dynamic available-case approach
Experiment 1 employs an available-case approach leveraging the profile concept described in Sect. 3.4A. For each patient sample, the presence of data across omics sources is encoded as a binary profile vector, which is then converted into a decimal code to define subsets of samples sharing identical data availability patterns. The original dataset is partitioned into these subsets accordingly.
Each subset undergoes preprocessing as outlined in Sect. 3.3, including data balancing via upsampling to mitigate class imbalance and feature normalization to standardize scale. The balanced data is then split into training and testing sets. For each subset, an LSTM-based deep learning model is built and trained specifically on the data available for that profile.
During evaluation, an external validation set is used. Each test sample is profiled similarly to determine which model should be used for prediction. If a test sample presents a previously unseen profile (missing pattern), a new subset is dynamically created, a new model is trained, and the sample is predicted accordingly.
Experiment 2: exhaustive available-case approach
Similar to Experiment 1, Experiment 2 adopts a profile-based logic. However, in contrast to Experiment 1, which formed subsets only for the missing patterns observed in the training set, this approach creates a subset for every possible missing data pattern given the number of omics sources. A separate model is trained for each of these subsets.
With four omics sources in the dataset, this results in 2⁴ − 1 = 15 distinct models, corresponding to every non-empty combination of available omics blocks.
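These availability patterns can be enumerated directly; the sketch below simply materializes every non-empty binary profile over four sources.

```python
from itertools import product

# Every non-empty availability pattern over four omics sources: 2**4 - 1 = 15.
patterns = [p for p in product([0, 1], repeat=4) if any(p)]
```

Each tuple corresponds to one subset (and one model) in the Exhaustive Available-Case design.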
Each subset undergoes preprocessing consistent with Sect. 3.3, including class balancing via upsampling and feature normalization. These prepared subsets are then used to train LSTM-based models tailored to their respective data availability patterns.
During testing, each sample is profiled to determine its omics configuration, and the corresponding pre-trained model is selected for inference. Because all possible missing patterns were accounted for during training, this approach ensures complete coverage, eliminating the possibility of encountering unseen profiles.
Experiment 3: imputation-based approach
Experiment 3 applies a full-data imputation strategy. The entire training dataset is imputed using the two-step KNN technique detailed in Sect. 3.4B. This fully imputed dataset then undergoes preprocessing steps described in Sect. 3.3, including class balancing through upsampling and feature normalization. The processed data is split into training and testing sets, and a single LSTM model is trained on the entire dataset.
During evaluation, test samples are imputed individually using the same KNN-driven strategy to recover missing values. These completed samples are then passed to the trained model for prediction.
Experiment 4: the proposed Hybrid Approach
The proposed approach follows the same initial steps as Experiment 1, where a model is trained for each subset that matches a missing data pattern present in the training dataset. The same preprocessing pipeline and model architecture are applied across all experiments to ensure fairness in comparison. However, the key distinction emerges during evaluation using the external validation set. Each test sample is first profiled to identify its omics data availability pattern. If the profile exactly matches one of the pre-trained models, the corresponding LSTM model is used directly for prediction. Unlike Experiment 1, which trains a new model specifically for every unseen pattern in the external dataset, our approach leverages refined imputation to integrate the test sample into compatible pre-trained models.
To achieve this, the system searches for all candidate models with compatible profiles, defined as those whose binary patterns “contain” the test sample’s pattern: for every ‘1’ in the test sample’s profile, the candidate profile must also have a ‘1’ in the same position. For example, if a test sample has a profile of 0101 (indicating that only the second and fourth omics are available), then candidate models with profiles such as 1111, 0111, or 0101 would be considered compatible. The goal is to take advantage of all the already-available data in that test sample without discarding it. In rare cases where no superset-compatible model exists, which can occur in the sparse-training/rich-testing scenario, the framework falls back to subset-compatible models whose profiles are strict subsets of the test sample’s available omics. In this case, no imputation is performed, and predictions are based on the shared observed omics.
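This containment test reduces to an element-wise comparison of binary profiles; the following sketch (with an illustrative helper name) checks superset compatibility for the 0101 example above.

```python
def is_superset_compatible(candidate, test_profile):
    """True when every omics block observed in the test profile ('1') is
    also covered ('1') at the same position in the candidate profile."""
    return all(c >= t for c, t in zip(candidate, test_profile))

# Test profile 0101: only the second and fourth omics are available.
test_profile = (0, 1, 0, 1)
compatible = [p for p in [(1, 1, 1, 1), (0, 1, 1, 1), (0, 1, 0, 1), (1, 0, 1, 1)]
              if is_superset_compatible(p, test_profile)]
# (1, 0, 1, 1) is rejected: it does not cover the second omics block.
```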
When imputation is applicable, the system selects the candidate model(s) with the smallest Hamming distance to the unseen profile, i.e., the most similar omics coverage, in order to minimize the imputation required for that sample. Since multiple models may have the same Hamming distance to the test profile, the test sample is imputed for each candidate model individually using the same two-step, distance-aware KNN imputation process employed in Experiment 3. In the default configuration, missing omics blocks are refined using values from a single most similar reference sample, which helps preserve biological coherence by ensuring that imputed blocks correspond to observed biological profiles.

As an extension, we also implemented a weighted-average KNN variant in which missing features are estimated using a distance-weighted aggregation over multiple nearest neighbors. This alternative was evaluated across all experimental scenarios and random seeds. While it produced comparable performance in most settings, a modest improvement was observed only under the high-generalization scenario, accompanied by a small increase in test-time cost. Given the limited and scenario-specific gains, the single-nearest refinement strategy is retained as the default throughout the manuscript, and the averaged KNN variant is provided as an optional extension in the released code.

This procedure yields multiple imputed versions of the test sample, each aligned with a different candidate model. Each imputed sample is then passed to its corresponding LSTM model for prediction, and the resulting predictions are aggregated using majority voting to produce the final prediction. An extended tie-breaking mechanism was also implemented for cases where the votes were evenly split: the resulting imputations were aggregated into a consensus representation, the consensus vector was mapped to the closest existing profile, and the prediction of that nearest trained model was used as the final output.
In practice, ties were extremely rare, and this extended tie-breaking mechanism did not yield measurable improvements in performance across any experiment or random seed. Therefore, all reported results use the primary majority-vote outcome. Nevertheless, the full tie-breaking extension is preserved in the codebase for methodological completeness and future development.
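The candidate-selection and voting logic can be sketched as follows, with an abstract `predict_fn` standing in for the per-candidate imputation-plus-LSTM step (names are illustrative, not from the released code):

```python
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def select_and_vote(test_profile, model_profiles, predict_fn):
    """Keep superset-compatible profiles, restrict to those at minimal
    Hamming distance (least imputation), predict with each candidate's
    imputed copy of the sample, and majority-vote the results."""
    candidates = [p for p in model_profiles
                  if all(c >= t for c, t in zip(p, test_profile))]
    best = min(hamming(p, test_profile) for p in candidates)
    nearest = [p for p in candidates if hamming(p, test_profile) == best]
    votes = [predict_fn(p) for p in nearest]
    return Counter(votes).most_common(1)[0][0]

# For profile 0101, profiles 0111 and 1101 are each one block away, so only
# they vote; 1111 (distance 2) is excluded from the minimal-distance set.
label = select_and_vote((0, 1, 0, 1),
                        [(1, 1, 1, 1), (0, 1, 1, 1), (1, 1, 0, 1)],
                        predict_fn=lambda p: int(sum(p) == 3))
```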
Alternative similarity measures such as Euclidean and Mahalanobis distances were also tested on a subset of validation samples. However, Hamming distance consistently yielded more stable alignment between profiles and better downstream classification accuracy, supporting its suitability for binary-pattern comparison in multi-omics contexts.
This dynamic strategy enables our approach to robustly handle unseen missing data patterns during testing by intelligently utilizing relevant, already-trained models. It avoids the need to retrain a new model for every unseen case, exhaustively train all possible models beforehand, or rely solely on integration techniques that might have unpredictable results and compromise the biological integrity of the dataset.
Results
To assess the performance of the proposed approach, results from the complete (fully observed) dataset are used as a baseline, representing the best attainable performance under our current modeling pipeline. All reported performance values are expressed relative to this complete-data baseline. In some scenarios, performance exceeds 100% of the baseline, reflecting cases where the removal of noisy or weakly informative omics blocks leads to improved generalization, rather than “better-than-perfect” prediction. This evaluation follows a conceptual baseline comparison; accordingly, the goal is not to maximize accuracy through extensive hyperparameter tuning, but to provide a fair and controlled comparison of different missing-data handling strategies under identical modeling conditions. All approaches employ the same LSTM architecture, preprocessing pipeline, and experimental protocol.
A comprehensive experimental setup was designed involving four approaches: (1) the Dynamic Available-Case approach, (2) the Exhaustive Available-Case approach, (3) the Imputation-Based approach, and (4) the proposed Hybrid Approach. The experiments span five missing-data scenarios (as defined in Sect. 3.2), each simulating a distinct real-world form of incomplete multi-omics profile. Within each scenario, three training/testing pattern configurations were defined, resulting in 15 configurations per approach.
To ensure robustness and reproducibility, every configuration was repeated under 15 independent random seeds, accounting for stochastic factors such as weight initialization, data partitioning, and imputation randomness. Consequently, each approach was evaluated over 5 scenarios × 3 configurations × 15 seeds = 225 runs, totaling 900 evaluations across all approaches. The complete-data baseline was likewise retrained under the same 15 seeds to obtain a stable reference point. All performance results are reported as mean ± standard deviation (SD) across these 15 runs. Detailed per-seed results are provided in the Supplementary Material and accompanying repository.
The following metrics were used to evaluate each approach. These are derived from the confusion matrix, which consists of:
TP (True Positives): Correctly predicted positive cases.
TN (True Negatives): Correctly predicted negative cases.
FP (False Positives): Negative cases incorrectly predicted as positive.
FN (False Negatives): Positive cases incorrectly predicted as negative.
Each metric is calculated as follows:
6. Accuracy: Measures the overall proportion of correctly classified instances.
   Accuracy = (TP + TN) / (TP + TN + FP + FN)
7. Balanced Accuracy: Averages the recall (sensitivity) for both classes to handle class imbalance. It reflects the method’s ability to avoid missing high-risk patients (deceased), which is essential for reliable prognostic modeling in oncology research.
   Balanced Accuracy = (TP / (TP + FN) + TN / (TN + FP)) / 2
8. F1 Score: The harmonic mean of precision and recall, emphasizing the balance between false positives and false negatives.
   F1 = 2 × Precision × Recall / (Precision + Recall), where Precision = TP / (TP + FP) and Recall = TP / (TP + FN)
9. Relative Performance: To assess the impact of missing data and the effectiveness of each approach, we express the performance metrics as a percentage of the baseline results obtained from the complete dataset. For each metric, the relative performance is calculated as:
   Relative Performance (%) = (Approach Metric / Baseline Metric) × 100
   This normalization allows a straightforward comparison across approaches, highlighting how close each approach gets to the ideal (complete-data) scenario.
10. Training and Testing Time: Average time taken for training and testing (reported separately and combined), computed over three repetitions per configuration to smooth out system variability.
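For reference, the confusion-matrix metrics and the baseline normalization can be computed as follows (a minimal sketch; function names are illustrative):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, balanced accuracy, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)      # recall on the positive (deceased) class
    specificity = tn / (tn + fp)      # recall on the negative class
    precision = tp / (tp + fp)
    return {
        "accuracy": accuracy,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

def relative_performance(metric, baseline):
    """Express a metric as a percentage of the complete-data baseline."""
    return 100.0 * metric / baseline

m = classification_metrics(tp=30, tn=50, fp=10, fn=10)
```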
Finally, to assess whether the observed performance differences across methods were statistically meaningful, non-parametric Wilcoxon signed-rank tests were applied to the paired results obtained from the 15 random seeds. This test was chosen because it does not assume normality and is well-suited for bounded metrics such as accuracy, F1 score, and balanced accuracy. Each comparison was performed between the proposed Hybrid Approach and the other three approaches (Dynamic, Exhaustive, and Imputation-Based) using matched seed indices to ensure paired-sample consistency. All tests were conducted both overall and per-scenario, and the resulting p-values were adjusted using the Holm–Bonferroni correction to account for multiple comparisons. Statistical significance was established at an adjusted p < 0.05.
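A minimal sketch of this testing procedure, assuming SciPy is available; the per-seed scores below are illustrative rather than the study’s actual values, and `holm_adjust` is a hypothetical helper implementing the step-down correction:

```python
import numpy as np
from scipy.stats import wilcoxon

def holm_adjust(pvalues):
    """Holm-Bonferroni step-down adjustment across m comparisons."""
    p = np.asarray(pvalues, dtype=float)
    order = np.argsort(p)
    adjusted = np.empty_like(p)
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k), enforce monotonicity.
        running_max = max(running_max, (len(p) - rank) * p[idx])
        adjusted[idx] = min(running_max, 1.0)
    return adjusted

# Illustrative paired per-seed accuracies for two methods (15 seeds):
rng = np.random.default_rng(0)
hybrid = 0.85 + rng.normal(0, 0.005, size=15)
imputation = hybrid - 0.02               # trails the Hybrid by ~2 points/seed
stat, p = wilcoxon(hybrid, imputation)   # paired, non-parametric
adjusted = holm_adjust([p, 0.04, 0.20])  # correct over three metric tests
```

Matching the two score vectors by seed index is what makes the test paired, mirroring the matched-seed comparisons reported in the tables below.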
Table 2 summarizes the overall evaluation results across the complete-data baseline and the four missing-data handling approaches. The reported metrics include mean ± standard deviation (SD) for accuracy, F1 score, and balanced accuracy, together with the average training, testing, and total runtimes. All performance values are also expressed as percentages relative to the baseline (complete data) to enable direct comparison. The adjusted p-values and corresponding effect sizes (r) are summarized in Table 3. Figure 4 visually compares the relative percentages of the performance metrics, while Fig. 5 illustrates the average training, testing, and overall times in seconds for each approach. The Hybrid Approach achieved the highest average relative accuracy, F1 score, and balanced accuracy among the evaluated missing-data strategies. Wilcoxon signed-rank tests further confirmed that these improvements were statistically significant over the Imputation-Based method in most scenarios, with medium-to-large effect sizes, while maintaining comparable or lower inference time.
Table 2.
Summary of overall performance (mean ± SD) across 15 random seeds in five missingness scenarios
| Approach | Accuracy | Accuracy (% of Baseline Mean) | F1 Score | F1 Score (% of Baseline Mean) | Balanced Accuracy | Balanced Acc. (% of Baseline Mean) | Avg Train Time (seconds) | Avg Test Time (seconds) | Avg Total Time (seconds) |
|---|---|---|---|---|---|---|---|---|---|
| Baseline (Complete Data) | 0.82 | 100.00% | 0.25 | 100.00% | 0.57 | 100.00% | -- | -- | -- |
| Dynamic Available-Case | 0.84 ± 0.02 | 102.51% | 0.31 ± 0.04 | 119.68% | 0.59 ± 0.02 | 103.91% | 47.97 ± 3.15 | 18.84 ± 13.03 | 66.81 ± 12.26 |
| Exhaustive Available-Case | 0.84 ± 0.02 | 102.51% | 0.31 ± 0.04 | 119.68% | 0.59 ± 0.02 | 103.91% | 122.11 ± 9.96 | 2.23 ± 0.50 | 124.33 ± 9.54 |
| Imputation-Based | 0.82 ± 0.02 | 99.77% | 0.26 ± 0.06 | 103.48% | 0.57 ± 0.03 | 100.48% | 41.44 ± 2.87 | 2.23 ± 0.24 | 43.67 ± 2.83 |
| Hybrid Approach | 0.85 ± 0.02 | 103.71% | 0.32 ± 0.03 | 123.31% | 0.60 ± 0.01 | 104.83% | 46.59 ± 1.57 | 2.18 ± 0.32 | 48.77 ± 1.51 |
Table 3.
Wilcoxon signed-rank test results for pairwise comparisons between the hybrid approach and other methods across 15 paired seeds
| Comparison | Metric | p-value | Effect size (r) | Holm-adjusted p |
|---|---|---|---|---|
| Hybrid vs. Dynamic | Accuracy | 6.99 × 10⁻⁹ | 0.30 | 1.40 × 10⁻⁸ |
| — | Balanced Accuracy | 9.37 × 10⁻⁴ | 0.17 | 1.87 × 10⁻³ |
| — | F1 Score | 4.76 × 10⁻³ | 0.15 | 9.53 × 10⁻³ |
| — | Train Time | 5.72 × 10⁻¹³ | −0.37 | 5.72 × 10⁻¹³ |
| — | Test Time | 4.78 × 10⁻⁵⁷ | −0.82 | 1.44 × 10⁻⁵⁶ |
| Hybrid vs. Exhaustive | Accuracy | 6.99 × 10⁻⁹ | 0.30 | 1.40 × 10⁻⁸ |
| — | Balanced Accuracy | 9.37 × 10⁻⁴ | 0.17 | 1.87 × 10⁻³ |
| — | F1 Score | 4.76 × 10⁻³ | 0.15 | 9.53 × 10⁻³ |
| — | Train Time | 3.32 × 10⁻⁶³ | −0.87 | 9.97 × 10⁻⁶³ |
| — | Test Time | 0.7305 | 0.02 | 0.7305 |
| Hybrid vs. Imputation | Accuracy | 3.58 × 10⁻⁴⁵ | 0.73 | 1.07 × 10⁻⁴⁴ |
| — | Balanced Accuracy | 2.29 × 10⁻²⁵ | 0.54 | 6.88 × 10⁻²⁵ |
| — | F1 Score | 8.76 × 10⁻²⁵ | 0.53 | 2.63 × 10⁻²⁴ |
| — | Train Time | 1.23 × 10⁻³⁹ | 0.68 | 2.46 × 10⁻³⁹ |
| — | Test Time | 4.35 × 10⁻³ | −0.15 | 8.69 × 10⁻³ |
Effect size (r) quantifies the magnitude of difference: small ≈ 0.1–0.3, medium ≈ 0.3–0.5, large > 0.5. A negative r indicates that the Hybrid Approach had lower values (e.g., shorter runtime) compared to the reference method, while a positive r means the Hybrid Approach had higher values for that metric (e.g., greater accuracy or F1 score)
Fig. 4.
Performance metrics as relative percentages for the baseline and missing data approaches
Fig. 5.
Average training, testing, and total runtime (seconds) per approach
Same patterns scenario
In this scenario, each configuration uses the same set of missing data patterns for both training and testing, allowing us to examine how model performance varies when there is perfect alignment between the distribution of missing data in training and evaluation. Three configurations were designed, as detailed in Table 4. The average performance across all configurations, aggregated over 15 random seeds, is reported in Table 5. Since the Dynamic, Exhaustive, and Hybrid approaches yield identical accuracy, balanced accuracy, and F1 score values in this setting, no p-values are reported for these metrics. However, their training and testing times differ due to variations in computational strategy; therefore, only the timing comparisons are included in Table 6, alongside the full Hybrid versus Imputation-Based results.
Table 4.
Training–testing pattern configurations for the Same-Pattern scenario, with distributions and rationale
| Configuration | Training Patterns | Train Distribution | Testing Patterns | Test Distribution | Rationale |
|---|---|---|---|---|---|
| 1 | 1101, 0111, 1011, 0110 | [0.10, 0.10, 0.10, 0.10] | 1101, 0111, 1011, 0110 | [0.10, 0.10, 0.10, 0.10] | This setup uses a uniform training distribution across four multi-omics patterns, serving as a controlled baseline to assess performance when pattern exposure is evenly balanced between training and testing. |
| 2 | 1110, 0110, 1011, 1001 | [0.10, 0.10, 0.10, 0.10] | 1110, 0110, 1011, 1001 | [0.10, 0.10, 0.10, 0.10] | A fully balanced configuration, where all patterns are equally represented during both training and testing. |
| 3 | 1110, 1011, 0001, 0111 | [0.10, 0.10, 0.05, 0.10] | 1110, 1011, 0001, 0111 | [0.10, 0.10, 0.025, 0.10] | This configuration includes the rare and extreme pattern 0001, where only one omic is observed, testing each approach’s ability to generalize from sparse and unique cases. |
Table 5.
Average absolute (mean ± SD) and relative performance across all three configurations in the Same-Pattern scenario
| Approach | Accuracy | Accuracy (% of Baseline Mean) | F1 Score | F1 Score (% of Baseline Mean) | Balanced Accuracy | Balanced Acc. (% of Baseline Mean) | Avg Train Time (seconds) | Avg Test Time (seconds) | Avg Total Time (seconds) |
|---|---|---|---|---|---|---|---|---|---|
| Dynamic Available-Case | 0.84 ± 0.02 | 102.76% | 0.29 ± 0.04 | 114.84% | 0.59 ± 0.02 | 102.65% | 51.94 ± 4.50 | 2.08 ± 0.17 | 54.02 ± 4.58 |
| Exhaustive Available-Case | 0.84 ± 0.02 | 102.76% | 0.29 ± 0.04 | 114.84% | 0.59 ± 0.02 | 102.65% | 128.21 ± 5.17 | 2.30 ± 0.10 | 130.50 ± 5.24 |
| Imputation-Based | 0.81 ± 0.02 | 98.10% | 0.23 ± 0.04 | 88.02% | 0.55 ± 0.02 | 97.06% | 40.58 ± 2.83 | 2.18 ± 0.13 | 42.76 ± 2.91 |
| Hybrid Approach | 0.84 ± 0.02 | 102.76% | 0.29 ± 0.04 | 114.84% | 0.59 ± 0.02 | 102.65% | 48.06 ± 2.18 | 1.91 ± 0.08 | 49.97 ± 2.23 |
Table 6.
Wilcoxon signed-rank test results comparing the hybrid vs. the other methods in the Same-Pattern scenario (15 paired seeds; Holm-adjusted p-values). Effect size r interpretation follows Table 3.
| Comparison | Metric | p-value | Effect size (r) | Holm-adjusted p |
|---|---|---|---|---|
| Hybrid vs. Dynamic | Accuracy | — | — | 7.72 × 10⁻³ |
| — | Balanced Accuracy | — | — | 1.61 × 10⁻² |
| — | F1 Score | — | — | 9.41 × 10⁻³ |
| — | Train Time | 6.71 × 10⁻³ | −0.70 | 6.71 × 10⁻³ |
| — | Test Time | 8.54 × 10⁻⁴ | −0.86 | 1.31 × 10⁻³ |
| Hybrid vs. Exhaustive | Accuracy | — | — | 7.72 × 10⁻³ |
| — | Balanced Accuracy | — | — | 1.61 × 10⁻² |
| — | F1 Score | — | — | 9.41 × 10⁻³ |
| — | Train Time | 6.10 × 10⁻⁵ | −1.04 | 1.83 × 10⁻⁴ |
| — | Test Time | 6.10 × 10⁻⁵ | −1.04 | 1.83 × 10⁻⁴ |
| Hybrid vs. Imputation | Accuracy | 2.57 × 10⁻³ | 0.78 | 7.72 × 10⁻³ |
| — | Balanced Accuracy | 5.38 × 10⁻³ | 0.72 | 1.61 × 10⁻² |
| — | F1 Score | 3.14 × 10⁻³ | 0.76 | 9.41 × 10⁻³ |
| — | Train Time | 6.10 × 10⁻⁵ | 1.04 | 1.83 × 10⁻⁴ |
| — | Test Time | 6.53 × 10⁻⁴ | −0.88 | 1.31 × 10⁻³ |
Light generalization scenario
In this scenario, the goal is to assess how well the models can adapt to patterns that are only partially new, specifically those that share some similarities with the training data. Each configuration outlined in Table 7 introduces subtle differences between the training and testing sets that are similar enough to remain comparable, but distinct enough to challenge generalization. The average performance across all configurations, aggregated over 15 random seeds, is reported in Table 8. The corresponding Wilcoxon signed-rank test results, including Holm-adjusted p-values and effect sizes, are summarized in Table 9.
Table 7.
Training–testing pattern configurations for the light generalization scenario, with distributions and rationale
| Configuration | Training Patterns | Train Distribution | Testing Patterns | Test Distribution | Rationale |
|---|---|---|---|---|---|
| 1 | 1011, 0111, 1101, 0110 | [0.10, 0.10, 0.10, 0.10] | 1101, 0101, 0111, 1011 | [0.10, 0.10, 0.10, 0.10] | Adds one unseen-but-similar pattern (0101) to gauge responsiveness to modest novelty. |
| 2 | 1110, 1101, 0111 | [0.10, 0.10, 0.10] | 1101, 0111, 1010, 0101 | [0.10, 0.10, 0.10, 0.10] | Introduces two novel partial-overlap patterns (1010, 0101) posing a mild generalization challenge. |
| 3 | 0101, 1011, 1100, 0001 | [0.10, 0.10, 0.10, 0.05] | 0101, 1011, 1010 | [0.10, 0.10, 0.20] | Tests generalization from sparse inputs via rare or extreme patterns. |
Table 8.
Average absolute (mean ± SD) and relative performance across all three configurations in the light generalization scenario
| Approach | Accuracy | Accuracy (% of Baseline Mean) | F1 Score | F1 Score (% of Baseline Mean) | Balanced Accuracy | Balanced Acc. (% of Baseline Mean) | Avg Train Time (seconds) | Avg Test Time (seconds) | Avg Total Time (seconds) |
|---|---|---|---|---|---|---|---|---|---|
| Dynamic Available-Case | 0.82 ± 0.02 | 100.51% | 0.28 ± 0.06 | 108.51% | 0.59 ± 0.03 | 102.40% | 48.89 ± 3.18 | 13.06 ± 0.66 | 61.95 ± 3.55 |
| Exhaustive Available-Case | 0.82 ± 0.02 | 100.51% | 0.28 ± 0.06 | 108.51% | 0.59 ± 0.03 | 102.40% | 129.31 ± 6.02 | 1.88 ± 0.07 | 131.20 ± 6.07 |
| Imputation-Based | 0.81 ± 0.03 | 98.32% | 0.22 ± 0.05 | 86.63% | 0.55 ± 0.02 | 96.66% | 38.59 ± 2.61 | 2.38 ± 0.15 | 40.96 ± 2.74 |
| Hybrid Approach | 0.83 ± 0.01 | 101.46% | 0.29 ± 0.05 | 111.72% | 0.59 ± 0.02 | 103.21% | 47.92 ± 5.94 | 2.09 ± 0.06 | 50.01 ± 5.96 |
Table 9.
Wilcoxon signed-rank test results for the light generalization scenario (15 paired seeds; Holm-adjusted p-values). Effect size r interpretation follows Table 3
| Comparison | Metric | p-value | Effect size (r) | Holm-adjusted p |
|---|---|---|---|---|
| Hybrid vs. Dynamic | Accuracy | 4.60 × 10⁻³ | 0.73 | 1.38 × 10⁻² |
| — | Balanced Accuracy | 1.33 × 10⁻² | 0.64 | 2.67 × 10⁻² |
| — | F1 Score | 4.65 × 10⁻³ | 0.73 | 1.39 × 10⁻² |
| — | Train Time | 9.46 × 10⁻² | −0.43 | 9.46 × 10⁻² |
| — | Test Time | 6.10 × 10⁻⁵ | −1.04 | 1.83 × 10⁻⁴ |
| Hybrid vs. Exhaustive | Accuracy | 4.60 × 10⁻³ | 0.73 | 1.38 × 10⁻² |
| — | Balanced Accuracy | 1.33 × 10⁻² | 0.64 | 2.67 × 10⁻² |
| — | F1 Score | 4.65 × 10⁻³ | 0.73 | 1.39 × 10⁻² |
| — | Train Time | 6.10 × 10⁻⁵ | −1.04 | 1.83 × 10⁻⁴ |
| — | Test Time | 6.10 × 10⁻⁵ | 1.04 | 1.83 × 10⁻⁴ |
| Hybrid vs. Imputation | Accuracy | 3.31 × 10⁻² | 0.55 | 3.31 × 10⁻² |
| — | Balanced Accuracy | 3.36 × 10⁻³ | 0.76 | 1.01 × 10⁻² |
| — | F1 Score | 1.25 × 10⁻² | 0.65 | 1.39 × 10⁻² |
| — | Train Time | 6.10 × 10⁻⁵ | 1.04 | 1.83 × 10⁻⁴ |
| — | Test Time | 6.10 × 10⁻⁵ | −1.04 | 1.83 × 10⁻⁴ |
High generalization scenario
This scenario poses a greater challenge than the previous one. The training set consists of various multi-omics patterns, while the test set introduces new combinations or partially overlapping patterns, as shown in Table 10. This setup evaluates the models’ ability to generalize beyond the observed training distributions and accurately predict outcomes under unfamiliar missingness conditions. The average performance across all configurations, aggregated over 15 random seeds, is reported in Table 11, and the corresponding Wilcoxon signed-rank test results are reported in Table 12.
Table 10.
Training–testing pattern configurations for the high generalization scenario, with distributions and rationale
| Configuration | Training Patterns | Train Distribution | Testing Patterns | Test Distribution | Rationale |
|---|---|---|---|---|---|
| 1 | 1110, 0111, 1011, 1101 | [0.15, 0.15, 0.10, 0.10] | 0001, 0010, 0100, 1010, 0011, 1001 | [0.10, 0.05, 0.05, 0.10, 0.10, 0.10] | This configuration exposes the model to diverse multi-omics training patterns, then tests the model with a wide range of novel single- and dual-omics profiles in testing. |
| 2 | 1001, 0111, 1010, 0001 | [0.10, 0.10, 0.10, 0.10] | 0011, 0010, 0101, 1110 | [0.10, 0.10, 0.10, 0.10] | Tests generalization under full novelty by using non-overlapping test patterns absent from training. |
| 3 | 1110, 1101, 1011, 0101 | [0.10, 0.10, 0.10, 0.10] | 1001, 1110, 0011, 1010, 1101, 0110 | [0.10, 0.10, 0.10, 0.10, 0.10, 0.10] | Blends familiar and novel patterns to challenge partial-transfer generalization while increasing distribution shift. |
Table 11.
Average absolute (mean ± SD) and relative performance across all three configurations in the high generalization scenario
| Approach | Accuracy | Accuracy (% of Baseline Mean) | F1 Score | F1 Score (% of Baseline Mean) | Balanced Accuracy | Balanced Acc. (% of Baseline Mean) | Avg Train Time (seconds) | Avg Test Time (seconds) | Avg Total Time (seconds) |
|---|---|---|---|---|---|---|---|---|---|
| Dynamic Available-Case | 0.82 ± 0.01 | 100.35% | 0.26 ± 0.03 | 102.60% | 0.57 ± 0.02 | 100.31% | 47.82 ± 2.40 | 37.26 ± 3.41 | 85.08 ± 5.72 |
| Exhaustive Available-Case | 0.82 ± 0.01 | 100.35% | 0.26 ± 0.03 | 102.60% | 0.57 ± 0.02 | 100.31% | 120.50 ± 5.67 | 1.96 ± 0.08 | 122.46 ± 5.74 |
| Imputation-Based | 0.83 ± 0.03 | 101.73% | 0.32 ± 0.04 | 126.04% | 0.60 ± 0.02 | 105.36% | 42.14 ± 2.09 | 2.58 ± 0.07 | 44.72 ± 2.12 |
| Hybrid Approach | 0.84 ± 0.01 | 102.63% | 0.31 ± 0.05 | 120.83% | 0.59 ± 0.02 | 104.09% | 45.66 ± 1.67 | 2.71 ± 0.08 | 48.38 ± 1.69 |
Table 12.
Wilcoxon signed-rank test results for the high generalization scenario (15 paired seeds; Holm-adjusted p-values). Effect size r interpretation follows Table 3
| Comparison | Metric | p-value | Effect size (r) | Holm-adjusted p |
|---|---|---|---|---|
| Hybrid vs. Dynamic | Accuracy | 1.22 × 10⁻³ | 0.83 | 3.66 × 10⁻³ |
| — | Balanced Accuracy | 4.50 × 10⁻³ | 0.73 | 1.35 × 10⁻² |
| — | F1 Score | 2.16 × 10⁻³ | 0.79 | 6.47 × 10⁻³ |
| — | Train Time | 1.53 × 10⁻³ | −0.82 | 1.53 × 10⁻³ |
| — | Test Time | 6.10 × 10⁻⁵ | −1.04 | 1.83 × 10⁻⁴ |
| Hybrid vs. Exhaustive | Accuracy | 1.22 × 10⁻³ | 0.83 | 3.66 × 10⁻³ |
| — | Balanced Accuracy | 4.50 × 10⁻³ | 0.73 | 1.35 × 10⁻² |
| — | F1 Score | 2.16 × 10⁻³ | 0.79 | 6.47 × 10⁻³ |
| — | Train Time | 6.10 × 10⁻⁵ | −1.04 | 1.83 × 10⁻⁴ |
| — | Test Time | 6.53 × 10⁻⁴ | 0.88 | 1.31 × 10⁻³ |
| Hybrid vs. Imputation | Accuracy | 8.20 × 10⁻¹ | 0.06 | 8.20 × 10⁻¹ |
| — | Balanced Accuracy | 3.46 × 10⁻¹ | −0.24 | 3.46 × 10⁻¹ |
| — | F1 Score | 5.70 × 10⁻¹ | −0.15 | 5.70 × 10⁻¹ |
| — | Train Time | 4.27 × 10⁻⁴ | 0.91 | 8.54 × 10⁻⁴ |
| — | Test Time | 4.51 × 10⁻³ | 0.73 | 4.51 × 10⁻³ |
Missing omic simulation
In this setting, we withhold an entire omics layer at test time. The three configurations rotate which source is missing, as shown in Table 13. This scenario gauges how well each approach can leverage the remaining modalities, prevent dependence on a single omics block, and remain robust when entire data sources are missing for some samples, mirroring real-world platform failures. The average performance across all configurations, aggregated over 15 random seeds, and the corresponding Wilcoxon signed-rank test results are reported in Tables 14 and 15.
Table 13.
Training–testing pattern configurations for the missing omic simulation scenario, with distributions and rationale
| Configuration | Training Patterns | Train Distribution | Testing Patterns | Test Distribution | Rationale |
|---|---|---|---|---|---|
| 1 | 0011, 1011, 0111 | [0.10, 0.10, 0.10] | 1110, 0110, 1010 | [0.05, 0.05, 0.05] | Trains only on profiles with pp = 1; tests on profiles where pp = 0. |
| 2 | 1010, 1011, 0111, 0010 | [0.10, 0.10, 0.10, 0.10] | 1101, 1100 | [0.15, 0.15] | Trains only on profiles with mu = 1; tests on profiles where mu = 0. |
| 3 | 0100, 1110, 0101, 1100 | [0.10, 0.10, 0.10, 0.10] | 1011, 1010, 1001, 1000 | [0.10, 0.10, 0.05, 0.05] | Trains only on profiles with cn = 1; tests on profiles where cn = 0. |
Table 14.
Average absolute (mean ± SD) and relative performance across all three configurations in the missing omic simulation scenario
| Approach | Accuracy | Accuracy (% of Baseline Mean) | F1 Score | F1 Score (% of Baseline Mean) | Balanced Accuracy | Balanced Acc. (% of Baseline Mean) | Avg Train Time (seconds) | Avg Test Time (seconds) | Avg Total Time (seconds) |
|---|---|---|---|---|---|---|---|---|---|
| Dynamic Available-Case | 0.86 ± 0.02 | 104.98% | 0.36 ± 0.03 | 139.76% | 0.61 ± 0.01 | 107.40% | 48.88 ± 3.64 | 24.82 ± 2.21 | 73.70 ± 5.65 |
| Exhaustive Available-Case | 0.86 ± 0.02 | 104.98% | 0.36 ± 0.03 | 139.76% | 0.61 ± 0.01 | 107.40% | 125.76 ± 6.56 | 1.86 ± 0.07 | 127.61 ± 6.61 |
| Imputation-Based | 0.81 ± 0.02 | 98.19% | 0.24 ± 0.04 | 92.01% | 0.56 ± 0.02 | 97.58% | 40.36 ± 2.50 | 1.94 ± 0.06 | 42.30 ± 2.53 |
| Hybrid Approach | 0.88 ± 0.02 | 106.63% | 0.36 ± 0.05 | 142.10% | 0.62 ± 0.02 | 108.13% | 46.15 ± 1.32 | 2.28 ± 0.10 | 48.43 ± 1.34 |
Table 15.
Wilcoxon signed-rank test results for the missing omic scenario (15 paired seeds; Holm-adjusted p-values). Effect size r interpretation follows Table 3
| Comparison | Metric | p-value | Effect size (r) | Holm-adjusted p |
|---|---|---|---|---|
| Hybrid vs. Dynamic | Accuracy | 3.24 × 10⁻³ | 0.76 | 6.47 × 10⁻³ |
| — | Balanced Accuracy | 0.151 | 0.37 | 0.302 |
| — | F1 Score | 0.140 | 0.38 | 0.280 |
| — | Train Time | 1.03 × 10⁻² | −0.66 | 1.03 × 10⁻² |
| — | Test Time | 6.10 × 10⁻⁵ | −1.04 | 1.83 × 10⁻⁴ |
| Hybrid vs. Exhaustive | Accuracy | 3.24 × 10⁻³ | 0.76 | 6.47 × 10⁻³ |
| — | Balanced Accuracy | 0.151 | 0.37 | 0.302 |
| — | F1 Score | 0.140 | 0.38 | 0.280 |
| — | Train Time | 6.10 × 10⁻⁵ | −1.04 | 1.83 × 10⁻⁴ |
| — | Test Time | 6.10 × 10⁻⁵ | 1.04 | 1.83 × 10⁻⁴ |
| Hybrid vs. Imputation | Accuracy | 6.37 × 10⁻⁴ | 0.88 | 1.91 × 10⁻³ |
| — | Balanced Accuracy | 6.52 × 10⁻⁴ | 0.88 | 1.95 × 10⁻³ |
| — | F1 Score | 8.03 × 10⁻⁴ | 0.87 | 2.41 × 10⁻³ |
| — | Train Time | 1.22 × 10⁻⁴ | 0.99 | 2.44 × 10⁻⁴ |
| — | Test Time | 6.53 × 10⁻⁴ | 0.88 | 6.53 × 10⁻⁴ |
Training on sparse data, testing on rich data scenario
In this scenario, each model is trained solely on high-sparsity profiles as illustrated in Table 16 and then challenged with richer test samples that include omics layers never encountered together during training. The aim is to determine whether an approach can make effective use of the newly available information at inference and to gauge how much training on incomplete data affects its final predictive performance. The average results across all configurations, aggregated over 15 random seeds, are reported in Table 17, and the corresponding Wilcoxon signed-rank test results are presented in Table 18.
Table 17.
Average absolute (mean ± SD) and relative performance across all three configurations in the sparse Train/Rich test scenario
| Approach | Accuracy | Accuracy (% of Baseline Mean) | F1 Score | F1 Score (% of Baseline Mean) | Balanced Accuracy | Balanced Acc. (% of Baseline Mean) | Avg Train Time (seconds) | Avg Test Time (seconds) | Avg Total Time (seconds) |
|---|---|---|---|---|---|---|---|---|---|
| Dynamic Available-Case | 0.85 ± 0.01 | 104.12% | 0.33 ± 0.05 | 129.25% | 0.61 ± 0.02 | 106.12% | 42.84 ± 2.53 | 17.48 ± 0.77 | 60.32 ± 3.13 |
| Exhaustive Available-Case | 0.85 ± 0.01 | 104.12% | 0.33 ± 0.05 | 129.25% | 0.61 ± 0.02 | 106.12% | 105.60 ± 4.26 | 3.07 ± 0.15 | 108.67 ± 4.38 |
| Imputation-Based | 0.83 ± 0.02 | 100.57% | 0.32 ± 0.05 | 124.31% | 0.60 ± 0.03 | 105.33% | 46.30 ± 1.41 | 2.10 ± 0.05 | 48.40 ± 1.43 |
| Hybrid Approach | 0.86 ± 0.02 | 104.93% | 0.32 ± 0.05 | 126.48% | 0.60 ± 0.02 | 105.83% | 44.41 ± 11.45 | 1.93 ± 0.14 | 46.35 ± 11.46 |
Table 18.
Wilcoxon signed-rank test results for the sparse Train/Rich test scenario (15 paired seeds; Holm-adjusted p-values). Effect size r interpretation follows Table 3
| Comparison | Metric | n (seeds) | p-value | Effect size (r) | Holm-adjusted p |
|---|---|---|---|---|---|
| Hybrid vs. Dynamic | Accuracy | 15 | 0.146 | 0.38 | 0.292 |
| — | Balanced Accuracy | 15 | 0.850 | 0.05 | 1 |
| — | F1 Score | 15 | 1 | 0 | 1 |
| — | Train Time | 15 | 0.303 | −0.27 | 0.303 |
| — | Test Time | 15 | 6.10 × 10⁻⁵ | −1.04 | 1.83 × 10⁻⁴ |
| Hybrid vs. Exhaustive | Accuracy | 15 | 0.146 | 0.38 | 0.292 |
| — | Balanced Accuracy | 15 | 0.850 | 0.05 | 1 |
| — | F1 Score | 15 | 1 | 0 | 1 |
| — | Train Time | 15 | 6.10 × 10⁻⁵ | −1.04 | 1.83 × 10⁻⁴ |
| — | Test Time | 15 | 6.53 × 10⁻⁴ | −0.88 | 1.31 × 10⁻³ |
| Hybrid vs. Imputation | Accuracy | 15 | 2.15 × 10⁻³ | 0.79 | 6.46 × 10⁻³ |
| — | Balanced Accuracy | 15 | 0.599 | 0.14 | 1 |
| — | F1 Score | 15 | 0.495 | 0.18 | 1 |
| — | Train Time | 15 | 0.035 | −0.54 | 0.071 |
| — | Test Time | 15 | 1.06 × 10⁻² | −0.66 | 1.06 × 10⁻² |
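For reference, the paired testing procedure behind Table 18 can be sketched in a few lines of Python. The sketch below applies `scipy.stats.wilcoxon` to per-seed metric pairs, recovers |Z| from the two-sided p-value, signs the effect size r = Z/√n by the median paired difference, and implements the Holm step-down adjustment directly. The exact conventions used in the study (e.g., tie handling and the sign convention for r) may differ, so this is an illustrative approximation rather than the study code.

```python
import numpy as np
from scipy import stats

def paired_wilcoxon_with_r(a, b):
    """Two-sided Wilcoxon signed-rank test on paired per-seed scores.

    Effect size r = Z / sqrt(n), with |Z| recovered from the p-value
    and the sign taken from the median paired difference."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = len(a)
    _, p = stats.wilcoxon(a, b)               # two-sided by default
    z = stats.norm.isf(p / 2.0)               # |Z| from the two-sided p-value
    r = float(np.sign(np.median(a - b)) * z / np.sqrt(n))
    return p, r

def holm_adjust(pvals):
    """Holm step-down adjustment for a family of p-values."""
    p = np.asarray(pvals, float)
    order = np.argsort(p)
    adj = np.empty_like(p)
    running = 0.0
    for rank, idx in enumerate(order):
        # multiply the k-th smallest p by (m - k + 1), enforce monotonicity
        running = max(running, (len(p) - rank) * p[idx])
        adj[idx] = min(running, 1.0)
    return adj

# Example: 15 paired seeds where method A scores consistently ~0.02 higher.
rng = np.random.default_rng(0)
a = rng.normal(0.86, 0.02, size=15)
b = a - 0.02 + rng.normal(0.0, 0.005, size=15)
p, r = paired_wilcoxon_with_r(a, b)
```

Note that with 15 seeds the smallest attainable exact two-sided p-value is 2/2¹⁵ ≈ 6.10 × 10⁻⁵, which explains the repeated 6.10 × 10⁻⁵ entries in Table 18.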
Table 16.
Training–testing pattern configurations for the sparse Train/Rich test scenario, with distributions and rationale
| Configuration | Training Patterns | Train Distribution | Testing Patterns | Test Distribution | Rationale |
|---|---|---|---|---|---|
| 1 | 0001, 0010, 1001, 1000 | [0.10, 0.10, 0.10, 0.10] | 1110, 0111 | [0.15, 0.15] | Training covers one- and two-omic profiles; test probes generalization to dense three-omics combinations not seen during training. |
| 2 | 0001, 0100, 1000, 0110 | [0.10, 0.10, 0.10, 0.15] | 1011, 1110, 0111 | [0.15, 0.15, 0.15] | Training omits at least one key source each time; test samples reintroduce two new three-omic profiles. |
| 3 | 0010, 0100, 0001, 1010 | [0.10, 0.10, 0.10, 0.20] | 1101, 1110 | [0.20, 0.20] | Training covers sparse omics views; testing introduces novel multi-omic combinations, challenging the model to integrate across previously unlinked features. |
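To make configurations like those in Table 16 concrete, the sketch below shows how block-wise missingness of this form can be simulated: each sample is assigned a profile pattern (e.g., `"0001"`, where each digit marks one omics block as observed or missing) drawn with the listed probabilities, and the blocks marked `"0"` are blanked with NaN. The block boundaries and names here are hypothetical placeholders, and the assumption that unassigned samples remain complete is a simplification; only the overall mechanism mirrors the study design.

```python
import numpy as np

# Hypothetical boundaries for four omics blocks over the 1,937 features;
# the real dataset partitions its features differently.
BLOCKS = {
    "mutation": slice(0, 249),
    "cnv": slice(249, 1109),
    "rna": slice(1109, 1713),
    "protein": slice(1713, 1937),
}

def apply_blockwise_missingness(X, patterns, probs, rng):
    """Assign each sample a profile pattern drawn with the given
    probabilities and blank the omics blocks marked '0' with NaN.
    Any leftover probability mass keeps samples fully observed."""
    X = X.astype(float).copy()
    pats = list(patterns) + ["1" * len(BLOCKS)]
    p = list(probs) + [1.0 - sum(probs)]      # remainder stays complete
    idx = rng.choice(len(pats), size=X.shape[0], p=p)
    assigned = [pats[i] for i in idx]
    for row, pat in enumerate(assigned):
        for b, sl in enumerate(BLOCKS.values()):
            if pat[b] == "0":
                X[row, sl] = np.nan
    return X, assigned

# Example: mask a toy matrix with two single-omic training patterns.
rng = np.random.default_rng(0)
X = np.ones((200, 1937))
Xm, assigned = apply_blockwise_missingness(X, ["0001", "0010"], [0.5, 0.5], rng)
```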
Discussion
In real multi-omics pipelines and large-scale integrative analysis workflows, missing an entire omics layer is common due to budget constraints, tissue scarcity, or sequencing failures, making the ability to handle block-wise missingness pivotal for real-world deployment. In this study, a Hybrid Approach was introduced to address block-wise missingness in multi-omics datasets and evaluated against two widely used strategies, available-case and Imputation-Based modeling, across five clinically meaningful and methodologically diverse scenarios. These scenarios were designed to reflect real-world patterns of data missingness, ranging from aligned train-test distributions to highly divergent and sparsely populated cases. The degree of missingness varied across scenarios and configurations, as indicated by the distributions presented in the corresponding tables, covering a wide range of realistic omics incompleteness patterns.
Despite the challenges posed by diverse missingness patterns, the Hybrid Approach consistently demonstrated strong and stable performance across all evaluated scenarios, outperforming the Imputation-Based method and matching or exceeding available-case strategies.
When the train and test distributions were closely aligned, all available-case approaches showed similar results, as expected. However, in more complex settings such as the Missing-Omic Simulation and Sparse-to-Rich scenarios, the Hybrid Approach consistently led across multiple metrics, including accuracy, F1 score, and balanced accuracy, while also maintaining low and predictable inference latency (~ 2.18 ± 0.32 s). In the Light- and High-Generalization conditions, it achieved either the top or second-best F1 scores, demonstrating a strong balance between generalization to unseen simulated missingness patterns and computational efficiency.
Interestingly, in some cases, models exceeded the baseline set by complete data. This may be attributed to two main factors. First, block-wise missingness can function as an implicit form of regularization—eliminating noisy or weakly predictive omics layers reduces dimensionality and variance. Second, overlapping biological signals (e.g., TP53 status appearing across mutation, copy number, transcript, and protein layers) can lead to improved focus when noisy but redundant blocks are removed, thereby boosting minority-class recall and enhancing F1 and balanced accuracy. These findings highlight that more data is not always better when the additional omics layer contributes noise or redundancy rather than clinically meaningful signal.
Another important observation relates to the practical trade-off between training and testing times. The full-imputation strategy trained and predicted fastest, suiting large or time-critical datasets, but its performance wavered on novel patterns. Exhaustive Available-Case training was costly, yet its inference was very fast, which is ideal when models are built rarely but queried often. Both the Hybrid and Dynamic Available-Case approaches offered a more balanced solution. Notably, the Hybrid Approach combined low inference latency with consistent performance across diverse scenarios, making it a strong alternative to full imputation, offering comparable speed with better robustness. The Dynamic Available-Case approach, by contrast, showed higher variability in test-time performance, which could be problematic in settings where inference speed is critical.
From a data-integrity standpoint, preserving biological signal without distortion is essential. Although the Imputation-Based approach sometimes achieved strong predictive performance, its heavy reliance on inferred values—especially under block-wise missingness—produced noticeable run-to-run variability, raising concerns that imputation may insert biologically inaccurate signals. In contrast, available-case approaches preserve biological validity by using only genuinely observed data. While multiple imputation can yield more realistic values, applying it to every sample across a block-wise dataset is computationally expensive. The Hybrid Approach therefore incorporates imputation only in a targeted, test-time manner, enabling efficient alignment with trained models while preserving the interpretability and stability of the underlying biological signal. Further advances in imputation and available-case modeling are likely to enhance the Hybrid Approach’s biological reliability, as it builds on the strengths of both approaches.
While the absolute differences among methods may appear numerically small, these consistent gains across all 15 experimental configurations underline the Hybrid Approach’s robustness. It achieved the highest average relative balanced accuracy and F1 score while sustaining a low inference time, indicating that it provides the most efficient and reliable trade-off between performance and runtime for downstream prognostic modeling and patient-stratification tasks.
Overall, the statistical analysis confirms that the observed performance trends are consistent and reproducible across random seeds. Where differences were not statistically significant, they reflected small numerical gaps and normal seed variability rather than systematic advantages, reinforcing the robustness of the Hybrid Approach. Where accuracy was similar, the timing differences remained meaningful: the Hybrid Approach typically achieved lower test-time latency (negative r values). The few non-significant results, mainly in the Hybrid vs. Imputation comparison under the High-Generalization scenario, followed the same pattern. In that case, the Imputation-Based approach showed a marginally higher average accuracy, but the difference was not statistically significant (p = 0.82) and was accompanied by small, inconsistent effect sizes (r < 0.1), indicating no meaningful performance advantage. While the Imputation-Based method also achieved marginally faster test-time performance in that specific scenario, the Hybrid Approach maintained lower or comparable latency across the remaining scenarios, confirming its overall computational efficiency and stability.
A major limitation of this study lies in the limited availability of fully observed four-omics datasets. Much of the current research is based on three-omics data, so a four-omics dataset was intentionally chosen to test missingness patterns across a broader set of omics layers. A fully observed dataset was essential as a reliable benchmark on which missingness could be systematically introduced. The chosen dataset, however, exhibits a significant class imbalance; this was mitigated with focal-loss optimization and careful stratified sampling. Despite these measures, class imbalance remains a concern and highlights the need for future work using naturally balanced datasets or more advanced methods to address imbalance.
Although the proposed approach demonstrates promising results, it reflects an extensible methodological framework rather than an optimized end-to-end system. Simplicity and clarity were deliberately prioritized, using straightforward model choices, basic data-balancing, and independent profile-specific handling without shared learning, to showcase the core idea. A practical challenge encountered in this work was the lack of publicly available implementations for many published methods, as noted in a recent review [5]. Consequently, the fundamental concepts common across these approaches were implemented rather than reproducing each system in full.
Finally, the proposed framework was evaluated on bulk multi-omics data and assumes sample-level profiles with block-wise missingness. Extending the approach to single-cell multi-omics data, which exhibit fundamentally different sparsity patterns and hierarchical structure, remains an important direction for future work. In addition, validation on independent datasets and real-world missingness patterns will be necessary to assess the broader applicability and generalizability of the proposed approach. Future work should also explore more balanced datasets, advanced imputation methodologies, and current techniques for cross-profile attribute sharing, all of which are likely to improve predictive accuracy and biological validity and to enhance the utility of the Hybrid Approach for workflows where incomplete multi-omics profiles are routinely encountered.
Conclusion
This study introduces a Hybrid Approach for handling block-wise missingness in multi-omics data. By combining the strengths of available-case modeling and Imputation-Based methods, the proposed framework offers a practical balance between biological integrity, computational efficiency, and predictive robustness.
Through extensive experiments on a complete four-omics dataset and under diverse simulated missingness scenarios, the Hybrid Approach demonstrated competitive accuracy across the missingness settings examined in this study. It delivered stable results and often outperformed the alternative approaches, particularly under generalization stress.
A major strength of the Hybrid Approach lies in its flexibility. It adapts to unseen missingness patterns by matching incomplete samples with existing models using lightweight imputation only when necessary. This strategy maximizes the utility of existing models, preserves biological integrity, eliminates the need for retraining, and maintains fast inference times.
By modularizing key components such as profile detection, model routing, and selective imputation, each part can work independently while contributing to the overall process. This modularity simplifies implementation and maintenance and allows for quick inference without retraining models for every new missing data pattern. The resulting framework is therefore well suited for large-scale multi-omics analysis pipelines that require reproducibility, stability, and predictable latency. Additionally, because models are reused and imputation is applied only when needed, the approach scales efficiently to large cohorts without significant computational cost.
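The modular pipeline described above, profile detection, model routing, and selective imputation, can be sketched as follows. This is an illustrative skeleton, not the study implementation: `models` maps profile strings to fitted per-profile estimators, `impute_fn` stands in for whichever lightweight imputer fills a single missing block (e.g., k-NN), and the tie-breaking among candidate profiles is a simplification.

```python
import numpy as np

def detect_profile(x, blocks):
    """Profile string with '1' where the sample's omics block is fully observed."""
    return "".join("0" if np.isnan(x[sl]).any() else "1" for sl in blocks)

def route_and_predict(x, models, blocks, impute_fn):
    """Route a test sample to the model trained on its exact profile; if none
    exists, select the trained profile reachable by imputing the fewest
    missing blocks, impute only those blocks, and predict."""
    profile = detect_profile(x, blocks)
    if profile in models:                     # exact match: no imputation needed
        return models[profile].predict(x[None, :])[0]

    def blocks_to_impute(target):
        # blocks the target model expects that this sample lacks
        return sum(t == "1" and p == "0" for t, p in zip(target, profile))

    target = min(models, key=blocks_to_impute)
    x = x.copy()
    for b, sl in enumerate(blocks):
        if target[b] == "1" and profile[b] == "0":
            x[sl] = impute_fn(x, sl)          # selective, test-time imputation
    # a per-profile model consumes only its own observed blocks
    return models[target].predict(x[None, :])[0]

# Toy usage: two feature blocks, constant dummy models per profile.
class Const:
    def __init__(self, v):
        self.v = v
    def predict(self, X):
        return [self.v] * len(X)

blocks = [slice(0, 2), slice(2, 4)]
models = {"11": Const(1), "10": Const(0)}
x = np.array([1.0, 2.0, np.nan, np.nan])      # profile "10": exact match
x2 = np.array([np.nan, np.nan, 3.0, 4.0])     # profile "01": requires routing
```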
Notably, it was observed that in several scenarios, the Hybrid and other approaches surpassed the complete-data baseline, suggesting that omitting weak or noisy omics blocks can enhance performance.
While this study does not exhaustively evaluate all missingness configurations, the scenarios were chosen to span a broad range of realistic and clinically relevant patterns. As such, they provide a strong basis for assessing the general applicability of the Hybrid Approach.
Despite limitations such as dataset constraints and class imbalance, the Hybrid Approach provides a reliable and computationally efficient solution to the challenge of working with incomplete multi-omics profiles. By enabling robust performance under substantial data incompleteness, the method contributes to the development of more stable and generalizable multi-omics predictive models. Future work may incorporate cross-profile learning, improved imputation strategies, and more advanced model architectures to further enhance its applicability and performance.
Acknowledgements
Not applicable.
Author contributions
Esraa Abdelaziz conceptualized the study, implemented the methodology, conducted experiments, analyzed the results, and drafted the manuscript. Mai Mabrouk, Rasha Ismail, and Eman Amin reviewed the manuscript and contributed to discussion and editing. All authors read and approved the final manuscript.
Funding
Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).
Data availability
The dataset used in this study is publicly available and was retrieved from Kaggle: [https://www.kaggle.com/datasets/samdemharter/brca-multiomics-tcga]. All code used in this study, together with per-seed experimental results and supplementary tables, is publicly available at: [https://github.com/esraahamdi/Hybrid-MultiOmics-Missingness.git]
Declarations
Ethics approval and consent to participate
Not applicable. This study used publicly available, de-identified data and did not involve human participants, human tissue, or require institutional ethical approval.
Consent for publication
Not applicable. No individual person’s data are included in this study.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Esraa Hamdi Abdelaziz, Email: esraahamdiabbas@cis.asu.edu.eg.
Mai Mabrouk, Email: masaid@nu.edu.eg.
References
- 1.Kumar R, Romano JD, Ritchie MD. Network-based analyses of multiomics data in biomedicine. BioData Min. 2025;18(1):37. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Ballard JL, Wang Z, Li W, Shen L, Long Q. Deep learning-based approaches for multi-omics data integration and analysis. BioData Min. 2024;17(1):38. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Flores JE, Claborne DM, Weller ZD, Webb-Robertson B-JM, Waters KM, Bramer LM. Missing data in multi-omics integration: recent advances through artificial intelligence. Front Artif Intell. 2023;6:1098308. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Ludwigs F. A comparison study of prediction approaches for multiple training data sets and test data with block-wise missing values [Master's thesis; Internet]. 2020. Available from: https://epub.ub.uni-muenchen.de/73620/1/MA_Ludwigs.pdf.
- 5.Hornung R, Ludwigs F, Hagenberg J. Prediction approaches for partly missing multi-omics covariate data: a literature review and an empirical comparison study. WIREs Comput Stat. 2024. Available from: https://doi.org/10.1002/wics.1626.
- 6.Lan Q, Jiang S. A method of credit evaluation modeling based on block-wise missing data. Appl Intell. 2021;51(10):6859–80. [Google Scholar]
- 7.Schumann Y, Gocke A, Neumann JE. Computational methods for data integration and imputation of missing values in omics datasets. Proteomics. 2025;25(1–2):e202400100. [DOI] [PubMed] [Google Scholar]
- 8.Dong X, Lin L, Zhang R, Zhao Y, Christiani DC, Wei Y, et al. TOBMI: trans-omics block missing data imputation using a k-nearest neighbor weighted approach. Bioinformatics. 2019;35(8):1278–83. [DOI] [PubMed] [Google Scholar]
- 9.Xue F, Qu A. Integrating multisource block-wise missing data in model selection. J Am Stat Assoc. 2021;116(536):1914–27. [Google Scholar]
- 10.Klau S, Jurinovic V, Hornung R, Herold T, Boulesteix A-L. Priority-Lasso: a simple hierarchical approach to the prediction of clinical outcome using multi-omics data. BMC Bioinformatics. 2018 Sept 12;19(1):322. [DOI] [PMC free article] [PubMed]
- 11.To KT, Fry RC, Reif DM. Characterizing the effects of missing data and evaluating imputation methods for chemical prioritization applications using ToxPi. BioData Min. 2018;11:10. [DOI] [PMC free article] [PubMed]
- 12.Krutkin DD, Thomas S, Zuffa S, Rajkumar P, Knight R, Dorrestein PC, et al. To impute or not to impute in untargeted metabolomics: that is the compositional question. J Am Soc Mass Spectrom. 2025;36(4):742–59. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Ingalhalikar M, Parker WA, Bloy L, Roberts TPL, Verma R. Using multiparametric data with missing features for learning patterns of pathology. Med Image Comput Comput Assist Interv. 2012;15(Pt 3):468–75. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Yuan L, Wang Y, Thompson PM, Narayan VA, Ye J. Alzheimer’s disease neuroimaging initiative. multi-source feature learning for joint analysis of incomplete multiple heterogeneous neuroimaging data. Neuroimage. 2012 July 2;61(3):622–32. [DOI] [PMC free article] [PubMed]
- 15.Xiang S, Yuan L, Fan W, Wang Y, Thompson PM, Ye J, et al. Bi-level multi-source learning for heterogeneous block-wise missing data. NeuroImage. 2014;102:192–206. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Krautenbacher N, Flach N, Böck A, Laubhahn K, Laimighofer M, Theis FJ, et al. A strategy for high-dimensional multivariable analysis classifies childhood asthma phenotypes from genetic, immunological, and environmental factors. Allergy. 2019 July;74(7):1364–73. [DOI] [PMC free article] [PubMed]
- 17.Baena-Miret S, Reverter F, Vegas E. A framework for block-wise missing data in multi-omics. PLoS One. 2024 July 23;19(7):e0307482. [DOI] [PMC free article] [PubMed]
- 18.Baena-Miret S, Reverter F, Sánchez A, Vegas E. A two-step algorithm for handling block-wise missing data in multi-omics. Appl Sci (Basel). 2025;15(7):3650. [Google Scholar]
- 19.Yoon J, Jordon J, van der Schaar M. GAIN: missing data imputation using generative adversarial nets. In: International Conference on Machine Learning. PMLR; 2018.
- 20.Li Z. DeepMF: deep matrix factorization for multi-omics data integration. Bioinformatics. 2020.
- 21.Argelaguet R, et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 2020;21:111.
- 22.Benkirane H, Pradat Y, Michiels S, Cournède P-H. CustOmics: a versatile deep-learning based strategy for multi-omics integration. PLoS Comput Biol. 2023;19(3):e1010921. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.BRCA multi-omics (TCGA) [dataset; Internet]. Kaggle. [cited 2025 June 14]. Available from: https://www.kaggle.com/datasets/samdemharter/brca-multiomics-tcga
- 24.Han J, Kamber M, Pei J. Data mining: concepts and techniques. Morgan Kaufmann; 2011.
- 25.Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. [Google Scholar]
- 26.Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM; 2016. pp. 785–94.
- 27.Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: a highly efficient gradient boosting decision tree. Neural Inf Process Syst. 2017;3146–54.
- 28.Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. [DOI] [PubMed] [Google Scholar]
- 29.Lin T-Y, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2017. pp. 2980–8.
- 30.Kingma DP, Ba J. Adam: a method for stochastic optimization [Internet]. arXiv [cs.LG]. 2014. Available from: http://arxiv.org/abs/1412.6980
- 31.Hastie T, Tibshirani R, Sherlock G, Eisen M, Brown P, Botstein D. Imputing missing data for gene expression arrays. Technical report, Stanford University; 1999.