Author manuscript; available in PMC: 2025 Nov 27.
Published in final edited form as: Lab Invest. 2025 Oct 15;105(12):104253. doi: 10.1016/j.labinv.2025.104253

Toward the best generalizable performance of machine learning in modeling omic and clinical data

Fei Deng 1, Yongfeng Zhang 2, Lanjing Zhang 1,3,4
PMCID: PMC12648569  NIHMSID: NIHMS2122465  PMID: 41106592

Abstract

There are often performance differences between intra-dataset and cross-dataset tests in machine learning (ML) modeling. However, reducing these differences may reduce ML performance. Developing models that excel in intra-dataset testing yet remain generalizable to cross-dataset testing is thus a challenging dilemma. Therefore, we aimed to understand and improve the performance and generalizability of ML in intra-dataset and cross-dataset testing. We evaluated 4,200 ML models of classifying lung adenocarcinoma (LUAD) deaths using The Cancer Genome Atlas (TCGA, n=286) and Oncogenomic-Singapore (OncoSG, n=167) datasets, and 1,680 models of classifying glioblastoma deaths using TCGA (n=151) and Clinical Proteomic Tumor Analysis Consortium (CPTAC, n=97) datasets. After examining performance distributions of these ML models, we applied a dual analytical framework, including statistical analyses and SHapley Additive exPlanations-based meta-analysis, to quantify factors’ importance and trace model success back to design principles. We also developed a framework to identify the best generalizable model. Strikingly, the Jarque-Bera test revealed significant deviations of model performances from normality in both cancer types and testing contexts. Simple linear models with sparse feature sets consistently dominated in LUAD experiments, whereas non-linear models dominated in glioblastoma ones, suggesting that the best modeling strategy appears cancer-type/disease dependent. Importantly, both robust Analysis of Variance (ANOVA) and Kruskal–Wallis tests consistently identified differentially expressed genes as one of the most influential factors in both cancer types. The proposed multi-criteria framework successfully identified the model that achieved both the best cross-dataset performance and similar intra-dataset performance.
In summary, ML performance distributions significantly deviated from normality, which motivates using both robust parametric and non-parametric statistical tests. We quantified the factors associated with cross-dataset performance and generalizability of ML models in two cancer types, and showed how they may be exploited. A multi-criteria framework was developed and validated to identify the models that are accurate and consistently robust across datasets.

Keywords: Machine Learning, Cross-dataset Generalization, Modeling Factors, SHAP, Meta-Analysis

Introduction

The advent of high-throughput sequencing technologies has revolutionized biomedical research.1, 2 It provides unprecedented access to transcriptomic data that holds immense potential for advancing precision medicine.3, 4 Across oncology, transcriptomic datasets offer a powerful resource for developing prognostic models that can predict patient survival outcomes and guide clinical decision-making.5, 6 Machine learning (ML) has emerged as an indispensable tool for deciphering these complex, high-dimensional datasets, leading to the development of numerous predictive models.7–9

Despite this promise, the lack of broad generalizability remains a critical and persistent challenge that hinders the translation of these models into clinical practice.10 ML models’ superior performances often cannot be transferred to external, independent datasets.11, 12 This performance gap is largely attributable to technical variations, platform-specific biases, and batch effects, which are inherent to real-world, heterogeneous data.13 Current efforts to mitigate this issue typically fall into two parallel streams: “data-centric” approaches, which focus on data harmonization and batch effect correction (e.g., ComBat-Seq),14, 15 and “model-centric” approaches, which emphasize the selection of powerful or robust algorithms (e.g., extreme gradient boosting (XGB), domain adaptation).16, 17 However, the fundamental limitation lies in the conspicuous absence of a methodological framework to systematically evaluate their joint impact. This leaves a crucial knowledge gap regarding the complex, interactive effects among various modeling choices, such as feature selection, data normalization, and algorithm selection,18 which are rarely investigated in a systematic, integrated manner.

Therefore, we systematically investigated whether and how the interplay of the fundamental modeling factors of “data-centric” and “model-centric” paradigms19–21 impacts the performance and cross-dataset robustness of binary classification models, using lung adenocarcinoma (LUAD) and glioblastoma (GBM) datasets as examples. The LUAD dataset consists of The Cancer Genome Atlas (TCGA)22 and Oncogenomic-Singapore (OncoSG)23 cohorts, while the GBM dataset includes the TCGA and the Clinical Proteomic Tumor Analysis Consortium (CPTAC)24 cohorts. To classify survival outcomes in LUAD and GBM, respectively, we constructed and evaluated a total of 4,200 ML models for LUAD and 1,680 for GBM by combining various feature selection methods, data normalization strategies, and ML algorithms. These models were rigorously examined in both intra-dataset and cross-dataset settings. A novel framework was also developed to identify the best performing yet generalizable ML models.

Methods

Cohorts, Data, and Preprocessing

Figure 1 provides a schematic overview of the general data preprocessing and model development pipeline that generates the basic meta-data (see baseline characteristics of the cases in the 4 datasets in Supplementary Table 1). Since the sample sizes of TCGA were larger than those of the independent datasets, TCGA datasets were used for model training and intra-dataset testing for both cancer types (n=286 for LUAD and n=151 for GBM). Generalizability of the selected models was evaluated in cross-dataset testing using the OncoSG cohort (n=167) for LUAD cases and the CPTAC cohort (n=97) for GBM cases, respectively. For each modeling run, the TCGA cohort was randomly split into a training set and an intra-dataset test set (80%:20% ratio). The cross-testing performance was averaged over three runs, each using a randomly selected 90% subset of the independent cross-testing dataset.
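The split-and-repeat scheme above can be sketched as follows. This is a minimal illustration with synthetic data standing in for the TCGA and OncoSG cohorts; only the cohort sizes and split ratios are taken from the text, and scikit-learn's `train_test_split` is assumed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-ins for the TCGA training cohort and an external cohort;
# sizes mirror the LUAD cohorts (n=286 TCGA, n=167 OncoSG).
X_tcga, y_tcga = rng.normal(size=(286, 20)), rng.integers(0, 2, 286)
X_ext = rng.normal(size=(167, 20))

# 80%/20% split of TCGA into training and intra-dataset test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_tcga, y_tcga, test_size=0.2, stratify=y_tcga, random_state=0
)

# Cross-dataset testing: three runs, each on a random 90% subset of the
# external cohort; per-run scores would then be averaged.
cross_subsets = [
    rng.choice(len(X_ext), size=int(0.9 * len(X_ext)), replace=False)
    for _ in range(3)
]
```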

Figure 1:

Schematic Overview of the Data Preprocessing and Model Development Pipeline. DEGs, Differentially Expressed Genes; NDEGs, Non-Differentially Expressed Genes.

We analyzed the gene features (RNA-seq, Fragments per kilobase of transcript per million mapped reads) and clinical features that were shared across both intra-dataset and cross-dataset settings. The raw transcriptomic data was initially Z-score normalized. Any remaining missing values were imputed using the median for the respective feature.
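A minimal sketch of this preprocessing step on a toy expression matrix. The per-gene Z-scoring and median imputation follow the text; the data are synthetic, and the exact order of operations (normalize, then impute) is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy FPKM-like matrix (samples x genes) with a few missing values.
expr = rng.gamma(2.0, 5.0, size=(8, 4))
expr[0, 1] = np.nan
expr[3, 2] = np.nan

# Per-gene Z-score normalization, ignoring missing entries.
mu = np.nanmean(expr, axis=0)
sd = np.nanstd(expr, axis=0)
z = (expr - mu) / sd

# Impute remaining missing values with the per-gene median.
med = np.nanmedian(z, axis=0)
z = np.where(np.isnan(z), med, z)
```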

To select gene features for our classification models, we performed differential expression analysis on the training set using an Analysis of Variance (ANOVA) F-test to identify genes associated with survival status. Genes with low p-values were designated as Differentially Expressed Genes (DEGs), while those with high p-values, indicating minimal variation between survival groups, were designated as Non-Differentially Expressed Genes (NDEGs) and were utilized as stable internal controls for one of our normalization strategies.25
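The DEG/NDEG selection can be illustrated with scikit-learn's per-feature ANOVA F-test (`f_classif`). The p-value cutoffs below are hypothetical placeholders, not the study's thresholds, and the data are synthetic with ten planted informative genes.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(2)
n, g = 120, 200
y = rng.integers(0, 2, n)             # survival status (toy labels)
X = rng.normal(size=(n, g))
X[:, :10] += y[:, None] * 1.5         # plant 10 genes associated with outcome

# ANOVA F-test per gene against survival status (training set only).
_, pvals = f_classif(X, y)

# Low-p genes become DEGs; high-p genes (minimal variation between survival
# groups) become NDEGs, reusable as stable internal controls for NICG.
deg_idx = np.where(pvals < 0.01)[0]
ndeg_idx = np.where(pvals > 0.95)[0]
```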

The Modeling Factor Space

To systematically investigate the drivers of model performance, we defined a comprehensive space of modeling factors that spans the entire modeling workflow, encompassing choices from both data-centric and model-centric paradigms. The key factors and their respective choices are:

  1. Data-Centric Factors (choices related to data preparation):
    • Data Normalization (norm_method): six distinct normalization strategies were applied to the initial Z-scored data: Non-Parametric Normalization (NPN), Quantile Normalization (QN), QN followed by a Z-score transformation (QNZ), Normalization using Internal Control Genes (NICG),26–30 binarization, and direct use of the Z-scored data (raw_data).
    • Feature Selection (DEG & NDEG): A wide range of feature subsets were created by selecting varying P threshold of DEGs (e.g., 0.1%, 1%, 10%) or NDEGs (e.g., 99%, 95%), allowing us to test the impact of feature space composition and dimensionality.
  2. Model-Centric Factors (choices related to the learning algorithm):
    • Algorithm (model): Five classification algorithms were employed: Support Vector Machine (SVM),31 Logistic Regression (LR),32 XGB,33, 34 Least absolute shrinkage and selection operator (LASSO),35, 36 and a Multilayer Perceptron (MLP).37, 38
    • Configuration (weights): For each algorithm, we also tested the effect of applying class weights (weights=yes/no) to address the inherent class imbalance in the survival data.
    • Optimization Target (target): During hyperparameter tuning, models were optimized for either standard Accuracy (ACC) or Balanced Accuracy (BACC).

The full factorial combination of all levels across all these factors resulted in a set of 4,200 unique tuned ML models for LUAD (Supplementary Table 1). Since the cohorts within GBM were relatively balanced in sample distribution, the ‘weights’ modeling factor was not considered, resulting in a set of 1,680 tuned models.
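The full factorial combination can be enumerated with `itertools.product`. The factor levels below are illustrative stand-ins; the study's actual threshold lists differ, so this toy grid does not reproduce the 4,200-model count.

```python
from itertools import product

# Hypothetical factor levels for illustration only.
norm_methods = ["NPN", "QN", "QNZ", "NICG", "binary", "raw_data"]
deg_thresholds = [0.001, 0.01, 0.1]     # P thresholds for DEG selection
ndeg_thresholds = [0.99, 0.95]          # P thresholds for NDEG selection
models = ["SVM", "LR", "XGB", "LASSO", "MLP"]
weights = ["yes", "no"]                 # dropped for GBM in the study
targets = ["ACC", "BACC"]

# One dict per unique tuned-model configuration.
grid = [
    dict(norm=n, deg=d, ndeg=nd, model=m, weights=w, target=t)
    for n, d, nd, m, w, t in product(
        norm_methods, deg_thresholds, ndeg_thresholds, models, weights, targets
    )
]
```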

Each of these tuned models was then systematically executed according to the predefined workflow (Figure 1) and randomly repeated five times. The execution involved training each unique model configuration on the training set and subsequently evaluating its classification performance on both the intra-dataset test set and cross-dataset test set.

Meta-Dataset Construction and Primary Evaluation Metric

The performance results from all pipeline executions were then compiled and aggregated into our final meta-dataset. This compilation was performed separately for the intra-dataset and cross-dataset results, creating two parallel meta-datasets.

While a broad suite of performance metrics was collected (including accuracy, BACC, F1-score, kappa, recall, and area under the curve (AUC)), BACC was selected as the primary performance metric for this study. Given the notable class imbalance in some cohorts, standard accuracy can be an unreliable metric. Although other robust metrics such as the Matthews Correlation Coefficient (MCC) exist,39, 40 BACC offers key advantages for cross-dataset evaluation: it is prevalence-independent, highly interpretable as the average of sensitivity and specificity, and consistently scaled between 0 and 1.40 This makes it a more robust and appropriate choice for comparing model generalization performance across heterogeneous datasets with potentially different class distributions.40
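The prevalence-independence of BACC can be seen in a short sketch: on an imbalanced toy cohort, a degenerate majority-class predictor scores high accuracy but only 0.5 BACC, exposing the uninformative classifier.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; prevalence-independent."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    sens = np.mean(y_pred[y_true == 1] == 1)   # recall on positives
    spec = np.mean(y_pred[y_true == 0] == 0)   # recall on negatives
    return (sens + spec) / 2

# 90:10 imbalanced toy cohort; predictor always outputs the majority class.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)

acc = np.mean(y_true == y_pred)           # high, but misleading
bacc = balanced_accuracy(y_true, y_pred)  # 0.5: chance-level
```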

Statistical Evaluation of Modeling Factors

We first characterized the distribution of ML performances across all tuned models and then selected the appropriate statistics to analyze them. Accordingly, we performed a distributional diagnosis of BACC using the Jarque-Bera test41, 42 for formal inference and Quantile-Quantile (Q-Q) plots43 for visual inspection. This diagnosis showed that the distribution of ML performances deviated significantly from normality.

These results suggest that conventional parametric tests relying on strict normality assumptions (such as classical ANOVA) may yield unreliable inferences. To address this issue, and to ensure robust statistical testing, we employed two complementary approaches: (1) the non-parametric Kruskal-Wallis H-test,44 which makes no distributional assumptions and assesses differences in median performance, and (2) a robust ANOVA with heteroscedasticity-consistent (HC3) corrections, which remains parametric but adjusts for heteroscedasticity and deviations from normality while evaluating contributions to performance variance. The H-statistics and F-statistics from these two robust methods were then used to rank the global influence of each factor in both the intra-dataset and cross-dataset settings.
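A minimal sketch of this diagnosis-then-test sequence with SciPy, on a synthetic heavy-tailed performance meta-dataset (the robust HC3 ANOVA would typically come from a package such as statsmodels and is omitted here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Toy BACC meta-dataset: heavy-tailed scores grouped by a hypothetical
# modeling factor with three levels.
factor = rng.integers(0, 3, 600)
noise = stats.t.rvs(df=3, size=600, random_state=3) * 0.02
bacc = 0.6 + 0.03 * factor + noise

# 1) Distributional diagnosis: Jarque-Bera test for normality.
jb_stat, jb_p = stats.jarque_bera(bacc)

# 2) Non-parametric factor test: Kruskal-Wallis H across factor levels.
groups = [bacc[factor == k] for k in range(3)]
h_stat, kw_p = stats.kruskal(*groups)
```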

Meta-Analysis for Deconstructing Interactions

We applied a SHAP-based meta-analytical framework to the meta-dataset for a more granular, context-dependent explanation of performance.45–47 The XGB model was trained as a ‘meta-model’ to predict a tuned model’s ‘BA_value’ from its constituent modeling factors. XGB was chosen due to its superior ability to capture complex non-linearities and high-order interactions, robust handling of mixed data types, and compatibility with the SHAP framework.48, 49

The SHAP ‘TreeExplainer’ framework was then applied to this meta-model.50 This approach provides instance-level attributions (SHAP values) for each factor choice in every tuned model, visualizes the entire distribution of these interaction effects via faceted plots, and can help understand the complete decision logic of modeling strategies with a previously unattainable level of granularity.
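The Shapley logic underlying SHAP can be illustrated exactly on a toy ‘meta-model’ over three binary modeling-factor indicators. The effect sizes here are invented for illustration only; real attributions would come from `TreeExplainer` applied to the fitted XGB meta-model.

```python
import numpy as np
from itertools import combinations
from math import factorial

# Toy "meta-model": predicted BACC as a function of three hypothetical
# modeling-factor indicators (effects invented for illustration).
def meta_model(linear_model, sparse_deg, bacc_target):
    score = 0.55
    score += 0.04 * linear_model
    score += 0.03 * sparse_deg
    score += 0.02 * linear_model * sparse_deg   # interaction term
    score += 0.01 * bacc_target
    return score

def shapley_values(f, x, baseline):
    """Exact Shapley attribution of f(x) - f(baseline) over all coalitions."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                z = list(baseline)
                for j in S:
                    z[j] = x[j]
                without_i = f(*z)
                z[i] = x[i]
                with_i = f(*z)
                # Classic Shapley coalition weight |S|!(n-|S|-1)!/n!
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += w * (with_i - without_i)
    return phi

x, base = (1, 1, 1), (0, 0, 0)
phi = shapley_values(meta_model, x, base)
# The interaction credit is split equally between the two interacting factors.
```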

A Multi-Criteria Framework for Evaluating Model Trade-offs

Based on the distribution and SHAP analyses, we propose a novel framework that combines a key performance metric (BACC here) with the statistical significance of the differences between intra-dataset and cross-dataset testing. It includes three steps, assessing baseline performance, generalization ability, and cross-dataset robustness, respectively.

  1. Step 1: Identify the models with high baseline-performance.

    For each modeling strategy, we conducted multiple independent experiments on both intra- and cross-dataset test sets to obtain a series of BACC values. In this study, five random repeats were conducted for intra-dataset testing of each tuned model and 15 (5 × 3) random repeats for cross-dataset testing. Based on the means of these BACC values, strategies were ranked to identify the high-performing models (e.g., within the top 10% of the ranking) in both intra- and cross-dataset testing. These models were included in the candidate strategy pool and defined as candidate models.

  2. Step 2: Assess generalizability of the candidate models and select the most robust strategies.

    The Shapiro-Wilk test51, 52 was then employed to assess whether the BACC values of each candidate model, obtained from its repeats in intra- and cross-dataset testing respectively, followed a normal distribution. If >66% of the candidate models’ repeats fit a normal distribution, an independent Student’s t-test was used to calculate the p values between the intra- and cross-dataset results for candidate strategies; otherwise, the non-parametric Mann-Whitney U test was employed.52–54 Only the strategies exhibiting no statistically significant difference (p > 0.05) between the two settings were retained for the final ranking in Step 3.

  3. Step 3: Rank selected candidate models by cross-dataset testing performance.

    Final candidates are ranked by their mean BACC on the cross-dataset evaluation. The top-ranked strategy that passes all three stages will be designated as the champion model for its robustness and generalization ability.

Results

Overall Model Performance and Distributional Properties

For LUAD, the 4,200 tuned models yielded a wide spectrum of performance, with BACC fluctuating across a broad range (Supplementary Figure 1). The Jarque-Bera test confirmed significant deviations from normality for both intra-dataset (JB-statistic = 1160.62, p = 9.416E-253) and cross-dataset (JB-statistic = 10838.24, p < 1E-308) results. This was visually supported by Q-Q plots, which, while showing an approximate adherence to normality in the central region, revealed significant deviations in the tails (Figure 2). These findings underscored the necessity of employing robust statistical methods for all subsequent analyses.

Figure 2:

Q-Q plots show the residuals from model performances of intra-dataset and cross-dataset testing in classifying deaths of lung adenocarcinoma and glioblastoma patients, respectively. BA, balanced accuracy; GBM, glioblastoma; LUAD, lung adenocarcinoma.

Similarly, for GBM, the Jarque–Bera test indicated non-normality for both intra-dataset (JB = 22.52, p = 1.288E-5) and cross-dataset (JB = 217.13, p = 7.088E-48) results, consistent with the Q–Q patterns in Figure 2.

Statistical Analysis Reveals Different Factor Importance

Robust ANOVA and the non-parametric Kruskal–Wallis test offered complementary perspectives on the importance of modeling factors (Table 1 and Supplementary Figure 2).

Table 1.

Comparison of Factor Importance Rankings from Robust ANOVA and Kruskal-Wallis Test

Evaluation Setting, Modeling Factor, Robust ANOVA (F-statistic, p-value), Kruskal-Wallis Test (H-statistic, p-value)
LUAD Intra-dataset NDEG 433.789 <1E-308 1453.109 <1E-308
DEG 240.334 3.34E-298 1299.939 1.12E-277
model 346.706 1.60E-289 1025.459 1.08E-220
target 764.628 2.43E-165 674.217 1.21E-148
norm_method 121.709 2.10E-127 462.665 9.10E-98
weights 159.002 2.53E-36 113.601 1.59E-26
Cross-dataset model 2142.627 <1E-308 6281.526 <1E-308
weights 505.046 2.09E-111 216.439 5.41E-49
DEG 742.865 <1E-308 4279.915 <1E-308
target 686.834 1.41E-150 83.833 5.38E-20
norm_method 156.706 4.65E-166 609.466 1.82E-129
NDEG 57.968 6.43E-49 404.379 3.15E-86
GBM Intra-dataset DEG 321.682 <1E-308 1511.533 <1E-308
NDEG 157.820 1.63E-99 481.439 5.02E-104
model 57.295 8.98E-48 191.451 2.59E-40
norm_method 56.312 8.83E-58 240.297 6.63E-50
target 16.762 4.28E-05 16.851 4.04E-05
Cross-dataset DEG 681.202 <1E-308 3647.641 <1E-308
norm_method 464.307 <1E-308 1963.245 <1E-308
NDEG 203.583 1.78E-130 653.085 3.12E-141
model 83.385 1.85E-70 278.130 5.64E-59
target 71.573 2.81E-17 89.518 3.04E-21

DEG: Differentially Expressed Genes; NDEG: Non-Differentially Expressed Genes; norm_method: normalization method; ANOVA: Analysis of Variance; GBM, glioblastoma; LUAD, lung adenocarcinoma.

The top two ranked factors are bolded.

For LUAD, the results revealed two key insights. First, modeling factor importance differed significantly between intra- and cross-dataset testing. In intra-dataset testing, factors related to model configuration and data processing (namely, `target` or `NDEG`-based feature selection) were the most influential. In the cross-dataset setting, however, `model` was the overwhelmingly dominant factor, while the importance of `target` and `NDEG` decreased.

Second, the two robust statistical methods yielded slightly different top-ranked factors in the intra-dataset testing (Table 1). Robust ANOVA, which assesses contribution to performance variance,55 identified `target` as most influential (F = 764.628), while the Kruskal-Wallis test, which is more sensitive to shifts in the rank distribution,44 ranked `NDEG` as dominant (H = 1453.109). This highlights the limitations of relying solely on high-level importance rankings and motivates our subsequent, deeper investigation using SHAP meta-analysis to deconstruct the underlying mechanisms. Interestingly, the two methods agreed on the top two influencing factors in the cross-dataset testing, namely `model` and `DEG`, but the third- and fourth-ranked factors differed by statistical method.

For GBM, by contrast, the two statistical methods showed strong concordance (Table 1). In the cross-dataset setting, both statistical methods highlighted DEG and norm_method as the most influential factors. Their intra-dataset rankings also largely overlapped, underscoring the robustness of these findings. Although there were minor differences in the ranking of factor importance between intra- and cross-dataset settings, DEG consistently ranked first by both statistical methods and testing settings.

Meta-Analysis: Deconstructing the Drivers of Generalizability

To further investigate the differences in factor importance rankings derived from the Robust ANOVA and Kruskal–Wallis tests, we analyzed the SHAP values and their first-order interactions for the top 30 most impactful factors, using LUAD as an example (Figure 3). In both intra- and cross-dataset testing, factor values related to algorithm choice—particularly model_LR and several LASSO-based configurations such as model_LASSO_BACC_W—consistently ranked high. This indicated that, at least in LUAD, the choice of a simpler linear algorithm was a primary determinant of generalization success. Further analysis of the interaction between algorithm choice (model) and the predictive feature space (DEG selection) revealed a robust and stable path to cross-dataset performance: LR with highly sparse DEG sets (e.g., p ≤ 0.01) consistently outperformed other strategies, with LASSO providing comparable stability. These findings reinforce the principle of parsimony in LUAD, where simpler models combined with carefully pruned feature sets reduce overfitting and capture generalizable biological signals.

Figure 3:

Violin plots of top 30 SHAP-ranked factor values across intra-dataset (TCGA) and cross-dataset (OncoSG) testing results in classifying deaths of lung adenocarcinoma patients. SHAP, SHapley Additive exPlanations; TCGA, The Cancer Genome Atlas; OncoSG, the Oncogenomic-Singapore; LUAD, lung adenocarcinoma.

Supplementary Figures 3–8 further illustrate how data processing and model configuration influence model generalizability in LUAD. Unlike the relatively clear advantage of linear models, no single normalization method consistently outperformed the others. As for model configuration, applying class weights to address data imbalance provided minimal benefit, whereas selecting an appropriate optimization target (e.g., ‘BACC’) contributed meaningfully to improving cross-dataset robustness.

Strikingly, a different pattern emerged in GBM experiments. The SHAP analysis fully aligned with both Robust ANOVA and Kruskal–Wallis rankings, further confirming that DEG selection was the most important modeling factor across both intra- and cross-dataset testing. Importantly, the SHAP plots (Supplementary Figure 9) revealed that DEG thresholds at 4% and 10% provided consistent, positive contributions to model performance, serving as two stable critical thresholds for feature-space definition. At the algorithmic level, non-linear models such as MLP and SVM ranked top by performance, surpassing linear models in stability and robustness. This suggests that, unlike LUAD, GBM may require more expressive architectures to capture the complex biological and clinical feature interactions underlying its predictive space.

Taken together, these results demonstrate that, while LUAD benefits most from parsimonious linear models paired with sparse DEG sets, GBM highlights the central role of DEG-based feature-space construction and the need for more flexible, non-linear modeling strategies.

Identifying Elite Strategies with the Multi-Criteria Framework

Finally, we implemented the multi-criteria framework to identify the best-performing yet generalizable strategies for both LUAD and GBM in three steps.

For LUAD, we first applied BA_value-based filtering criteria to the 4,200 tuned models. In line with common practices for identifying top contenders in high-throughput exploratory analyses,56 we created a candidate pool of 99 models that simultaneously ranked within the top 10% by BA_value in both intra-dataset and cross-dataset testing. Second, because only 28.5% of the models’ BA_value repeats showed a normal distribution (Shapiro–Wilk test, p > 0.05) across intra- and cross-dataset settings, we applied the non-parametric Mann-Whitney U test and identified 13 elite models (Table 2) with no significant performance degradation/difference between intra- and cross-dataset testing (p > 0.05). Finally, the list was ranked by cross-dataset performance.

Table 2.

13 elite strategies selected by the Multi-Criteria Framework on Mann-Whitney U test (LUAD)

Target, model, P threshold for DEG, P threshold for NDEG, Norm-method, weights, Intra-dataset (mean, Std.), Cross-dataset (mean, Std.), p_value (intra vs cross)
BACC LR 0.8% 99% QN no 0.732 0.119 0.635 0.036 0.116
BACC LASSO 1% 99% NPN yes 0.735 0.153 0.617 0.049 0.190
BACC LASSO 0.8% 99% NPN yes 0.722 0.158 0.616 0.039 0.662
BACC LR 0.1% 98% binary yes 0.710 0.091 0.612 0.025 0.055
BACC LR 0.8% 99.5% NICG yes 0.729 0.137 0.611 0.031 0.256
BACC LASSO 0.7% 98% binary yes 0.741 0.102 0.609 0.023 0.055
BACC LR 0.4% 99% binary yes 0.721 0.132 0.605 0.027 0.055
BACC LR 0.1% 92% NICG yes 0.721 0.138 0.603 0.022 0.055
BACC LR 1% 99% NPN yes 0.711 0.125 0.596 0.060 0.055
BACC LR 0.4% 98% QN no 0.717 0.178 0.589 0.063 0.119
BACC LASSO 0.1% 99% NPN yes 0.761 0.162 0.578 0.013 0.055
BACC LASSO 1% 95% raw_data yes 0.717 0.132 0.575 0.020 0.055
BACC LR 0.4% 98% QN yes 0.715 0.105 0.623 0.042 0.097

P threshold for DEG: select the DEGs by P<threshold; P threshold for NDEG: select the NDEGs by P>threshold; Norm-method: Normalization method; Std.: Standard Deviation; Intra-dataset: intra-dataset testing results; Cross-dataset: cross-dataset testing results; p_value (intra vs cross): p-value from a Mann-Whitney U test on these selected models’ performances across intra-dataset and cross-dataset testing results; BACC: Balanced Accuracy; LR: Logistic Regression; LASSO: Least absolute shrinkage and selection operator; NPN: Non-Parametric Normalization; QN: Quantile Normalization; NICG: Normalization using Internal Control Genes; LUAD: lung adenocarcinoma.

The best/preferred model is highlighted in bold.

We then sought to validate whether these performance-driven selections were aligned with our explainability-based insights. Interestingly, the vast majority of the 13 preferred models shared the core design principles that SHAP had independently identified as drivers of robust generalizability: they predominantly employed LR or LASSO, universally used BACC as the optimization target, and consistently favored highly sparse DEG feature sets. This remarkable alignment provides a powerful cross-validation between our performance-based filtering and our interpretability analysis, confirming the internal consistency and effectiveness of our entire evaluation framework.

For GBM, we evaluated 1,680 tuned models. Based on BACC filtering, we identified a candidate pool of 71 models. Since 81.9% of the models’ BA_value repeats did not reject the null hypothesis of normality (Shapiro–Wilk test, p > 0.05), we employed an independent-samples Student’s t-test to compare intra- and cross-dataset performance and ultimately selected 17 elite models with no significant cross-dataset performance degradation (p > 0.05; Table 3). The dominant roles of DEG and NDEG exhibited strong consistency with both the statistical analyses and the SHAP results (Supplementary Figure 9).

Table 3.

17 elite strategies selected by the Multi-Criteria Framework on Independent Student’s t-test (GBM)

Target, model, P threshold for DEG, P threshold for NDEG, Norm-method, Intra-dataset (mean, Std.), Cross-dataset (mean, Std.), p_value (intra vs cross)
ACC MLP 4% 99.5% NICG 0.690 0.111 0.648 0.027 0.446
BACC MLP 4% 99.5% NICG 0.710 0.098 0.632 0.029 0.151
BACC SVM 4% 99% NPN 0.714 0.103 0.612 0.037 0.092
BACC SVM 0.8% 99.5% NICG 0.698 0.109 0.611 0.031 0.148
BACC LR 4% 99.5% raw_data 0.684 0.100 0.610 0.039 0.177
ACC LR 4% 99.5% NICG 0.705 0.102 0.610 0.031 0.103
ACC MLP 4% 95% NPN 0.707 0.104 0.606 0.034 0.096
BACC LR 4% 98% raw_data 0.690 0.134 0.599 0.033 0.204
BACC MLP 4% 95% NPN 0.691 0.109 0.598 0.034 0.131
BACC LR 4% 99.5% NICG 0.695 0.108 0.598 0.030 0.115
ACC SVM 4% 99% NPN 0.696 0.089 0.597 0.034 0.067
ACC SVM 4% 98% NPN 0.692 0.089 0.596 0.046 0.073
ACC LR 4% 98% raw_data 0.685 0.130 0.596 0.025 0.199
BACC SVM 10% 95% NICG 0.688 0.089 0.594 0.029 0.077
BACC MLP 10% 99.5% NPN 0.705 0.116 0.594 0.022 0.099
ACC LR 4% 99% raw_data 0.705 0.118 0.592 0.019 0.099
BACC SVM 4% 99.5% NICG 0.710 0.150 0.592 0.037 0.155

P threshold for DEG: select the DEGs by P<threshold; P threshold for NDEG: select the NDEGs by P>threshold; Norm-method: Normalization method; Std.: Standard Deviation; Intra-dataset: intra-dataset testing results; Cross-dataset: cross-dataset testing results; p_value (intra vs cross): p-value from an independent-samples Student’s t-test on these selected models’ performances across intra-dataset and cross-dataset testing results; BACC: Balanced Accuracy; ACC: standard Accuracy; LR: Logistic Regression; MLP: Multilayer Perceptron; SVM: Support Vector Machine; NPN: Non-Parametric Normalization; NICG: Normalization using Internal Control Genes; GBM: glioblastoma.

The best/preferred model is highlighted in bold.

Discussion

By systematically evaluating two independent examples (4,200 LUAD and 1,680 GBM tuned models) and applying a SHAP-based meta-analytical framework, we demonstrate that the “important” modeling factors shift dramatically when moving from intra-dataset to cross-dataset testing. This analysis shows that a model’s cross-dataset performance is not always determined by a fixed ranking of modeling choices, but by the interplay among factors under varying data or testing contexts. This insight challenges conventional model development and evaluation practices and motivates us to establish a new, integrated framework for building the best generalizable cross-dataset ML models from real-world, heterogeneous data.

This study seems to have several scientific contributions. First, we challenge the conventional reliance on intra-dataset testing with robust empirical evidence for the “under-specification” phenomenon in bioinformatics, demonstrating that high in-distribution performance is a poor proxy for out-of-distribution (e.g., cross-dataset) generalizability.57

Second, we are the first, to our knowledge, to systematically examine the ML performance distributions arising from different modeling factors in a meta-analytical context. After confirming their non-normal nature, we employed a robust toolkit of parametric and non-parametric tests58 to credibly establish that the relative importance of different modeling factors shifts dramatically depending on the evaluation context (i.e., intra-dataset vs. cross-dataset).

Third, inspired by recent works on explaining hyperparameter optimization,45–47, 59 we pioneer a novel meta-analytical application paradigm using SHAP.49 While traditional tests ranked factor importance, they failed to explain why a model succeeded or failed under specific conditions.60 To address this gap, our approach can attribute a model’s cross-dataset generalization performance to the full spectrum of upstream modeling decisions, including algorithm choice, feature selection, and data normalization.61 It also appears to move significantly beyond prior analyses that mostly focused on single-model hyperparameters.59

Fourth, the convergence of statistical and SHAP analyses shows that the optimal modeling strategy appears cancer-type dependent. In LUAD, the choice of algorithm was decisive: simple linear models (e.g., LR and LASSO) combined with sparse predictive feature sets consistently outperformed complex non-linear models in cross-dataset generalization. By contrast, feature-space definition emerged as dominant in GBM experiments, where DEG-based feature sets (particularly at 4% and 10% thresholds) showed the strongest contributions to performance. Interestingly, this differs from our prior breast cancer study using paired RNA-seq and microarray data, in which the normalization method itself played a substantial role.25 Taken together, while model class (e.g., linear versus non-linear), feature selection, and normalization are all critical, the key factor(s) associated with better generalizability may differ by disease context and/or dataset characteristics.

Finally, based on these insights, we propose a novel, multi-criteria framework for cross-dataset model selection that achieves an optimal balance between performance, cross-dataset generalization, and decision-logic stability. In this framework, the statistical test used to evaluate model generalizability is chosen according to the model-performance distribution (normal versus non-normal). Our empirical comparison (Table 2 versus Supplementary Table 2, and Table 3 versus Supplementary Table 3) revealed greater consistency between the two tests for GBM. This aligns with statistical literature indicating that parametric and non-parametric tests yield similar outcomes when data approximate normality.24, 52–54 Established statistical principles indicate that parametric tests (e.g., Student’s t-test) are more powerful when normality holds, whereas non-parametric tests (e.g., Mann-Whitney U) offer robustness when normality assumptions are violated.52–54, 62, 63 We found that the best models selected by the parametric and non-parametric tests were identical in both the LUAD and GBM experiments. This consistency further validates the robustness of our adaptive testing strategy. This framework thus appears to offer a practical pipeline for developing the best generalizable ML models in cross-dataset settings.
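The adaptive choice between parametric and non-parametric two-sample tests in this framework can be sketched as follows (a simplified illustration with hypothetical scores and a hypothetical helper function, not the authors' exact pipeline):

```python
# Simplified sketch: pick Welch's t-test when both score samples pass a
# normality check, and fall back to Mann-Whitney U otherwise.
import numpy as np
from scipy import stats

def compare_model_scores(scores_a, scores_b, alpha=0.05):
    """Return (test_name, p_value) for comparing two performance samples."""
    normal = all(stats.jarque_bera(s).pvalue > alpha for s in (scores_a, scores_b))
    if normal:
        return "Welch t-test", stats.ttest_ind(scores_a, scores_b, equal_var=False).pvalue
    return "Mann-Whitney U", stats.mannwhitneyu(scores_a, scores_b).pvalue

rng = np.random.default_rng(1)
a = rng.normal(0.72, 0.03, 200)   # hypothetical intra-dataset BACC values
b = rng.normal(0.65, 0.05, 200)   # hypothetical cross-dataset BACC values
name, p = compare_model_scores(a, b)
print(f"{name}: p = {p:.3g}")
```

With a large true difference, as in this toy case, both branches reach the same conclusion, mirroring the agreement between test types observed in the LUAD and GBM experiments.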

While this study provides a generalizable methodology and actionable insights into cross-dataset ML performance, it has several limitations. First, our analysis was conducted on LUAD and GBM datasets, and extending this work to additional cancer/disease types and more diverse populations will be critical for assessing its generalizability. Second, while our modeling space was broad, our work was limited to commonly used ML models. Future work should therefore incorporate more theoretically robust algorithms such as invariant risk minimization64 and group distributionally robust optimization,65 as well as modern approaches including self-supervised models (e.g., contrastive learning66) and foundation models (e.g., large-scale pre-trained omics models67). Third, because DEGs were selected based on ANOVA p-value thresholds and the training sets varied in each experiment, the resulting gene lists varied to some extent. Future work is thus warranted to investigate the impact of a fixed versus dynamically selected gene set on model performance, and their clinical implications. Finally, we focused on BACC as the performance metric. Incorporating complementary metrics such as accuracy could provide a more comprehensive evaluation,68 though additional performance metric(s) may also complicate or skew the evaluation and ranking. Despite these limitations, the framework we propose is readily transferable to other omics domains and disease types, providing a simple yet robust path toward the best generalizable ML models.
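As a hypothetical illustration of the threshold-based DEG selection discussed above (synthetic data and an arbitrary cutoff, not the study's actual gene lists), per-gene ANOVA p-values can be computed and filtered as follows; because the p-values depend on the sampled training data, the selected list varies with each split:

```python
# Synthetic sketch: keep genes whose one-way ANOVA p-value across outcome
# groups falls below a chosen threshold. The resulting DEG list depends on
# the (random) training sample, as noted in the limitations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_genes, n_per_group = 1000, 40
dead = rng.normal(0.0, 1.0, (n_genes, n_per_group))
alive = rng.normal(0.0, 1.0, (n_genes, n_per_group))
dead[:50] += 1.5  # make the first 50 genes truly differential

pvals = np.array([stats.f_oneway(dead[g], alive[g]).pvalue for g in range(n_genes)])
deg_idx = np.where(pvals < 0.01)[0]  # an illustrative 1% p-value threshold
print(f"{deg_idx.size} DEGs selected at p < 0.01")
```

Resampling the groups and rerunning the selection would yield a slightly different gene list, which is precisely the source of variability the limitation describes.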

Through systematic, multi-level analyses of ML performance in classifying survival in two cancer types, we provide new empirical evidence for understanding and optimizing the cross-dataset generalization of ML models classifying cancer survival using transcriptomic and clinical data. We also developed and validated a multi-criteria framework for model selection that offers a transparent, data-driven approach to identifying models that are not just accurate but consistently robust across datasets. Our data suggest that researchers should move beyond the sole pursuit of high performance within a single, intra-dataset testing context and also focus on the generalizability, explainability, and cross-dataset performance of ML modeling.

Supplementary Material

Supplementary material

Funding

This work was supported by the National Cancer Institute, National Institutes of Health (R37CA277812 to LZ).

Footnotes

Conflicts of Interest

The authors declare no conflicts of interest, except that Dr. Zhang serves as a Senior Associate Editor and Dr. Deng serves as an Editorial Board Member of the journal, Laboratory Investigation.

Compliance with ethical standards

This exempt study using publicly available de-identified data did not require an IRB review.

Data Availability Statement

The data sets used and/or analyzed in this study are available on the cBioPortal website (https://www.cbioportal.org/). The program code is available from the corresponding authors on reasonable request.

References

  • 1. ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature. 2020;578(7793):82–93. 10.1038/s41586-020-1969-6.
  • 2. Liu DD, Zhang L. Trends in the characteristics of human functional genomic data on the gene expression omnibus, 2001–2017. Lab Invest. 2019;99(1):118–127. 10.1038/s41374-018-0125-5.
  • 3. Zhang Y, Wang D, Peng M, et al. Single-cell RNA sequencing in cancer research. Journal of Experimental & Clinical Cancer Research. 2021;40:1–17. 10.1186/s13046-021-01874-1.
  • 4. Rodon J, Soria J-C, Berger R, et al. Genomic and transcriptomic profiling expands precision cancer medicine: the WINTHER trial. Nature Medicine. 2019;25(5):751–758. 10.1038/s41591-019-0424-4.
  • 5. Sun S, Guo W, Wang Z, et al. Development and validation of an immune-related prognostic signature in lung adenocarcinoma. Cancer Medicine. 2020;9(16):5960–5975. 10.1002/cam4.3240.
  • 6. Liang J-y, Wang D-s, Lin H-c, et al. A novel ferroptosis-related gene signature for overall survival prediction in patients with hepatocellular carcinoma. International Journal of Biological Sciences. 2020;16(13):2430. 10.7150/ijbs.45050.
  • 7. Moor M, Banerjee O, Abad ZSH, et al. Foundation models for generalist medical artificial intelligence. Nature. 2023;616(7956):259–265. 10.1038/s41586-023-05881-4.
  • 8. Swanson K, Wu E, Zhang A, Alizadeh AA, Zou J. From patterns to patients: Advances in clinical machine learning for cancer diagnosis, prognosis, and treatment. Cell. 2023;186(8):1772–1791. 10.1016/j.cell.2023.01.035.
  • 9. Tran KA, Kondrashova O, Bradley A, Williams ED, Pearson JV, Waddell N. Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Medicine. 2021;13:1–17. 10.1186/s13073-021-00968-x.
  • 10. Wafi A, Mirnezami R. Translational-omics: Future potential and current challenges in precision medicine. Methods. 2018;151:3–11. 10.1016/j.ymeth.2018.05.009.
  • 11. Qian Y, Daza J, Itzel T, et al. Prognostic cancer gene expression signatures: current status and challenges. Cells. 2021;10(3):648. 10.3390/cells10030648.
  • 12. Deng F, Zhang L. Association of normalization, non-differentially expressed genes and data source with machine learning performance in intra-dataset or cross-dataset modelling of transcriptomic and clinical data. arXiv preprint arXiv:2502.18888. 2025.
  • 13. Leek JT, Scharpf RB, Bravo HC, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics. 2010;11(10):733–739. 10.1038/nrg2825.
  • 14. Corchete LA, Rojas EA, Alonso-López D, De Las Rivas J, Gutiérrez NC, Burguillo FJ. Systematic comparison and assessment of RNA-seq procedures for gene expression quantitative analysis. Scientific Reports. 2020;10(1):19737. 10.1038/s41598-020-76881-x.
  • 15. Zhang Y, Parmigiani G, Johnson WE. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genomics and Bioinformatics. 2020;2(3):lqaa078. 10.1093/nargab/lqaa078.
  • 16. Ma B, Yan G, Chai B, Hou X. XGBLC: an improved survival prediction model based on XGBoost. Bioinformatics. 2022;38(2):410–418. 10.1093/bioinformatics/btab675.
  • 17. Orouji S, Liu MC, Korem T, Peters MA. Domain adaptation in small-scale and heterogeneous biological datasets. Science Advances. 2024;10(51):eadp6040. 10.1126/sciadv.adp6040.
  • 18. Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A primer on deep learning in genomics. Nature Genetics. 2019;51(1):12–18. 10.1038/s41588-018-0295-5.
  • 19. Hamid OH. From model-centric to data-centric AI: A paradigm shift or rather a complementary approach? 2022 8th International Conference on Information Technology Trends (ITT). 2022:196–199. 10.1109/ITT56123.2022.9863935.
  • 20. Zha D, Bhat ZP, Lai K-H, Yang F, Hu X. Data-centric AI: Perspectives and challenges. Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). 2023:945–948. 10.1137/1.9781611977653.ch106.
  • 21. Zha D, Bhat ZP, Lai K-H, et al. Data-centric artificial intelligence: A survey. ACM Computing Surveys. 2025;57(5):1–42. 10.1145/3711118.
  • 22. Hoadley KA, Yau C, Hinoue T, et al. Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer. Cell. 2018;173(2):291–304.e296. 10.1016/j.cell.2018.03.022.
  • 23. Chen J, Yang H, Teo ASM, et al. Genomic landscape of lung adenocarcinoma in East Asians. Nat Genet. 2020;52(2):177–186. 10.1038/s41588-019-0569-6.
  • 24. Wang L-B, Karpova A, Gritsenko MA, et al. Proteogenomic and metabolomic characterization of human glioblastoma. Cancer Cell. 2021;39(4):509–528.e520. 10.1016/j.ccell.2021.01.006.
  • 25. Deng F, Feng CH, Gao N, Zhang L. Normalization and selecting non-differentially expressed genes improve machine learning modelling of cross-platform transcriptomic data. Transactions on Artificial Intelligence. 2025;1(1):5. 10.53941/tai.2025.100005.
  • 26. Foltz SM, Greene CS, Taroni JN. Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously. Communications Biology. 2023;6(1):222. 10.1038/s42003-023-04588-6.
  • 27. Thompson J, Tan J, Greene C. Cross-platform normalization of microarray and RNA-seq data for machine learning applications. 2016.
  • 28. Mansour M. Non-parametric statistical test for testing exponentiality with applications in medical research. Statistical Methods in Medical Research. 2020;29(2):413–420. 10.1177/0962280218824979.
  • 29. Wu X, Geng Z, Zhao Q. Non-parametric statistics. 2017.
  • 30. Vandesompele J, De Preter K, Pattyn F, et al. Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biology. 2002;3:1–12. 10.1186/gb-2002-3-7-research0034.
  • 31. Noble WS. What is a support vector machine? Nature Biotechnology. 2006;24(12):1565–1567. 10.1038/nbt1206-1565.
  • 32. Hosmer DW Jr, Lemeshow S, Sturdivant RX. Applied logistic regression. John Wiley & Sons; 2013.
  • 33. Ma B, Meng F, Yan G, Yan H, Chai B, Song F. Diagnostic classification of cancers using extreme gradient boosting algorithm and multi-omics data. Computers in Biology and Medicine. 2020;121:103761. 10.1016/j.compbiomed.2020.103761.
  • 34. Sheridan RP, Wang WM, Liaw A, Ma J, Gifford EM. Extreme gradient boosting as a method for quantitative structure-activity relationships. Journal of Chemical Information and Modeling. 2016;56(12):2353–2360. 10.1021/acs.jcim.6b00591.
  • 35. Roth V. The generalized LASSO. IEEE Transactions on Neural Networks. 2004;15(1):16–28. 10.1109/TNN.2003.809398.
  • 36. Chen M, Li P, Yao H, et al. Development and validation of a novel defined mutation classifier based on Lasso logistic regression for predicting the overall survival of immune checkpoint inhibitor therapy in renal cell carcinoma. Translational Andrology and Urology. 2023;12(3):406. 10.21037/tau-23-21.
  • 37. Dunne RA. A statistical approach to neural networks for pattern recognition. John Wiley & Sons; 2007.
  • 38. Karthik S, Sudha M. A survey on machine learning approaches in gene expression classification in modelling computational diagnostic system for complex diseases. International Journal of Engineering and Advanced Technology. 2018;8(2):182–191.
  • 39. Deng F, Zhao L, Yu N, Lin Y, Zhang L. Union with recursive feature elimination: A feature selection framework to improve the classification performance of multicategory causes of death in colorectal cancer. Lab Invest. 2024;104(3):100320. 10.1016/j.labinv.2023.100320.
  • 40. Chicco D, Tötsch N, Jurman G. The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Mining. 2021;14:1–22. 10.1186/s13040-021-00244-z.
  • 41. Khatun N. Applications of normality test in statistical analysis. Open Journal of Statistics. 2021;11(01):113. 10.4236/ojs.2021.111006.
  • 42. Thadewald T, Büning H. Jarque-Bera test and its competitors for testing normality: a power comparison. Journal of Applied Statistics. 2007;34(1):87–105. 10.1080/02664760600994539.
  • 43. Guzik P, Więckowska B. Data distribution analysis: a preliminary approach to quantitative data in biomedical research. Journal of Medical Science. 2023;92(2):e869. 10.20883/medical.e869.
  • 44. MacFarland TW, Yates JM. Kruskal-Wallis H-test for one-way analysis of variance (ANOVA) by ranks. Introduction to Nonparametric Statistics for the Biological Sciences Using R. 2016:177–211. 10.1007/978-3-319-30634-6_6.
  • 45. Li Z. Extracting spatial effects from machine learning model using local interpretation method: An example of SHAP and XGBoost. Computers, Environment and Urban Systems. 2022;96. 10.1016/j.compenvurbsys.2022.101845.
  • 46. Lundberg SM, Erion G, Chen H, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2(1):56–67. 10.1038/s42256-019-0138-9.
  • 47. Van den Broeck G, Lykov A, Schleich M, Suciu D. On the tractability of SHAP explanations. Journal of Artificial Intelligence Research. 2022;74:851–886. 10.1613/jair.1.13283.
  • 48. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016:785–794. 10.1145/2939672.2939785.
  • 49. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems. 2017;30. 10.48550/arXiv.1705.07874.
  • 50. Kumar IE, Venkatasubramanian S, Scheidegger C, Friedler S. Problems with Shapley-value-based explanations as feature importance measures. International Conference on Machine Learning. 2020:5491–5500. 10.48550/arXiv.2002.11097.
  • 51. Razali NM, Wah YB. Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests. Journal of Statistical Modeling and Analytics. 2011;2(1):21–33.
  • 52. Chicco D, Sichenze A, Jurman G. A simple guide to the use of Student’s t-test, Mann-Whitney U test, Chi-squared test, and Kruskal-Wallis test in biostatistics. BioData Mining. 2025;18(1):56. 10.1186/s13040-025-00465-6.
  • 53. Sedgwick P. A comparison of parametric and non-parametric statistical tests. BMJ. 2015;350. 10.1136/bmj.h2053.
  • 54. Fagerland MW. t-tests, non-parametric tests, and large studies—a paradox of statistical practice? BMC Medical Research Methodology. 2012;12(1):78. 10.1186/1471-2288-12-78.
  • 55. Hollander M, Wolfe DA, Chicken E. Nonparametric statistical methods. John Wiley & Sons; 2013.
  • 56. Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–2517. 10.1093/bioinformatics/btm344.
  • 57. D’Amour A, Heller K, Moldovan D, et al. Underspecification presents challenges for credibility in modern machine learning. Journal of Machine Learning Research. 2022;23(226):1–61. 10.48550/arXiv.2011.03395.
  • 58. Demšar J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research. 2006;7:1–30.
  • 59. Moosbauer J, Herbinger J, Casalicchio G, Lindauer M, Bischl B. Explaining hyperparameter optimization via partial dependence plots. Advances in Neural Information Processing Systems. 2021;34:2280–2291. 10.48550/arXiv.2111.04820.
  • 60. Montgomery DC. Design and analysis of experiments. John Wiley & Sons; 2017.
  • 61. Fisher A, Rudin C, Dominici F. All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously. Journal of Machine Learning Research. 2019;20(177):1–81. 10.48550/arXiv.1801.01489.
  • 62. Bebchuk J, Wittes J. Fundamentals of biostatistics. Cambridge University Press; 2012.
  • 63. Eltas Ö. Comparison of the Mann-Whitney U test and independent samples t-test (independent Student’s t-test) in terms of power in small samples used in biostatistics studies. 2021. 10.17094/ataunivbd.876777.
  • 64. Arjovsky M, Bottou L, Gulrajani I, Lopez-Paz D. Invariant risk minimization. arXiv preprint arXiv:1907.02893. 2019. 10.48550/arXiv.1907.02893.
  • 65. Sagawa S, Koh PW, Hashimoto TB, Liang P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731. 2019. 10.48550/arXiv.1911.08731.
  • 66. Khosla P, Teterwak P, Wang C, et al. Supervised contrastive learning. Preprint. Posted online April 23, 2020. arXiv:2004.11362. 10.48550/arXiv.2004.11362.
  • 67. Bian H, Chen Y, Luo E, et al. General-purpose pre-trained large cellular models for single-cell transcriptomics. National Science Review. 2024;11(11):nwae340. 10.1093/nsr/nwae340.
  • 68. Collins GS, Moons KGM, Dhiman P, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024;385:e078378. 10.1136/bmj-2023-078378.
