Abstract
High-throughput omics data present challenges for binary classification due to platform variability, batch effects, missing values, and high dimensionality. This study presents a novel Rank-Based Learning (RBL) method that leverages relative feature rankings to improve robustness and generalizability. We evaluated RBL against established methods such as Logistic Regression (LR) and Random Forest (RF) using simulated data and two real-world plasma proteomics datasets: early-stage small cell lung cancer (SCLC) and duodenopancreatic neuroendocrine tumors (dpNET) in patients with Multiple Endocrine Neoplasia type 1 (MEN1). In simulation experiments, RBL outperformed LR under conditions involving batch effects, missing data, and varying numbers of true differential features. In SCLC, RBL yielded a test AUC of 0.76 (95% CI: 0.42–1.00), surpassing LR with Lasso (0.65 [95% CI: 0.47–0.84]) and RF with feature importance (0.59 [95% CI: 0.36–0.87]). In dpNET, RBL achieved an AUC of 0.83 (95% CI: 0.67–0.97) on the development set and 0.80 (95% CI: 0.54–0.98) on the test set, outperforming LR with Lasso (0.57 [95% CI: 0.40–0.77]) and RF with feature importance (0.53 [95% CI: 0.29–0.77]). By emphasizing feature ranking rather than absolute expression levels, RBL effectively mitigates the impact of non-biological variation. Overall, RBL improves the predictive accuracy of diagnostic models for complex diseases and provides a promising framework for developing more reliable and generalizable diagnostic tools from omics data, moving them closer to clinical application.
Keywords: rank-based learning, high-throughput omics, missing data, machine learning
Introduction
High-throughput omics technologies, such as genomics, proteomics, and metabolomics, have revolutionized biomedical research by enabling comprehensive analysis of biological molecules on a large scale [1–4]. These technologies facilitate the identification of genetic variants, protein expression patterns, and metabolic profiles, offering critical insights into the molecular mechanisms underlying diseases. By capturing vast quantities of biological information, omics technologies have substantially advanced our ability to classify disease subtypes and identify biomarkers for early detection, prognosis, and therapeutic response. For example, we conducted a comprehensive proteomic profiling of plasma samples from small-cell lung cancer (SCLC) patients and identified circulating protein markers associated with disease pathogenesis [5]. Similarly, gene expression profiling was used to classify breast cancer subtypes and predict clinical outcomes, demonstrating the prognostic power of genomic data [6]. More recent studies have further enhanced cancer subtype classification and biomarker discovery through multi-omics integration and deep learning approaches [7–9].
Despite these advances, high-throughput omics technologies face key limitations due to their dependence on specific experimental platforms [10–13]. Many studies are constrained by the technical characteristics of the platforms used to generate the data, which limits cross-platform validation and the generalizability of the findings. Most existing models rely on the absolute magnitudes of selected features within a particular platform [14, 15]. Consequently, variations in the data pipeline, such as differences in sample handling protocols, can introduce systematic shifts that lead to inconsistent predictions when the same algorithm is applied to new data. Such algorithms also depend on the platform's sensitivity to reproducibly capture the same features; when applied to data generated from a different platform, their performance may degrade, resulting in poor reproducibility across studies or experimental conditions [15–17].
Another common limitation in omics research is the "large p, small n" problem, in which the number of features (p) greatly exceeds the number of samples (n) [18–21]. This challenge is typically mitigated by applying feature-selection procedures to retain the most discriminative variables across experimental conditions [22–27]. However, this process can inadvertently limit the discovery potential of omics studies, as it focuses on a small subset of features, and may exclude biologically relevant signals.
Recent deep learning frameworks, such as scAMZI, scRGCL, and scMGATGRN, have advanced omics and single-cell analyses by applying attention mechanisms and graph neural networks to capture complex biological dependencies [28–30]. Similarly, ensemble transfer learning models, such as the voting transfer approach, have improved protein-level classification under high-dimensional and imbalanced conditions. While these deep learning methods demonstrate strong predictive performance, they typically require large datasets and complex architectures that are prone to overfitting, particularly in small-sample, high-dimensional settings [31, 32].
To address these challenges, we developed a novel Rank-Based Learning (RBL) algorithm designed to enhance robustness to platform variability and data inconsistencies. Instead of relying on absolute values, RBL leverages the ranking distribution of features for classification, identifying the ranking patterns that optimally distinguish between sample classes across multiple datasets. This approach preserves the full spectrum of informative features, reducing the risk of excluding potentially important signals. Because it operates on rankings rather than raw measurements, RBL facilitates cross-platform validation and remains resilient to missing values by incorporating all available features within each sample.
This paper is organized as follows. Section 2 demonstrates the superiority of the RBL method over LR in simulations and over both LR and RF in real-world datasets. Section 3 provides a discussion of our findings and concluding remarks. Section 4 describes the RBL framework.
Results
We evaluated the performance of the RBL method in four different simulation studies and two real-world applications, as explained in detail in the Materials and Methods section.
Simulation studies
True differential scenario
In simulations with 60 observations (30 cases and 30 controls) and 300 features, for 1% true differential features, the test AUCs were 0.82 (95% CI: 0.80–0.84) for RBL and 0.83 (95% CI: 0.80–0.87) for LR. As the proportion of true differential features increased to 5%, 10%, and 20%, RBL’s test AUCs improved to 0.95 (95% CI: 0.94–0.96), 0.99 (95% CI: 0.99–1.00), and 1.00 (95% CI: 1.00–1.00), respectively. In comparison, LR achieved test AUCs of 0.88 (95% CI: 0.86–0.91), 0.85 (95% CI: 0.83–0.88), and 0.85 (95% CI: 0.83–0.88) for the same scenarios (Supplementary Table S1; Supplementary Fig. S1).
In simulations with an increased sample size of 200 observations (100 cases and 100 controls), RBL consistently outperformed LR in scenarios with 5% true differential features, achieving a test AUC of 0.90 (95% CI: 0.89–0.91) compared to LR’s 0.85 (95% CI: 0.83–0.86) (Supplementary Table S2).
Missing scenario
With 1% true differential features and 10% missing values, RBL achieved a test AUC of 0.81 (95% CI: 0.79–0.83), compared to 0.69 (95% CI: 0.66–0.73) for LR. As the percentage of missing values increased from 10% to 50%, RBL consistently outperformed LR in both the development and test sets. Specifically, at 50% missingness, RBL attained a test AUC of 0.72 (95% CI: 0.69–0.74) versus 0.53 (95% CI: 0.51–0.56) for LR. This trend continued at higher proportions of true differential features. At 5% true differential features and 50% missingness, RBL maintained a test AUC of 0.82 (95% CI: 0.80–0.84), whereas LR had a test AUC of 0.55 (95% CI: 0.52–0.58). At 10% true differential features and the same level of missingness, RBL's test AUC was 0.92 (95% CI: 0.90–0.94) compared to LR's AUC of 0.59 (95% CI: 0.56–0.61). Finally, at the highest true differential feature level (20%) and 50% missingness, RBL achieved a test AUC of 0.95 (95% CI: 0.94–0.97), substantially outperforming LR's AUC of 0.63 (95% CI: 0.60–0.66) (Supplementary Table S3; Supplementary Fig. S2).
Batch effect scenario
With 1% true differential features, RBL achieved a test AUC of 0.98 (95% CI: 0.98–0.98), outperforming LR at 0.69 (95% CI: 0.67–0.70). At 5%, the test AUC for RBL reached 1.00 (95% CI: 1.00–1.00), while LR attained 0.70 (95% CI: 0.69–0.72). At 10% and 20% true differential features, RBL consistently achieved test AUCs of 1.00 (95% CI: 1.00–1.00), whereas LR reached test AUCs of 0.72 (95% CI: 0.70–0.73) and 0.71 (95% CI: 0.70–0.72), respectively (Supplementary Table S4; Supplementary Fig. S3).
Correlation scenario
In this setting, both methods showed consistently high development AUCs. RBL test AUCs increased with the proportion of true differential features, from 0.86 (95% CI: 0.83–0.88) at 1% to 1.00 (95% CI: 1.00–1.00) at 10% and 20%. LR followed a similar trend, with test AUCs rising from 0.96 (95% CI: 0.94–0.98) to 1.00 (95% CI: 1.00–1.00). Both methods performed equally well when 5% or more of the features were truly differential (Supplementary Table S5; Supplementary Fig. S4).
Real-world data application
Proteomics dataset for detection of newly diagnosed early-stage small-cell lung cancer
The dataset contains 4,388 gene-protein products. Among the cases, eight are male and seven are female, with the same distribution in the controls. Six cases are in stage I, and nine are in stage II. The mean ages for cases and controls are 67 and 64 years, respectively (Table 1).
Table 1.
Patient and tumor characteristics for SCLC cohort
| Patient and Tumor Characteristics | cases | controls |
|---|---|---|
| N | 15 | 15 |
| Age, mean (std) | 67 (10) | 64 (5) |
| Sex, N (%) | ||
| Male | 8 (53.3%) | 8 (53.3%) |
| Female | 7 (46.7%) | 7 (46.7%) |
| Stage, N (%) | ||
| I | 6 (40%) | - |
| II | 9 (60%) | - |
| Smoking pack-years, mean (std) | 63 (27) | 51 (18) |
RBL achieved an AUC of 0.84 (95% CI: 0.60–1.00) on the development set and 0.76 (95% CI: 0.42–1.00) on the test set. In comparison, LR and RF with seven features after forward feature selection yielded AUCs of 0.98 (95% CI: 0.92–1.00) and 1.00 (95% CI: 0.98–1.00) on the development set, and 0.57 (95% CI: 0.49–0.88) and 0.57 (95% CI: 0.43–0.93) on the test set, respectively. LR with six features selected by Lasso achieved an AUC of 0.95 (95% CI: 0.90–1.00) on the development set and 0.65 (95% CI: 0.47–0.84) on the test set. Using 20 RankProd-selected features, LR achieved 1.00 (95% CI: 0.90–1.00) on the development set and 0.59 (95% CI: 0.37–0.92) on the test set. RF with two features selected by feature importance showed AUCs of 1.00 (95% CI: 0.98–1.00) and 0.59 (95% CI: 0.36–0.87) on the development and test sets, respectively (Table 2; Supplementary Fig. S5).
Table 2.
Performances of RBL, LR, and RF in plasma proteomics for detection of early-stage SCLC
| Model | Feature Selection | Development set AUC (95% CI) | Test set AUC (95% CI) |
|---|---|---|---|
| LR | Lasso | 0.95 (0.90–1.00) | 0.65 (0.47–0.84) |
| LR | Rank Product | 1.00 (0.90–1.00) | 0.59 (0.37–0.92) |
| LR | Forward selection | 0.98 (0.92–1.00) | 0.57 (0.49–0.88) |
| RF | Feature importance | 1.00 (0.98–1.00) | 0.59 (0.36–0.87) |
| RF | Forward selection | 1.00 (0.98–1.00) | 0.57 (0.43–0.93) |
| RBL | — | 0.84 (0.60–1.00) | 0.76 (0.42–1.00) |
Proteomics dataset for detection of duodenopancreatic neuroendocrine tumors in patients with multiple endocrine neoplasia type 1
The dataset contains 10,937 gene-protein products. Among the cases, six are male and eight are female. Control group 1 consists of 13 males and 15 females; control group 2 includes seven males and seven females (Table 3).
Table 3.
Patient and tumor characteristics for the MEN1 cohort
| | Cases | Controls#1 | Controls#2 |
|---|---|---|---|
| n | 14 | 28 | 14 |
| Sex, n (%) | |||
| Male | 6 (43) | 13 (46) | 7 (50) |
| Female | 8 (57) | 15 (54) | 7 (50) |
| Age (median, IQR) | 52.5 (41.8-60) | 39.5 (28.5-58) | 29.5 (22-38.5) |
| BMI (median, IQR)a | 26 (22.8-32.3) | 26 (23-36) | 23 (20.5-24.5) |
| Collection site, n (%) | |||
| MDACC | 5 (36) | 3 (11) | 1 (7) |
| NIH-NIDDK | 5 (36) | 0 (0) | 1 (7) |
| UMCU | 4 (29) | 25 (89) | 12 (86) |
| PanNET | 13b | 28 | — |
| Prior dpNET surgery, n (%) | 7 (50) | 1 (4) | — |
| Size largest PanNET resected | |||
| ≤20 mm | 2 | — | — |
| >20 mm | 3 | 1 | — |
| N/A (only duodenal/lymph node) | 2 | — | — |
| Size of largest PanNET (in situ at sample collection) | | | |
| Median (IQR), mm | 12 (4-29) | 11 (6-23) | — |
| <20 mm, n (%) | 11 (79) | 26 (93) | — |
| ≥20 mm, n (%) | 2 (14) | 2 (7) | — |
Abbreviations: BMI, body mass index; dpNET, duodenopancreatic neuroendocrine tumor; IQR, interquartile range; MDACC, MD Anderson Cancer Center; N/A, not applicable; NIH-NIDDK, National Institutes of Health-National Institute of Diabetes and Digestive and Kidney Diseases; PanNET, pancreatic neuroendocrine tumor; UMCU, University Medical Center Utrecht.
a BMI data were not available for 10 control subjects.
b One case had total or partial pancreatectomy but presented with dpNET-related liver metastasis at the time of blood collection.
RBL yielded an AUC of 0.83 (95% CI: 0.67–0.97) on the development set and 0.80 (95% CI: 0.54–0.98) on the test set. In comparison, LR with 28 Lasso-selected features yielded a development AUC of 0.80 (95% CI: 0.56–0.80) and a test AUC of 0.57 (95% CI: 0.40–0.77). LR with two RankProd-selected features achieved development and test AUCs of 0.74 (95% CI: 0.50–0.88) and 0.50 (95% CI: 0.18–0.85), respectively. RF with four features selected by feature importance demonstrated a development AUC of 1.00 (95% CI: 0.96–1.00) and a test AUC of 0.53 (95% CI: 0.29–0.77) (Table 4; Supplementary Fig. S6).
Table 4.
Performances of RBL, LR, and RF in plasma proteomics for detection of dpNET in patients with MEN1
| Model | Feature Selection | Development set AUC (95% CI) | Test set AUC (95% CI) |
|---|---|---|---|
| LR | Lasso | 0.80 (0.56-0.80) | 0.57 (0.40-0.77) |
| LR | Rank Product | 0.74 (0.50-0.88) | 0.50 (0.18-0.85) |
| RF | Feature importance | 1.00 (0.96-1.00) | 0.53 (0.29-0.77) |
| RBL | — | 0.83 (0.67-0.97) | 0.80 (0.54-0.98) |
Discussion
This study presents a novel machine learning method, termed Rank-Based Learning (RBL), designed to classify high-throughput omics data by estimating the optimal ranking profile of features that best distinguishes two groups, such as cancer versus healthy controls. RBL employs a similarity-based scoring function to assess rankings and uses the Metropolis–Hastings stochastic search algorithm to identify the optimal feature order. The method was systematically evaluated across four simulation scenarios as well as two real-world datasets, SCLC and duodenopancreatic neuroendocrine tumors (dpNET), and compared with established approaches including Logistic Regression (LR) and Random Forest (RF). RBL's performance improved with increasing proportions of true differential features and demonstrated robustness under conditions of missingness and batch effects. Notably, RBL outperformed LR and RF in test AUC across both real-world datasets.
By leveraging feature rankings rather than absolute measurements, RBL effectively handles high-dimensional data and enhances biomarker discovery when the sample size is limited relative to the number of features. This property makes RBL particularly well-suited for omics data, which are often affected by technical and biological variability [10]. Through its reliance on relative orderings, RBL mitigates the impact of batch effects and normalization discrepancies that commonly challenge high-throughput analyses. This robustness enhances its applicability in clinical research settings, where data heterogeneity is a significant challenge. Additionally, RBL incorporates all available quantified features, which improves generalizability to independent test datasets compared with feature-selection-based methods and maintains stability even in the presence of missing values.
In both SCLC and dpNET datasets, RBL exhibited slightly lower development AUCs compared to conventional methods but consistently superior generalization on test data. Specifically, RBL achieved development AUCs of 0.84 (95% CI: 0.60–1.00) for SCLC and 0.83 (95% CI: 0.67–0.97) for dpNET. This difference likely reflects RBL's reliance on relative feature rankings rather than dataset-specific absolute intensities. Despite this, RBL achieved test set AUCs of 0.76 (95% CI: 0.42–1.00) for SCLC and 0.80 (95% CI: 0.54–0.98) for dpNET, underscoring its stronger generalization to unseen data.
In the SCLC dataset, RBL identified an optimal feature rank profile (Supplementary Table S7) that captures distinctive relative expression patterns characteristic of SCLC. Within this profile, several known SCLC-associated overexpressed proteins, including NCAM1, FUT1, KSR2 and TFRC [33–36], were consistently ranked above ACTB, a highly abundant and ubiquitously expressed protein often used as a stable internal control in molecular biology experiments due to its relatively constant expression across different cell types and physiological conditions [37]. This ranking pattern indicates that these SCLC-related proteins are consistently elevated relative to ACTB in SCLC cases, whereas such ordering is absent or reversed in controls.
Similarly, in the dpNET dataset, proteins such as IGFBP2, CHI3L1, TIMP1, and COL18A1 also consistently ranked above ACTB in the optimal feature rank profile (Supplementary Table S8). These proteins have previously been reported as elevated in early-stage pancreatic ductal adenocarcinoma (PDAC) cases compared to healthy or benign conditions [38, 39], further demonstrating RBL's capability to uncover biologically meaningful relative expression patterns across different cancer types.
Beyond these findings, biological validation of the learned ranking profiles can further strengthen the interpretability. The high-ranking proteins identified by RBL can be validated through independent experimental assays, such as ELISA, Western blot, or immunohistochemistry, to confirm their differential abundance between cancer and control samples. Moreover, proteins that consistently appear above stable housekeeping proteins (e.g., ACTB) in the ranking profile may represent biologically relevant targets. These candidates can be investigated using functional studies, such as gene knockdown, knockout, or overexpression experiments, to determine their mechanistic roles in tumor development or progression. Together, these approaches may provide a biologically grounded framework for validating and interpreting RBL-derived rank profiles across independent datasets.
This study has several limitations. First, RBL is designed for binary classification and cannot be directly applied to multi-class problems. However, the framework has flexibility to be extended by adapting its loss function to evaluate similarity across multiple classes. One approach is to learn pairwise rank profiles between classes using one-vs-one or one-vs-rest schemes, where the total similarity score, originally computed as the difference between cases and controls, is redefined to compare each target class against its reference class (e.g., all remaining samples). Future work may also explore a unified formulation that learns a single ranking profile to maximize separation among class-specific mean similarity scores while minimizing within-class variability. These extensions would enable RBL to generalize naturally to multi-class settings while preserving its rank-based framework.
Second, RBL focuses exclusively on ranking features without considering the magnitude of intensity differences, potentially overlooking certain quantitative effects.
Furthermore, RBL can be computationally intensive, especially for datasets with a large number of features. To improve efficiency, future implementations could leverage Numba-based just-in-time (JIT) compilation, which compiles Python functions into optimized machine code at runtime; delta-based incremental updates that reuse pairwise comparisons during MCMC iterations; and parallelization across high-performance computing (HPC) clusters to further accelerate computation.
Additionally, while RBL mitigates missing data by excluding pairs with missing values, this strategy is appropriate only when missingness is random; it may introduce bias when missingness is non-random. This limitation is not unique to RBL; existing methods, including statistical and machine learning approaches, also struggle with missing-not-at-random (MNAR) mechanisms [40–42]. Addressing such informative missingness remains a well-recognized challenge in omics analyses and represents an important direction for future methodological development.
In conclusion, RBL is a promising learning algorithm for binary classification with high-throughput omics data as input, even in the presence of missing data and batch effects.
Materials and Methods
Development of RBL
Let $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_p)$ be a permutation of $\{1, 2, \ldots, p\}$, which we call a feature rank profile. Using a training data set, we seek a feature rank profile $\sigma^{*}$ that is highly concordant with group 1 (i.e. cases) but discordant with group 2 (i.e. controls) when comparing two groups of data, defined precisely in the following paragraphs. Once the feature rank profile is determined, it can be used to compute risk scores for a test set.

Let $\mathbf{x} = (x_1, x_2, \ldots, x_p)$ denote a vector of biomarker values (e.g. protein expression). We define a similarity score to indicate whether the ranking of features $i$ and $j$ in $\mathbf{x}$ agrees with rank profile $\sigma$. Specifically, let

$$
s_{ij}(\mathbf{x}, \sigma) =
\begin{cases}
1, & \text{if features } i \text{ and } j \text{ have the same relative rank in } \sigma \text{ and } \mathbf{x}, \\
-1, & \text{if their ordering is reversed}, \\
0, & \text{if } x_i \text{ or } x_j \text{ is missing},
\end{cases}
\tag{1}
$$

for $1 \le i < j \le p$. If the relative rank between features $i$ and $j$ is the same in both rank profile $\sigma$ and feature vector $\mathbf{x}$, the score is 1. If the ordering between them is reversed, the score is −1. If either $x_i$ or $x_j$ is missing, the ordering cannot be determined, and the score is 0.

We sum the scores over all possible pairs of features to obtain the similarity score of $\mathbf{x}$ with $\sigma$:

$$
S(\mathbf{x}, \sigma) = \sum_{1 \le i < j \le p} s_{ij}(\mathbf{x}, \sigma).
\tag{2}
$$

Let $\mathbf{x}^{(1)}_k$ for $k = 1, \ldots, n_1$ denote biomarker expression vectors for $n_1$ cases and $\mathbf{x}^{(2)}_l$ for $l = 1, \ldots, n_2$ denote biomarker expression vectors for $n_2$ controls. We sum similarity across cases and controls separately and compute the difference. The result is termed the total similarity score (TSS):

$$
\mathrm{TSS}(\sigma) = \sum_{k=1}^{n_1} S\left(\mathbf{x}^{(1)}_k, \sigma\right) - \sum_{l=1}^{n_2} S\left(\mathbf{x}^{(2)}_l, \sigma\right).
\tag{3}
$$

The goal is to find the ranking profile $\sigma^{*}$ that maximizes the TSS:

$$
\sigma^{*} = \arg\max_{\sigma} \mathrm{TSS}(\sigma).
\tag{4}
$$

We discuss algorithms for solving Equation 4 below. Once $\sigma^{*}$ is determined, risk scores for each sample in the test set T are computed using Equation 2 by evaluating their similarity to the learned ranking profile.
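The pairwise scoring and the TSS can be sketched in Python as follows. This is a minimal illustration, not the released implementation; we use NaN to mark missing values, and we treat exact ties as score 0, an assumption the text does not specify.

```python
import numpy as np

def similarity_score(x, sigma):
    """Similarity of sample x with rank profile sigma (Equations 1-2).

    x     : 1-D array of biomarker values; np.nan marks missing features.
    sigma : permutation of feature indices; sigma[0] is the top-ranked feature.
    Returns the sum over feature pairs of +1 (concordant), -1 (discordant),
    or 0 (either value missing; ties also contribute 0 in this sketch).
    """
    p = len(sigma)
    score = 0
    for a in range(p):
        for b in range(a + 1, p):
            i, j = sigma[a], sigma[b]  # feature i is ranked above feature j
            if np.isnan(x[i]) or np.isnan(x[j]):
                continue               # ordering undetermined -> score 0
            if x[i] > x[j]:
                score += 1             # concordant with the profile
            elif x[i] < x[j]:
                score -= 1             # ordering reversed
    return score

def total_similarity_score(cases, controls, sigma):
    """TSS (Equation 3): summed case similarity minus summed control similarity."""
    return (sum(similarity_score(x, sigma) for x in cases)
            - sum(similarity_score(x, sigma) for x in controls))
```

At test time, `similarity_score` applied to a new sample with the learned profile yields its continuous risk score, as in Equation 2.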
Search procedure, pseudo-code, and flowchart of RBL
The objective function is defined over the space of all permutations of p elements, resulting in a factorially large, discrete search space. As traditional gradient-based methods are inapplicable, we adopt the Metropolis–Hastings stochastic search method [43], which is effective for exploring large discrete spaces. The process starts by initializing 100 random permutations and computing their total similarity scores as described in Equations (1)–(3), from which the corresponding AUCs are derived. The permutation with the highest combined AUC across Development Data 1 (D1), D2, and D3 is selected as the initial candidate. At each iteration, two elements in the permutation are swapped, similarity scores are recalculated, and the permutation is updated if performance improves. If not, the swap is reverted, and a new pair is randomly selected. This process continues until no further improvement is observed in D1 based on a predefined convergence criterion. The best-performing permutation is then selected. A flowchart of the process is shown in Fig. 1, and the pseudo-code is provided in Table 5.
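The search loop described above can be sketched as follows. This is a simplified single-objective version: the full procedure scores permutations by combined AUC across D1–D3 and uses the convergence parameter g, which we approximate here with a cap on consecutive non-improving swaps; `score_fn`, `max_stale`, and the other names are illustrative, not from the released code.

```python
import random

def rbl_search(score_fn, p, n_init=100, max_stale=1000, seed=0):
    """Stochastic swap search for a high-scoring rank profile (sketch).

    score_fn : callable mapping a permutation (list of feature indices)
               to a scalar objective, e.g. the TSS on development data.
    p        : number of features.
    max_stale: stop after this many consecutive non-improving swaps
               (a stand-in for the paper's convergence parameter g).
    """
    rng = random.Random(seed)
    # Initialization: keep the best of n_init random permutations.
    best = max((rng.sample(range(p), p) for _ in range(n_init)), key=score_fn)
    best_score = score_fn(best)
    stale = 0
    while stale < max_stale:
        i, j = rng.sample(range(p), 2)      # pick two distinct positions
        cand = best.copy()
        cand[i], cand[j] = cand[j], cand[i]  # swap them
        s = score_fn(cand)
        if s > best_score:                   # keep the swap only if it improves
            best, best_score, stale = cand, s, 0
        else:
            stale += 1                       # revert by discarding the candidate
    return best, best_score
```

On a toy objective, e.g. the negative number of inversions, this loop recovers the identity permutation, illustrating that strictly improving swaps drive the search toward the optimum.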
Figure 1.
Flowchart of the search for the optimal permutation.
Table 5.
Algorithm for finding the optimal similarity score
| Algorithm 1. Pseudo-code of RBL |
Start RBL
|
Algorithm input and output
Algorithm 1 in Table 5 searches for the optimal similarity score based on the following input and output:
Input: Parameter g (defining the convergence criterion by limiting the number of iterations) and development datasets (D1, D2, D3 – labeled datasets containing feature vectors and class labels, e.g., case or control).
Output: Optimal permutation σ* = (a1, a2, ..., ap), where each ai ∈ {1, 2, ..., p} denotes the index of the feature ranked at position i.
Computational complexity analysis
RBL consists of three main components: initialization, score computation, and iterative permutation updates. Here, C denotes the number of permutations generated for initialization, N the number of samples, K the number of feature pairs, T the maximum number of iterations, and P the number of parameter settings to be validated. The computational complexities of calculating the similarity score, initialization, and permutation updates are O(N×K), O(C×N×K), and O(T×N×K), respectively. Therefore, the total complexity of RBL is O((C+T)×N×K×P). Given this computational cost, RBL can be run on high-performance computing (HPC) resources. Computational settings and runtime details are summarized in Supplementary Table S6.
Model evaluation
For each test sample, a similarity score was computed by comparing its feature ranking to the optimal rank profile learned from the training data. Higher scores indicated greater alignment with case-like patterns; lower scores suggested control-like profiles. These similarity scores served as continuous prediction outputs. The performance of RBL was evaluated using the Area Under the Receiver Operating Characteristic Curve (AUC). In the real datasets, we compared RBL with LR and RF under various feature selection strategies: LR with Lasso; LR with Rank Product-based feature selection, a non-parametric method that identifies features consistently ranked across samples [44]; RF with feature importance-based selection; and forward feature selection. In the simulation datasets, RBL was compared with LR with forward feature selection. For forward selection, features were sequentially selected using LR, with the optimal feature set determined via cross-validation; the resulting feature subset was then used to train both LR and RF models. For the real datasets, confidence intervals (95% CI) for AUC were estimated via bootstrap (1,000 resamples with replacement). For simulation experiments, 95% CIs were computed using the t-distribution to assess mean AUC performance across simulation replicates (n = 100), calculated as:
CI = mean ± tα/2, df × SE,

where α = 0.05, tα/2, df is the critical value from the t-distribution with df = n − 1 degrees of freedom, and SE is the standard error.
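As a concrete illustration, the t-based interval can be computed as follows. This is a minimal sketch using SciPy; the function name and return convention are ours, not from the released implementation.

```python
import numpy as np
from scipy import stats

def t_confidence_interval(aucs, alpha=0.05):
    """Mean AUC with a t-distribution CI across simulation replicates.

    aucs : per-replicate AUC values (n = 100 replicates in the paper).
    Returns (mean, lower, upper) for the (1 - alpha) interval
        mean +/- t_{alpha/2, n-1} * SE, with SE = sd / sqrt(n).
    """
    aucs = np.asarray(aucs, dtype=float)
    n = aucs.size
    mean = aucs.mean()
    se = aucs.std(ddof=1) / np.sqrt(n)          # standard error of the mean
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    return mean, mean - t_crit * se, mean + t_crit * se
```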
Software implementation
RBL was implemented in Python 3, and the source code is available at:
Dataset description
Simulation studies
We evaluated the performance of RBL using simulated data, comparing it with LR with forward feature selection across the scenarios described below. Unless otherwise specified, all datasets contained 300 features and 60 observations (30 cases and 30 controls). The batch effect scenario included 120 observations (60 cases and 60 controls). To further assess consistency and robustness, we conducted supplementary simulations with an increased sample size (200 observations: 100 cases and 100 controls) on scenarios with 5% true differential features. Features were sampled from normal distributions with means and variances randomly chosen from uniform distributions U(50, 70) and U(1, 80), respectively. The following simulation scenarios were considered:
a. Basic true differential scenarios
To assess performance under varying signal strengths, we simulated datasets containing 1%, 5%, 10%, and 20% truly differentially expressed features. Differential expression was induced by adding a constant (sampled from N(60, 60)) to the selected features in one class. Significance was verified via two-sample t-tests.
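A minimal sketch of this generation scheme, under our reading of the stated distributions: we treat N(60, 60) as mean 60 and variance 60, and draw one shift per differential feature, applied to all cases. The function and variable names are illustrative, not from the released code.

```python
import numpy as np

def simulate_dataset(n_cases=30, n_controls=30, p=300, frac_diff=0.05, seed=0):
    """Simulate an omics-like matrix as described for the basic scenario.

    Feature means ~ U(50, 70) and variances ~ U(1, 80); a fraction
    frac_diff of features receives an added constant drawn from N(60, 60)
    in the case group only. Returns (X, y, diff_idx), samples in rows.
    """
    rng = np.random.default_rng(seed)
    n = n_cases + n_controls
    means = rng.uniform(50, 70, size=p)
    sds = np.sqrt(rng.uniform(1, 80, size=p))        # variances ~ U(1, 80)
    X = rng.normal(means, sds, size=(n, p))
    y = np.array([1] * n_cases + [0] * n_controls)   # 1 = case, 0 = control
    n_diff = int(round(frac_diff * p))
    diff_idx = rng.choice(p, size=n_diff, replace=False)
    # Assumption: N(60, 60) read as mean 60, variance 60 (sd = sqrt(60)).
    shift = rng.normal(60, np.sqrt(60), size=n_diff)
    X[:n_cases, diff_idx] += shift                   # elevate cases only
    return X, y, diff_idx
```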
b. Missing data scenarios
To assess robustness to missingness, increasing proportions of values, from 10% up to 50%, were removed completely at random.
c. Batch effect scenarios
Batch effects—systematic variations arising from differences in processing time, operator, equipment, or experimental conditions—were introduced to examine model robustness.
We added batch-specific constants to all features within each group, altering absolute values while preserving within-sample rankings. Because RBL relies on feature ranks rather than magnitudes, it is expected to be robust to such shifts.
For the batch effect scenarios, the basic true differential dataset was divided into three groups of 20 observations each (10 cases and 10 controls). Batch effects, sampled from U(100, 300), U(300, 500), and U(500, 700), were added to the respective groups, increasing absolute intensities but preserving rank order.
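This shifting step can be sketched as follows, assuming, as one plausible reading, a single scalar shift per group applied to every feature; the function name is illustrative.

```python
import numpy as np

def add_batch_effects(X, seed=0):
    """Add group-wise constant shifts as in the batch effect scenario.

    Samples (rows) are split into three equal groups; each group receives
    one constant drawn from U(100, 300), U(300, 500), or U(500, 700),
    added to every feature. Absolute values change, but within-sample
    rank order is preserved, which is the property RBL relies on.
    """
    rng = np.random.default_rng(seed)
    X = X.copy()
    groups = np.array_split(np.arange(X.shape[0]), 3)
    for rows, (lo, hi) in zip(groups, [(100, 300), (300, 500), (500, 700)]):
        X[rows] += rng.uniform(lo, hi)   # one scalar shift per group
    return X
```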
d. Correlation scenarios
To evaluate model sensitivity to feature correlation, we induced high correlation among truly differential features by adding shared values from N(10,10) to the case group.
Model training and testing in the simulation datasets
For the basic differential, missing, and correlation scenarios, the datasets were split into 70% training and 30% test sets. For the batch effect scenario, models were trained on unshifted data and evaluated on batch-affected data (60 observations each).
Proteomics dataset for detection of newly diagnosed early-stage SCLC
The dataset consisted of plasma samples from 15 newly diagnosed early-stage SCLC cases and 15 matched controls (age, sex, and smoking status) from MD Anderson Cancer Center (MDACC). SCLC cases were obtained from the Genomics Marker-Guided Therapy Initiative (GEM-INI) project, while controls were selected from the Lung Cancer Early Detection Assessment of Risk and Prevention (LEAP) study (IRB protocol 2013-0609). All controls remained cancer-free for at least four years after blood collection. Additional information is provided elsewhere [5].
Proteomics dataset for detection of dpNET in patients with multiple endocrine neoplasia type 1
This proteomics dataset for dpNET detection in patients with Multiple Endocrine Neoplasia type 1 (MEN1) was collected through an international collaboration between MD Anderson Cancer Center (MDACC), the National Institutes of Health (NIH), and the University Medical Center Utrecht (UMCU). All biospecimens and associated retrospectively collected clinical data were approved under MDACC protocol PA19-0498, with a waiver of informed consent. EDTA plasma samples were obtained from 14 MEN1 case subjects presenting with liver metastases from a dpNET and two types of controls: patients with MEN1 and a nonmetastatic (distant or regional) indolent dpNET (n = 28; controls-1) defined by at least 3 years of follow-up after a dpNET diagnosis and imaging taken more than 1 year after blood draw confirming absence of distant or regional metastases and patients with MEN1 without a visible dpNET or other NET (n = 14; controls-2), confirmed negative for dpNETs at blood collection time by either combined conventional and somatostatin-receptor imaging, or conventional imaging taken ≥6 months post-blood draw. Patients were included if they met MEN1 diagnostic criteria: (i) a confirmed germline MEN1 mutation; (ii) one of the three major manifestations (parathyroid, pituitary, dpNET) plus a first-degree family member with a confirmed MEN1 mutation (or if no genetic testing was performed); or (iii) two of the three major manifestations (including a dpNET) plus a first-degree family member meeting the same criteria. Exclusion criteria included any active non-NET malignancy, an active thymus NET or thymoma, rapidly progressive or metastatic lung or gastric NET, and poorly differentiated neuroendocrine carcinoma. Additional information is provided elsewhere [45].
Model training and testing in the real datasets
For the SCLC dataset, the initial split consisted of a development set (D) with eight case-control matched pairs and a test set (T) with seven pairs. The development set was further divided into a development subset with five matched pairs (D1) and a validation subset with three matched pairs (D2). An additional validation set (D3) included three matched pairs randomly sampled from D1.
The MEN1 dataset was randomly split into a development set (D; 70%, n = 39) and a test set (T; 30%, n = 17). The development set was further divided into a training set (D1; n = 27) and a validation set (D2; n = 12). An additional validation set (D3; n = 12) was randomly sampled from D1 (Supplementary Fig. S7).
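The pair-preserving splitting scheme described above can be sketched as follows. This is a minimal illustration, not the study's actual code; the helper name, pair identifiers, and random seeds are hypothetical, and the key point is that each case and its matched control travel together into the same subset.

```python
import random

def split_pairs(pair_ids, n_first, seed=0):
    """Shuffle matched case-control pairs and split them, keeping each
    pair intact so a case and its matched control land in the same set."""
    rng = random.Random(seed)
    ids = list(pair_ids)
    rng.shuffle(ids)
    return ids[:n_first], ids[n_first:]

# SCLC: 15 matched pairs -> development set D (8 pairs) and test set T (7 pairs)
dev, test = split_pairs(range(15), n_first=8)

# D -> development subset D1 (5 pairs) and validation subset D2 (3 pairs)
d1, d2 = split_pairs(dev, n_first=5, seed=1)

# D3: 3 pairs resampled from D1
d3 = random.Random(2).sample(d1, 3)
```

For an unmatched cohort such as the MEN1 dataset, the same idea applies at the subject level rather than the pair level (a plain random 70/30 split of subject identifiers).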
Key Points
High-throughput omics data present major challenges for machine learning due to batch effects, missing values, and technical variability, often leading to reduced predictive accuracy.
We introduce Rank-Based Learning (RBL), a novel method that leverages feature ranking to improve robustness against non-biological variations.
RBL consistently outperforms Logistic Regression and Random Forest in simulations and in two plasma proteomics datasets for early-stage SCLC and dpNET detection.
By improving prediction accuracy and generalizability, RBL offers a promising approach for developing more reliable diagnostic tools from omics data, with greater potential for clinical translation.
Supplementary Material
Acknowledgments
The SCLC proteomics dataset was obtained from the Genomics Marker-Guided Therapy Initiative (GEM-INI) and the Lung Cancer Early Detection Assessment of Risk and Prevention (LEAP) study at MD Anderson Cancer Center (IRB protocol 2013-0609). The MEN1 dpNET dataset was collected through collaborations among MD Anderson Cancer Center, the National Institutes of Health, and the University Medical Center Utrecht (protocol PA19-0498). This work was supported by the National Institutes of Health [U01CA271888 to S.H., RP160693 to K.-A.D., P50CA140388 to K.-A.D. and J.P.L., TR000371 to K.-A.D. and J.P.L.]; and by the philanthropic contributions to The University of Texas MD Anderson Cancer Center Moon Shots Program. We would like to acknowledge the UT Austin/MD Anderson Accelerator Grant for its partial support of this work.
Contributor Information
Lulu Song, Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, TX, USA.
Hamid Khoshfekr Rudsari, Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, TX, USA.
Johannes F Fahrmann, Department of Cancer Prevention, University of Texas MD Anderson Cancer Center, Houston, TX, USA.
Jody Vykoukal, Department of Cancer Prevention, University of Texas MD Anderson Cancer Center, Houston, TX, USA.
Sam Hanash, Department of Cancer Prevention, University of Texas MD Anderson Cancer Center, Houston, TX, USA.
James P Long, Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, TX, USA.
Kim-Anh Do, Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, TX, USA.
Ehsan Irajizad, Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, TX, USA.
Author contributions
Lulu Song and Ehsan Irajizad developed the methodology and implemented the algorithms; Lulu Song performed the data analysis; Lulu Song and Ehsan Irajizad wrote the manuscript; Hamid Khoshfekr Rudsari, Johannes Fahrmann, Jody Vykoukal, and James P. Long contributed to discussions; Sam Hanash and Kim-Anh Do supervised the project. All authors have read and agreed to the published version of the manuscript.
Competing interest statement
The authors declare no competing interests.
Data availability
Code for generating the simulation data in this work is publicly available at https://github.com/lulusong512/Rank-Based-Learning.git. Access to real-world clinical data can be provided upon reasonable request, subject to approval and completion of the required inter-institutional Material Transfer Agreement (MTA).
References
- 1. Vitorino R. Transforming clinical research: The power of high-throughput omics integration. Proteomes 2024;12:25. 10.3390/proteomes12030025
- 2. Dai X, Shen L. Advances and trends in omics technology development. Front Med 2022;9:911861. 10.3389/fmed.2022.911861
- 3. Suravajhala P, Kogelman LJ, Kadarmideen HN. Multi-omic data integration and analysis using systems genomics approaches: Methods and applications in animal production, health and welfare. Genet Sel Evol 2016;48:38. 10.1186/s12711-016-0217-x
- 4. Sanches PHG, de Melo NC, Porcari AM, et al. Integrating molecular perspectives: Strategies for comprehensive multi-omics integrative data analysis and machine learning applications in transcriptomics, proteomics, and metabolomics. Biology (Basel) 2024;13:848. 10.3390/biology13110848
- 5. Fahrmann JF, Katayama H, Irajizad E, et al. Plasma-based protein signatures associated with small cell lung cancer. Cancers (Basel) 2021;13:3972. 10.3390/cancers13163972
- 6. van ’t Veer LJ, Dai H, van de Vijver MJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002;415:530–6. 10.1038/415530a
- 7. Wu J, Chen Z, Xiao S, et al. DeepMoIC: Multi-omics data integration via deep graph convolutional networks for cancer subtype classification. BMC Genomics 2024;25:1209. 10.1186/s12864-024-11112-5
- 8. Cheng L, Huang Q, Zhu Z, et al. MoAGL-SA: A multi-omics adaptive integration method with graph learning and self-attention for cancer subtype classification. BMC Bioinformatics 2024;25:364. 10.1186/s12859-024-05989-y
- 9. Pang J, Liang B, Ding R, et al. A denoised multi-omics integration framework for cancer subtype classification and survival prediction. Brief Bioinform 2023;24:bbad304. 10.1093/bib/bbad304
- 10. Choi H, Pavelka N. When one and one gives more than two: Challenges and opportunities of integrative omics. Front Genet 2012;2:105. 10.3389/fgene.2011.00105
- 11. Gomez-Cabrero D, Abugessaisa I, Maier D, et al. Data integration in the era of omics: Current and future challenges. BMC Syst Biol 2014;8(Suppl 2):I1. 10.1186/1752-0509-8-S2-I1
- 12. Hayes CN, Nakahara H, Ono A, et al. From omics to multi-omics: A review of advantages and tradeoffs. Genes (Basel) 2024;15:1551. 10.3390/genes15121551
- 13. Suravajhala P, Goltsov A. Three grand challenges in high throughput omics technologies. Biomolecules 2022;12:1238. 10.3390/biom12091238
- 14. Chen R, Mias GI, Li-Pook-Than J, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012;148:1293–307. 10.1016/j.cell.2012.02.009
- 15. Wen B, Zeng WF, Liao Y, et al. Deep learning in proteomics. Proteomics 2020;20:e1900335. 10.1002/pmic.201900335
- 16. Yu Y, Mai Y, Zheng Y, et al. Assessing and mitigating batch effects in large-scale omics studies. Genome Biol 2024;25:254. 10.1186/s13059-024-03401-9
- 17. Tong L, Mitchel J, Chatlin K, et al. Deep learning-based feature-level integration of multi-omics data for breast cancer patients survival analysis. BMC Med Inform Decis Mak 2020;20:1–12. 10.1186/s12911-020-01225-8
- 18. Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer; 2009. 10.1007/978-0-387-84858-7
- 19. Mei B, Wang Z. An efficient method to handle the ‘large p, small n’ problem for genomewide association studies using Haseman–Elston regression. J Genet 2016;95:847–52. 10.1007/s12041-016-0705-3
- 20. Liu C, Jiang J, Gu J, et al. High-dimensional omics data analysis using a variable screening protocol with prior knowledge integration (SKI). BMC Syst Biol 2016;10:457–64. 10.1186/s12918-016-0358-0
- 21. Huang Y, Zeng P, Zhong C. Classifying breast cancer subtypes on multi-omics data via sparse canonical correlation analysis and deep learning. BMC Bioinformatics 2024;25:132. 10.1186/s12859-024-05749-y
- 22. Kaddi CD, Wang MD. Models for predicting stage in head and neck squamous cell carcinoma using proteomic and transcriptomic data. IEEE J Biomed Health Inform 2015;21:246–53.
- 23. Saghapour E, Kermani S, Sehhati M. A novel feature ranking method for prediction of cancer stages using proteomics data. PLoS One 2017;12:e0184203. 10.1371/journal.pone.0184203
- 24. Shi Z, Wen B, Gao Q, et al. Feature selection methods for protein biomarker discovery from proteomics or multiomics data. Mol Cell Proteomics 2021;20:100083. 10.1016/j.mcpro.2021.100083
- 25. Bhadra T, Mallik S, Hasan N, et al. Comparison of five supervised feature selection algorithms leading to top features and gene signatures from multi-omics data in cancer. BMC Bioinformatics 2022;23:153. 10.1186/s12859-022-04678-y
- 26. Perez-Riverol Y, Kuhn M, Vizcaíno JA, et al. Accurate and fast feature selection workflow for high-dimensional omics data. PLoS One 2017;12:e0189875. 10.1371/journal.pone.0189875
- 27. Li Y, Mansmann U, Du S, et al. Benchmark study of feature selection strategies for multi-omics data. BMC Bioinformatics 2022;23:412. 10.1186/s12859-022-04962-x
- 28. Yuan L, Sun S, Jiang Y, et al. scRGCL: A cell type annotation method for single-cell RNA-seq data using residual graph convolutional neural network with contrastive learning. Brief Bioinform 2024;26:bbae662. 10.1093/bib/bbae662
- 29. Yuan L, Zhao L, Jiang Y, et al. scMGATGRN: A multiview graph attention network-based method for inferring gene regulatory networks from single-cell transcriptomic data. Brief Bioinform 2024;25:bbae526. 10.1093/bib/bbae526
- 30. Yuan L, Xu Z, Meng B, et al. scAMZI: Attention-based deep autoencoder with zero-inflated layer for clustering scRNA-seq data. BMC Genomics 2025;26:350. 10.1186/s12864-025-11511-2
- 31. Bao W, Liu Y, Chen B. Oral_voting_transfer: Classification of oral microorganisms’ function proteins with voting transfer model. Front Microbiol 2024;14:1277121. 10.3389/fmicb.2023.1277121
- 32. Chen B, Li N, Bao W. CLPr_in_ML: Cleft lip and palate reconstructed features with machine learning. Curr Bioinform 2025;20:179–93. 10.2174/0115748936330499240909082529
- 33. Crossland DL, Denning WL, Ang S, et al. Antitumor activity of CD56-chimeric antigen receptor T cells in neuroblastoma and SCLC models. Oncogene 2018;37:3686–97. 10.1038/s41388-018-0187-2
- 34. Tokuda N, Zhang Q, Yoshida S, et al. Genetic mechanisms for the synthesis of fucosyl GM1 in small cell lung cancer cell lines. Glycobiology 2006;16:916–25. 10.1093/glycob/cwl022
- 35. Huisman DH, Chatterjee D, Svoboda RA, et al. KSR2 promotes self-renewal and clonogenicity of small cell lung carcinoma. Mol Cancer Res 2025;23:640–52.
- 36. Villalobos-Manzo R, Ríos-Castro E, Hernández-Hernández JM, et al. Identification of transferrin receptor 1 (TfR1) overexpressed in lung cancer cells, and internalization of magnetic Au-CoFe2O4 core-shell nanoparticles functionalized with its ligand in a cellular model of small cell lung cancer (SCLC). Pharmaceutics 2022;14:1715. 10.3390/pharmaceutics14081715
- 37. Vandesompele J, De Preter K, Pattyn F, et al. Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biol 2002;3:research0034. 10.1186/gb-2002-3-7-research0034
- 38. Capello M, Bantis LE, Scelo G, et al. Sequential validation of blood-based protein biomarker candidates for early-stage pancreatic cancer. J Natl Cancer Inst 2017;109:djw266. 10.1093/jnci/djw266
- 39. Chen IM, Johansen AZ, Dehlendorff C, et al. Prognostic value of combined detection of serum IL6, YKL-40, and C-reactive protein in patients with unresectable pancreatic cancer. Cancer Epidemiol Biomarkers Prev 2020;29:176–84. 10.1158/1055-9965.EPI-19-0672
- 40. Lazar C, Gatto L, Ferro M, et al. Accounting for the multiple natures of missing values in label-free quantitative proteomics data sets to compare imputation strategies. J Proteome Res 2016;15:1116–25. 10.1021/acs.jproteome.5b00981
- 41. Huang L, Song M, Shen H, et al. Deep learning methods for omics data imputation. Biology (Basel) 2023;12:1313. 10.3390/biology12101313
- 42. Wei R, Wang J, Su M, et al. Missing value imputation approach for mass spectrometry-based metabolomics data. Sci Rep 2018;8:663. 10.1038/s41598-017-19120-0
- 43. Metropolis N, Rosenbluth AW, Rosenbluth MN, et al. Equation of state calculations by fast computing machines. J Chem Phys 1953;21:1087–92. 10.1063/1.1699114
- 44. Breitling R, Armengaud P, Amtmann A, et al. Rank products: A simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett 2004;573:83–92. 10.1016/j.febslet.2004.07.055
- 45. Fahrmann JF, Wasylishen AR, Pieterman CR, et al. Blood-based proteomic signatures associated with MEN1-related duodenopancreatic neuroendocrine tumor progression. J Clin Endocrinol Metab 2023;108:3260–71. 10.1210/clinem/dgad315