Machine Learning-Based Ensemble Feature Selection and Nested Cross-Validation for miRNA Biomarker Discovery in Usher Syndrome

2025 May 8;12(5):497. doi: 10.3390/bioengineering12050497

Algorithm 1: Ensemble feature selection with nested cross-validation

Input: miRNA expression dataset D ∈ ℝ^{N × F}, where N is the number of samples and F is the number of miRNA features.
Output:

• Minimal miRNA feature set F_minimal
• Best-performing model M*
• Mean performance metrics across all validation sets

Step 1. Initialization
F_minimal ← ∅
M* ← None

Let p = N/10 be the number of samples held out in each outer fold (i.e., Leave-6-Out Cross-Validation when N = 60).
Generate N/p = 10 non-overlapping folds {(T_i, V_i)}_{i=1}^{N/p}, where T_i ∈ ℝ^{(N − p) × F} and V_i ∈ ℝ^{p × F}.
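The fold construction in Step 1 can be illustrated with a minimal sketch that uses only the Python standard library. The function name `make_outer_folds`, the random shuffling, and the fixed seed are assumptions made for illustration; the algorithm does not state how samples are assigned to outer folds or whether the outer folds are stratified.

```python
import random

def make_outer_folds(n_samples, holdout_size, seed=0):
    """Partition sample indices into non-overlapping (train, validation) folds.

    Each validation fold holds `holdout_size` samples, and every sample is
    held out exactly once. Assumes n_samples is divisible by holdout_size.
    """
    if n_samples % holdout_size != 0:
        raise ValueError("n_samples must be divisible by holdout_size")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffling is an assumption, not from the source
    folds = []
    for start in range(0, n_samples, holdout_size):
        val = sorted(indices[start:start + holdout_size])
        val_set = set(val)
        train = [i for i in range(n_samples) if i not in val_set]
        folds.append((train, val))
    return folds

N = 60
p = N // 10  # 6 samples held out per outer fold
outer_folds = make_outer_folds(N, p)
print(len(outer_folds), len(outer_folds[0][0]), len(outer_folds[0][1]))  # prints: 10 54 6
```

With N = 60 this yields 10 folds, each with 54 training indices (T_i) and 6 validation indices (V_i). Because this sketch does not stratify by class, a validation fold of 6 samples may contain a single class.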
Step 2. Outer Cross-Validation (Leave-p-Out)
for each outer fold i ∈ {1, 2, …, N/p} do
Let T_i be the outer training set and V_i be the outer validation set
Step 3. Inner Cross-Validation (Stratified k-Fold) and Feature Selection

Split T_i into k stratified folds: {(t_j, v_j)}_{j=1}^{k}

for each j ∈ {1, 2, …, k} do
Feature Selection on t_j:
Apply RFE, Random Forest importance, LASSO, and SelectKBest
Model Training:
Train classifiers (LR, RF, SVM, XGBoost, AdaBoost) on t_j
Model Evaluation:
Evaluate on v_j using Accuracy, Sensitivity, Specificity, F1 Score, and AUC
Model Selection:
Choose best-performing model M_j
Update Feature Set:
Add features to F_minimal if selected in ≥ 3 inner folds
Select the most frequent model across inner folds:
M* = argmax_{M_j} (frequency of selection in inner folds)
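The two aggregation rules at the end of Step 3 (keep features selected in ≥ 3 inner folds, and take the most frequently chosen model as M*) can be sketched as follows. The helper `aggregate_inner_folds`, the feature names, and the per-fold winners are hypothetical placeholders, not values from the study. The sketch assumes each inner fold has already produced one set of selected features; how the four selectors are combined within a fold is not specified in the algorithm and is not modeled here.

```python
from collections import Counter

def aggregate_inner_folds(selected_per_fold, best_model_per_fold, min_folds=3):
    """Combine inner-fold results.

    selected_per_fold: list of sets of feature names, one set per inner fold.
    best_model_per_fold: list of model names (M_j), one per inner fold.
    Returns (sorted features selected in >= min_folds folds, most frequent model).
    """
    # Count in how many inner folds each feature was selected.
    feature_counts = Counter(f for fold in selected_per_fold for f in set(fold))
    stable = sorted(f for f, c in feature_counts.items() if c >= min_folds)
    # Majority vote over the per-fold winners; ties go to the model seen first.
    best_model = Counter(best_model_per_fold).most_common(1)[0][0]
    return stable, best_model

# Hypothetical inner-fold outputs for k = 5 (placeholder names).
selected = [
    {"miR-A", "miR-B", "miR-C"},
    {"miR-A", "miR-C"},
    {"miR-A", "miR-B", "miR-D"},
    {"miR-B", "miR-C"},
    {"miR-A", "miR-E"},
]
winners = ["RF", "LR", "RF", "SVM", "RF"]

f_minimal, m_star = aggregate_inner_folds(selected, winners)
print(f_minimal, m_star)  # prints: ['miR-A', 'miR-B', 'miR-C'] RF
```

The threshold of 3 follows the rule stated in Step 3; the tie-breaking behavior is a property of this sketch only, since the algorithm does not say how ties between models are resolved.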
Step 4. Model Validation on Outer Fold
Use M* and F_minimal to classify V_i
Evaluate performance using Accuracy, Sensitivity, Specificity, F1 Score, and AUC
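For reference, the five metrics named in Step 4 can be computed from predictions on a single validation fold as in the sketch below. It assumes binary labels coded 0/1 and a continuous score per sample for the AUC; the function names and example values are illustrative and not taken from the study. The AUC is computed from its pairwise definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and F1 for labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (e.g. no positives in the fold) are returned as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

def auc_score(y_true, scores):
    """AUC as the fraction of correctly ordered (positive, negative) pairs."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")  # AUC is undefined when only one class is present
    wins = sum(1.0 if ps > ns else 0.5 if ps == ns else 0.0
               for ps in pos for ns in neg)
    return wins / (len(pos) * len(neg))

# Illustrative validation fold of 6 samples (values are made up).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

print(binary_metrics(y_true, y_pred))
print(auc_score(y_true, scores))
```

In this example one positive and one negative sample are misclassified, so accuracy, sensitivity, specificity, and F1 all equal 2/3, and 8 of the 9 positive-negative pairs are correctly ordered, giving an AUC of 8/9. With only 6 samples per validation fold, a fold that contains a single class leaves sensitivity or specificity and the AUC undefined, which the sketch reports as NaN.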
Return:

- F_minimal: final feature set
- M*: best-performing model
- Average performance metrics over all V_i, i = 1, 2, …, N/p