PLOS One. 2025 Nov 3;20(11):e0330050. doi: 10.1371/journal.pone.0330050

Resampling methods for class imbalance in clinical prediction models: A scoping review protocol

Osama Abdelhay 1,*, Adam Shatnawi 2, Hassan Najadat 2, Taghreed Altamimi 3
Editor: Hamed Tavolinejad 4
PMCID: PMC12582444  PMID: 41183062

Abstract

Introduction

Class imbalance—where clinically important “positive” cases make up less than 30% of the dataset—systematically reduces the sensitivity and fairness of medical prediction models. Although data-level techniques, such as random oversampling, random undersampling, SMOTE, and algorithm-level approaches like cost-sensitive learning, are widely used, the empirical evidence on when these corrections improve model performance remains scattered across different diseases and modelling frameworks. This protocol outlines a scoping systematic review with meta-regression that will map and quantitatively summarise 15 years of research on resampling strategies in imbalanced clinical datasets, addressing a key methodological gap in reliable medical AI.

Methods and analysis

We will search MEDLINE, EMBASE, Scopus, Web of Science Core Collection, and IEEE Xplore, along with grey literature sources (medRxiv, arXiv, bioRxiv) for primary studies (2009–31 Dec 2024) that apply at least one resampling or cost-sensitive strategy to binary clinical prediction tasks with a minority-class prevalence of less than 30%. There will be no language restrictions. Two reviewers will screen records, extract data using a piloted form, and document the process in a PRISMA flow diagram. A descriptive synthesis will catalogue the clinical domain, sample size, imbalance ratio, resampling strategy, model type, and performance metrics. Where 10 or more studies report compatible AUCs, a random-effects meta-regression (logit-transformed AUC) will be used to examine the effect of moderators, including imbalance ratio, resampling strategy, model family, and sample size. Small-study effects will be assessed with funnel plots, Egger’s test, trim-and-fill, and weight-function models; influence diagnostics and leave-one-out analyses will evaluate robustness. Since this is a methodological review, formal clinical risk-of-bias tools are optional; instead, design-level screening, influence diagnostics, and sensitivity analyses will enhance transparency.

Discussion

By combining a comprehensive conceptual framework with quantitative estimates, this review aims to determine when data-level versus algorithm-level balancing leads to genuine improvements in discrimination, calibration, and cost-sensitive metrics across various medical fields. The findings will help researchers select concise, evidence-based methods for addressing imbalance, inform journal and regulatory reporting standards, and identify research gaps such as the under-reporting of calibration and misclassification costs, which must be addressed before balanced models can be reliably trusted in clinical practice.

Systematic review registration

INPLASY202550026.

Introduction

Medical prediction datasets are often imbalanced, with the clinically important “positive” class making up less than 30% of observations. This skew systematically biases traditional (e.g., logistic regression) and modern machine-learning classifiers towards the majority class, reducing sensitivity for the minority group [1–3].

To mitigate this threat, a set of data-level resampling strategies (random oversampling (ROS), random undersampling (RUS), and the Synthetic Minority Oversampling Technique (SMOTE)) modifies the training data before modelling [1,4,5]. Although commonly used, ROS can cause overfitting due to duplicate instances, RUS may discard potentially informative data points, and SMOTE or its variants might generate unrealistic synthetic examples [6–9].
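As a minimal sketch of the two simplest data-level strategies named above, the following pure-Python functions illustrate random oversampling and random undersampling. The function names, the toy data (10 positives, 90 negatives), and the choice to balance to a 1:1 ratio are illustrative assumptions rather than part of the protocol. SMOTE is not shown because it additionally requires nearest-neighbour interpolation.

```python
import random
from collections import Counter


def _split_indices(labels, minority):
    """Return (minority_indices, majority_indices) for a binary label list."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx


def random_oversample(rows, labels, minority=1, seed=0):
    """ROS: duplicate randomly chosen minority rows until classes are equal in size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    n_extra = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = majority_idx + minority_idx + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_undersample(rows, labels, minority=1, seed=0):
    """RUS: keep a random majority subset of the same size as the minority class."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    kept_majority = rng.sample(majority_idx, len(minority_idx))
    keep = kept_majority + minority_idx
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Hypothetical training set: 10% minority prevalence.
rows = [[i] for i in range(100)]
labels = [1] * 10 + [0] * 90

ros_rows, ros_labels = random_oversample(rows, labels)
rus_rows, rus_labels = random_undersample(rows, labels)
ros_counts = Counter(ros_labels)  # duplicates inflate the minority class
rus_counts = Counter(rus_labels)  # majority rows are discarded
```

The sketch makes the trade-offs in the paragraph concrete: ROS reaches balance only by repeating the same 10 positive rows, while RUS discards 80 of the 90 negative rows. Resampling of this kind is normally applied to the training split only, so that validation and test data keep the original class distribution.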

Evidence comparing resampling with alternative strategies remains inconclusive. An extensive systematic review showed no consistent performance advantage of machine learning over logistic regression when event-per-variable ratios were adequate [10]. Furthermore, simulation and empirical studies suggest that effective sample size planning, rather than aggressive post-hoc balancing, often negates the need for resampling [11–16].

At the algorithm level, cost-sensitive learning directly penalises errors in the minority class and can outperform methods that operate at the data level; however, it is infrequently reported in medical AI research [4,17].
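To illustrate what "directly penalises errors in the minority class" can mean in practice, here is a hedged sketch of a class-weighted log loss. The inverse-frequency ("balanced") weighting rule, the function names, and the toy data are assumptions chosen for illustration; the protocol does not prescribe a particular weighting scheme.

```python
import math


def balanced_class_weights(labels):
    """Inverse-frequency weights: n / (2 * n_class) for a binary outcome (assumed rule)."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return {0: n / (2 * n_neg), 1: n / (2 * n_pos)}


def weighted_log_loss(labels, probs, weights):
    """Weighted mean of the per-observation log loss."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum


# Hypothetical data: 5% prevalence, and a model that predicts the base rate for everyone.
labels = [1] * 5 + [0] * 95
probs = [0.05] * 100

weights = balanced_class_weights(labels)
unweighted = weighted_log_loss(labels, probs, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, probs, weights)
```

Under the balanced weights, the uninformative base-rate model is penalised far more heavily than under the unweighted loss, because each missed positive counts as much as many negatives. No training data are altered, which is the key contrast with the data-level methods.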

Developments in binary classification theory, from early statistical formulations to perceptrons, support vector machines, and boosted ensembles, highlight how model choice interacts with class distribution and cost structure [18–21].

Across clinical class-imbalance settings, approaches span data-level resampling (random over/undersampling; SMOTE and variants), algorithm-level/cost-sensitive learning, and increasingly ensembles/transfer learning. Resampling can be helpful when minority events are scarce, but it may induce boundary distortion/overfitting. Cost-sensitive methods align optimisation with misclassification costs. Ensembles and transfer learning can improve robustness but add complexity and computational demands. Given heterogeneity in prevalence, thresholds, and reporting, we restrict quantitative pooling to ROC-AUC and synthesise PR-AUC, MCC, F1, calibration, and decision-analytic measures descriptively, with PR-AUC/MCC receiving greater interpretive weight under skew. A fuller comparative appraisal (pros/cons and clinical suitability) is deferred to the results paper, consistent with protocol scope [22–24].

In this context, we will undertake a scoping systematic review with meta-regression to (i) map the resampling and cost-sensitive strategies employed in imbalanced medical datasets, (ii) quantify their effects on discrimination and calibration, and (iii) identify methodological moderators and research gaps. This protocol outlines the intended methods.

Objectives

Primary objective.

This study aims to assess whether, in clinical prediction studies with binary outcomes and a minority-class prevalence below 30%, applying data-level resampling or algorithm-level cost-sensitive strategies significantly improves model performance compared to training on the original imbalanced data.

Specific objectives.

  1. Evidence mapping – Catalogue the complete range of imbalance correction strategies, including oversampling, undersampling, hybrids, and weighted or focal-loss models, reported between 2009 and 2024. Also include the clinical domains, dataset sizes, imbalance ratios, and modelling frameworks for these strategies.

  2. Comparative effectiveness – Quantify and compare discrimination metrics (e.g., AUC, sensitivity, specificity) and, where available, the calibration metrics achieved by

    ◦ oversampling,

    ◦ undersampling,

    ◦ hybrid pipelines, and

    ◦ cost-sensitive algorithms,

against models trained without any balancing.

  3. Moderator analysis – Using mixed-effects meta-regression, evaluate how study-level characteristics (imbalance ratio, sample size, number of predictors, model family, and clinical domain) influence the effectiveness of each imbalance-correction strategy.

  4. Bias and robustness – Examine small-study effects, publication bias, and influential outliers through funnel-plot diagnostics, trim-and-fill, weight-function models, and leave-one-out analyses, and assess how these factors affect pooled estimates.

  5. Methodological gap identification – Highlight recurring pitfalls, such as neglecting calibration, misclassification costs, or external validation, and develop evidence-based recommendations for future research and reporting.

We hypothesise: (H1) conditional on adequate sample size, resampling strategies (over/under/hybrid/SMOTE-type) do not improve predictive performance over no resampling in imbalanced binary clinical prediction tasks; (H2) cost-sensitive methods outperform pure over/undersampling at IR < 10%; (H3) hybrid (resampling+algorithmic) methods outperform single-strategy approaches; (H4) external validation yields lower AUC than internal validation; (H5) studies reporting calibration perform better on net benefit, where available (exploratory if data are sparse).

Covariates: imbalance ratio, sample size, validation tier, clinical domain, leakage safeguards.

Methods

This protocol adheres to the PRISMA-P (S3 File) [25] and PRISMA-ScR [26] guidelines and has been registered with INPLASY (ID: INPLASY202550026) (S1 File). Amendments (e.g., eligibility or analysis changes) will be logged with date, rationale, and impacted sections in a public registry (INPLASY/OSF) and cited in the final report.

Eligibility criteria (PICOTS)

  • Population: Clinical prediction studies that analyse binary outcomes with an explicit minority-class prevalence of less than 30%. For this review, a binary outcome is limited to diagnostic, prognostic, or treatment-response predictions in which the dependent variable has exactly two mutually exclusive states (e.g., disease present/absent).

  • Interventions: Data-level resampling (random oversampling, random undersampling, SMOTE or variants, hybrid pipelines) and algorithm-level cost-sensitive strategies (weighted losses, focal loss).

  • Comparators: Models trained on the original imbalanced data and/or alternative resampling or weighting strategies.

  • Outcomes: Primary—AUC; secondary—sensitivity, F1-score, specificity, balanced accuracy, calibration metrics, and reported mis-classification costs.

  • Timing: Publications from 1 Jan 2009–31 Dec 2024.

  • Study design: Retrospective or prospective primary studies (such as model-development and validation papers) and systematic reviews that reanalyse primary data. Simulation-only papers, non-binary tasks, and abstracts that lack methods are excluded. Studies focusing solely on radiomics, image-segmentation pipelines, or pixel-level classification tasks will also be excluded, as these do not produce patient-level binary predictions.

  • Scope exclusion (imaging segmentation/radiomics): We exclude pixel/voxel-level segmentation and radiomics tasks because they optimise dense, pixel-level predictions and are evaluated with overlap/shape metrics (e.g., Dice/Jaccard/Hausdorff), which are not commensurable with patient-level clinical prediction (e.g., ROC-AUC, PR-AUC, calibration) that is the focus of this review. Including segmentation would mix fundamentally different targets, class-imbalance structures, and metrics; therefore, such studies are out of scope. [27,28]

Information sources and search strategy

Searches will be conducted in MEDLINE (PubMed), EMBASE, Scopus, Web of Science Core Collection, and IEEE Xplore. A peer-reviewed strategy combines controlled vocabulary and free-text terms to address class imbalance, resampling, and clinical prediction; an example MEDLINE string is provided in the S2 File. No language limits will be applied, but non-English full texts must be translatable.

Grey literature (medRxiv/arXiv/bioRxiv/GitHub): We include medRxiv, arXiv, bioRxiv, and GitHub to (i) reduce publication bias/small-study effects by capturing studies not yet in indexed journals, as recommended by major evidence-synthesis guidance, and (ii) map rapidly evolving ML methods whose earliest public disclosure is often via preprints/code releases. To mitigate risks (variable peer review/reporting quality), we apply minimum reporting standards (TRIPOD+AI-aligned task clarity, data splits/leakage safeguards, model specification, performance reporting, and reproducibility) and versioning (latest preprint version; tagged GitHub commit). We will (a) label preprints/code-only sources explicitly, (b) exclude records failing minimum standards from the synthesis (retaining them in the PRISMA flow), and (c) run sensitivity analyses that exclude grey-literature records to assess their influence on conclusions. This approach is consistent with PRISMA/PRISMA-ScR, which aim to map evidence comprehensively while reporting quality transparently.

We will screen these sources, but include a record in synthesis only if the minimum reporting is met:

  1. Predictive task clarity (target population, outcome definition, prediction horizon);

  2. Data & split transparency (source, inclusion/exclusion, train/validation/test strategy; leakage safeguards);

  3. Model specification (algorithms, hyperparameters, resampling/cost strategies);

  4. Performance reporting aligned with TRIPOD+AI (discrimination; threshold-dependent metrics when used; calibration if available) and, for imaging-AI studies, CLAIM elements as applicable;

  5. Reproducibility (accessible code or sufficient procedural detail to replicate).

Records failing these are catalogued but excluded from synthesis (retained in PRISMA flow). [29,30]

Version control for preprints/GitHub: For preprints, we use the latest version at extraction. For GitHub, we require a tagged release/commit hash to ensure reproducibility. Reporting items and reproducibility checks for grey-literature records are aligned with TRIPOD+AI; where LLM-based prediction studies appear, TRIPOD-LLM items will be consulted when applicable. [31]

Study selection

Search results will be imported into Zotero for deduplication [32] and prioritised with ASReview [33]. Two reviewers will independently screen titles and abstracts, followed by full texts, resolving conflicts by consensus or through third-party adjudication. Reasons for exclusion will be recorded and displayed in a PRISMA flow diagram [25]. Data missing from the full text will be requested from authors (two-week window). We will detect duplicate/overlapping cohorts (e.g., preprint→journal of the same dataset) by matching data sources/time windows/outcomes and will retain the most complete, peer-reviewed record; secondary records contribute unique methodological details. A de-duplication table will document decisions. [34]

Data extraction

A standardised, pilot-tested form will record bibliometrics, clinical domain, sample size, imbalance ratio, resampling strategy, model family, performance metrics, calibration statistics, and cost-sensitive measures. Two independent reviewers will extract all items in duplicate into a REDCap database (version 14.0.19). A third reviewer will run the comparison report, resolve discrepancies, and export a single verified dataset. Statistical analyses will be performed in R (v4.4.0) using the metafor (v4.8-0), dplyr (v1.1.4), and ggplot2 (v3.5.2) packages. After publication, all code and a session-info file will be uploaded to the OSF repository.

Outcomes and effect measures

  a. Evaluation Metrics

Why is accuracy insufficient? In imbalanced settings, accuracy can be high while the minority class is poorly detected; we report it only for completeness.

Discrimination. We prioritise ROC-AUC for quantitative pooling due to ubiquity and cross-study comparability, while noting its optimistic behaviour under skewed prevalence. PR-AUC will be emphasised in interpretation because it reflects positive-class performance and is more informative in cases of imbalance. [35]

Threshold-dependent metrics: We will tabulate/visualise F1, sensitivity/specificity, and the Matthews correlation coefficient (MCC); MCC provides a balanced assessment from the full confusion matrix and is often more informative than accuracy/F1 in skewed data. These metrics will not be pooled because thresholds and prevalences vary across studies [36].
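The contrast between accuracy and the threshold-dependent metrics listed above can be sketched from a single confusion matrix. The helper name and the counts below (1,000 patients, 50 true positives in the population) are hypothetical and serve only to illustrate the definitions.

```python
import math


def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity, specificity, F1, and MCC from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + sensitivity:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f1 = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical classifier at 5% prevalence that detects only 10 of 50 positive cases.
metrics = confusion_metrics(tp=10, fp=5, fn=40, tn=945)
```

In this example accuracy exceeds 0.95 although four out of five positive cases are missed, whereas sensitivity (0.20), F1, and MCC all expose the weak minority-class performance.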

Calibration & decision impact: Calibration slope/intercept and Brier score will be summarised descriptively (no pooling). Where authors report decision-curve analysis (net benefit) or explicit misclassification costs, we will extract and summarise without imputing costs; multiple author-reported cost scenarios will be presented as sensitivity analyses. [37–39]
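For readers less familiar with the measures to be extracted, the sketch below shows the standard definitions of the Brier score and of net benefit at a given threshold probability. The review itself will extract author-reported values rather than recompute them; the function names and example numbers are hypothetical.

```python
def brier_score(labels, probs):
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def net_benefit(tp, fp, n, threshold):
    """Net benefit of a model at a threshold probability (decision-curve analysis)."""
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(prevalence, threshold):
    """Net benefit of the reference strategy that treats every patient as positive."""
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical numbers for illustration only.
brier = brier_score([1, 0, 0, 0], [0.8, 0.1, 0.2, 0.3])
nb_model = net_benefit(tp=10, fp=5, n=1000, threshold=0.10)
nb_all = net_benefit_treat_all(prevalence=0.05, threshold=0.10)
```

The threshold enters net benefit as the odds at which a false positive is traded against a true positive, which is why explicit misclassification costs and decision-curve results are extracted together.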

  b. Outcomes & Synthesis

Primary metric & pooling: only ROC-AUC will be meta-analysed (random-effects on logit-AUC). Pooling requires ≥5 clinically comparable studies (same target, prediction horizon, and validation tier). We summarise heterogeneity with τ² and I²; if I² > 75% or subgroups are sparse/incoherent, we will not pool. [40]
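A rough sketch of random-effects pooling on the logit-AUC scale follows. It rests on two loudly stated assumptions that are not part of the protocol: the DerSimonian–Laird moment estimator is used as a simpler stand-in for REML, and the Hanley–McNeil approximation supplies a variance for each AUC when only the AUC and class counts are available. The study values are invented.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def hanley_mcneil_var(auc, n_pos, n_neg):
    """Approximate sampling variance of an AUC (Hanley-McNeil; an assumption here)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    return (auc * (1.0 - auc)
            + (n_pos - 1) * (q1 - auc ** 2)
            + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)


def pool_logit_auc(studies):
    """studies: list of (auc, n_pos, n_neg). Returns pooled AUC, tau^2, and I^2 (%)."""
    y = [logit(a) for a, _, _ in studies]
    # Delta method: var(logit AUC) = var(AUC) / (AUC * (1 - AUC))^2
    v = [hanley_mcneil_var(a, p, n) / (a * (1.0 - a)) ** 2 for a, p, n in studies]
    w = [1.0 / vi for vi in v]
    k = len(y)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # DerSimonian-Laird, not REML
    i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0
    w_re = [1.0 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    return {"pooled_auc": expit(pooled), "tau2": tau2, "i2": i2}


# Five hypothetical studies: (AUC, number of positives, number of negatives).
studies = [(0.78, 40, 360), (0.83, 25, 475), (0.71, 60, 540),
           (0.80, 30, 270), (0.76, 50, 950)]
result = pool_logit_auc(studies)
identical = pool_logit_auc([(0.80, 30, 270)] * 5)
```

The returned I² value is the quantity that the 75% rule above would be checked against; the planned analysis in metafor would replace both the variance approximation and the estimator used in this sketch.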

Interpretive weighting under imbalance: while only ROC-AUC is pooled, PR-AUC and MCC will receive greater interpretive weight in narrative/visual synthesis for imbalanced datasets. [35,36]

When pooling is inappropriate (e.g., if criteria are unmet, such as sparse subgroups, incompatible outcomes, or overlapping cohorts), we will use a structured narrative following SWiM guidance, accompanied by standardised tables/figures. [40]

Risk-of-bias and methodological quality

Although this is a methodological scoping review, we will apply a tailored quality checklist informed by TRIPOD+AI reporting items and PROBAST/PROBAST+AI domains (risk of bias and applicability, with a focus on reproducibility, data-leakage safeguards, validation, and calibration reporting) to describe reporting quality and potential bias [34]. We will apply design-level screening for reproducibility, influence diagnostics (Cook’s distance [41], studentized residuals [42]), and small-study-effect tests (funnel plot [42], Egger’s regression [43], and the Vevea–Hedges weight-function model [44]) to inform sensitivity analyses. We will also record whether studies report blinding, handle missing data, and provide external validation, and we plan to present these elements in a supplementary risk-of-bias table. Results will be summarised narratively (no scoring).

Terminology and bias signals

To avoid ambiguity, we standardise terminology and use “resampling strategy” to denote data-level methods (random over-/undersampling, SMOTE variants, hybrids). We adopt the term “small-study effects” as an umbrella term for patterns whereby smaller studies report larger effects; such patterns can arise from publication bias, outcome-reporting bias, lower study quality, between-study heterogeneity, or chance. We will inspect funnel plot asymmetry and, where feasible, apply Egger’s test as a screening tool. However, we will interpret asymmetry as evidence of small-study effects, rather than publication bias alone, and discuss plausible causes in context. [45,46]

How we’ll report it

Consistent with PRISMA 2020, we will report whether small-study effects were assessed, which methods were used (visual inspection, Egger’s test), and limitations of these tests. We will refrain from formal testing when subgroups contain too few studies (e.g., < 10), and will emphasise qualitative interpretation when power is low.
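The screening logic described above can be sketched as an Egger-type regression of the standardised effect on precision, where an intercept far from zero signals funnel-plot asymmetry. This is a simplified illustration, not the planned metafor analysis: the function name, the default minimum of 10 studies, and the simulated inputs are assumptions, and no p-value is computed.

```python
import math


def egger_intercept(effects, std_errors, min_studies=10):
    """Egger-type regression: (effect / SE) on (1 / SE). Returns None if too few studies."""
    k = len(effects)
    if k < min_studies:
        return None  # refrain from formal testing when power is low
    x = [1.0 / se for se in std_errors]                 # precision
    z = [y / se for y, se in zip(effects, std_errors)]  # standardised effect
    x_bar = sum(x) / k
    z_bar = sum(z) / k
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must vary across studies")
    slope = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / sxx
    intercept = z_bar - slope * x_bar
    residuals = [zi - (intercept + slope * xi) for xi, zi in zip(x, z)]
    s2 = sum(r ** 2 for r in residuals) / (k - 2)
    se_intercept = math.sqrt(s2 * (1.0 / k + x_bar ** 2 / sxx))
    return {"intercept": intercept, "se": se_intercept, "slope": slope}


# Simulated inputs: 12 studies with standard errors from 0.05 to 0.60.
ses = [0.05 * i for i in range(1, 13)]
symmetric = egger_intercept([1.2] * 12, ses)                     # effect unrelated to SE
asymmetric = egger_intercept([1.0 + 2.0 * se for se in ses], ses)  # small studies larger
too_few = egger_intercept([1.2] * 5, ses[:5])
```

In the asymmetric case the intercept recovers the simulated dependence of effect size on standard error, while the symmetric case yields an intercept of essentially zero. As the text notes, a non-zero intercept indicates small-study effects in general and does not by itself identify publication bias.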

Data synthesis

Phase 1 – Descriptive mapping: Tables and visualisations (e.g., heat maps, temporal plots) will summarise trends in resampling use, model type, imbalance severity, and performance.

Phase 2 – Quantitative synthesis: Random-effects meta-regression of logit-AUC will examine moderators (imbalance ratio, sample size, resampling strategy, model family). Pooling requires ≥5 clinically coherent studies (same target, horizon, validation tier). The REML estimator and Knapp-Hartung confidence intervals will be employed [42]. Heterogeneity will be assessed using τ² and I² [42]; leave-one-out analyses will be used to test robustness. The analyses will be implemented in R (metafor, dplyr, ggplot2) [42]. If I² is very high (above roughly 75%) or subgroups are sparse/incoherent, we will not pool and will follow SWiM for structured narrative synthesis. [40]
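One conventional way to write the mixed-effects meta-regression described in Phase 2 is given below. The notation is an assumption introduced here for illustration: i indexes studies, x_{ij} are the study-level moderators, v_i is the within-study sampling variance of the logit-AUC, and τ² is the residual between-study variance.

```latex
\operatorname{logit}\bigl(\widehat{\mathrm{AUC}}_i\bigr)
  = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + u_i + \varepsilon_i,
\qquad u_i \sim \mathcal{N}(0, \tau^2),
\qquad \varepsilon_i \sim \mathcal{N}(0, v_i),
\qquad \operatorname{logit}(a) = \ln\frac{a}{1 - a}.
```

Under this formulation, each coefficient β_j describes how a moderator shifts discrimination on the logit scale, and τ² captures the heterogeneity that the moderators leave unexplained.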

Subgroup and sensitivity analyses

Planned subgroup contrasts include oversampling versus undersampling, hybrid versus single-technique pipelines, cost-sensitive versus data-level only, high (>20%) versus very low (<5%) minority prevalence, and deep learning versus traditional models. Sensitivity analyses will exclude studies with high influence, those lacking external validation, and studies without calibration reporting. The imbalance ratio (IR) will be stratified a priori into four bins: very rare (< 5%), rare (5–10%), moderate (10–20%), and mild (20–30%) [6]. If any bin contains fewer than 10 studies, it will be merged with the next wider bin. For meta-regression, these bins will be dummy-coded (reference = mild), and IR will also be modelled as a restricted cubic spline to test linearity. Sensitivity analyses will replicate the model using two dichotomies (< 10% versus ≥ 10%; < 20% versus ≥ 20%).
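The a priori stratification and dummy coding described above might be sketched as follows. The handling of boundary values (for example, assigning exactly 5% to the "rare" bin) and the function names are assumptions, since the protocol does not specify them; the rule for merging sparse bins is not implemented.

```python
BINS = ["very rare", "rare", "moderate", "mild"]
REFERENCE = "mild"  # reference category for dummy coding, as stated in the text


def prevalence_bin(prevalence):
    """Map minority-class prevalence (as a proportion) to one of the four bins."""
    if not 0.0 < prevalence < 0.30:
        raise ValueError("prevalence must lie in (0, 0.30) to be eligible")
    if prevalence < 0.05:
        return "very rare"
    if prevalence < 0.10:  # boundary handling is an assumption
        return "rare"
    if prevalence < 0.20:
        return "moderate"
    return "mild"


def dummy_code(bin_label):
    """Indicator variables for every bin except the reference category."""
    return {b: int(b == bin_label) for b in BINS if b != REFERENCE}


def dichotomise(prevalence, cut):
    """Sensitivity-analysis coding: 1 if prevalence is below the cut, else 0."""
    return int(prevalence < cut)


# Hypothetical study with 7% minority prevalence.
label = prevalence_bin(0.07)
dummies = dummy_code(label)
below_10 = dichotomise(0.07, 0.10)
below_20 = dichotomise(0.07, 0.20)
```

A study in the reference bin receives zeros on all three indicators, so each meta-regression coefficient for a bin is read as a contrast against mild imbalance.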

Living review plan

Given the rapid methodological advances, automated database alerts will rerun the search on an annual basis. New eligible studies will be screened and, where appropriate, integrated into updated meta-analyses, with a version history logged transparently.

Discussion

Class imbalance remains one of the most persistent threats to safe clinical prediction: skewed data encourages algorithms to optimise overall accuracy at the expense of rare—but clinically essential—events. Algorithm-level approaches that embed explicit misclassification penalties can theoretically offset this bias [47,48]. Simultaneously, recent deep learning innovations such as deep belief networks and focal loss functions promise further gains in high-dimensional settings [49,50]. However, the empirical value of these strategies has never been systematically synthesised across the medical spectrum. Our planned scoping review with meta-regression addresses a critical methodological gap.

Anticipated challenges

  • Extreme heterogeneity: Preliminary scoping indicates a broad dispersion in clinical domains, imbalance ratios, sample sizes, and metrics. Even when studies report AUC, converting to a common logit scale may not entirely harmonise differences in test-set construction and cross-validation folds.

  • Inconsistent reporting: fewer than one in ten studies in the initial screening publish calibration indices, and details of cost-sensitive losses are frequently relegated to supplementary code or omitted entirely.

  • Sparse external validation: Most papers evaluate performance using random internal splits; true generalisability remains uncertain.

  • Publication and small-study effects: Funnel plot asymmetry is anticipated, as smaller datasets often utilise aggressive oversampling, which skews apparent discrimination.

  • Metric multiplicity: sensitivity, specificity, F-score, precision-recall AUC, and balanced accuracy are reported idiosyncratically, complicating quantitative synthesis.

Strengths

  • Breadth of evidence: Five bibliographic databases and grey-literature repositories will be searched, covering 15 years of work and yielding an extensive curated corpus of imbalance-related prediction studies.

  • Dual synthesis: A descriptive map is paired with a random-effects meta-regression that explores moderators such as imbalance severity, sample size, and model family, yielding detailed insights not available in narrative reviews.

  • Rigorous bias diagnostics: Influence statistics, funnel-plot tests, trim-and-fill, and Vevea–Hedges models will quantify the robustness of pooled estimates, alleviating the optimism that pervades the model-development literature.

  • Technology-enabled workflows: ML-assisted screening using ASReview accelerates and transparently documents selection decisions [33].

  • Alignment with contemporary guidance: Search, extraction, and reporting follow the PRISMA 2020 extensions to enhance reproducibility and uptake [25].

Limitations

Despite these safeguards, several constraints persist. First, residual heterogeneity is unavoidable; even a comprehensive meta-regression may explain only a modest fraction of the variance observed between studies. Second, using AUC as the primary effect size risks neglecting threshold-dependent performance and real-world applications. Third, cost-sensitive studies might still be too few or too inconsistently reported to allow quantitative pooling, necessitating a descriptive approach that limits formal comparisons with resampling. Fourth, living-review updates will depend on the speed at which newly published work reports compatible statistics, so the review may lag very recent methodological advances.

Potential impact and influence on practice

By determining when and for whom resampling or weighting truly adds value, this review will assist data scientists in avoiding reflexive oversampling, which can distort calibration or encourage overfitting. Evidence that cost-sensitive losses rival data-level balancing could shift practice towards simpler, loss-function-centric pipelines readily available in mainstream frameworks [47–50]. Clinicians and journal editors might use the findings to demand more comprehensive reporting of calibration, confusion matrices, and misclassification costs, thereby accelerating the adoption of emerging AI reporting extensions (e.g., TRIPOD+AI; see also [25]). Regulators may likewise refer to our recommendations when evaluating the fairness of deployed diagnostic or prognostic models.

Future directions

The mapped gaps suggest four priorities:

  1. Prospective, multi-centre cohorts with rare outcomes to test whether cost-sensitive and focal-loss networks outperform oversampling in truly out-of-sample settings.

  2. Standardised reporting templates that mandate disclosure of class distribution, sampling strategy, calibration, and decision-curve analysis; our findings can feed directly into upcoming guideline revisions.

  3. Generative augmentation and domain-adapted GANs: Early evidence (e.g., synthetic EEG and radiology data) hints at privacy-preserving promise but requires rigorous external validation [51].

  4. Continuous evidence surveillance through annual database alerts and semi-automated screening pipelines aligns with the living-review paradigm and ensures the conclusions remain current as new imbalance-handling techniques emerge [25,33].

The planned review will quantify the performance lift (or degradation) attributable to balancing strategies and outline a research agenda for more reproducible, cost-aware, and clinically grounded predictive modelling.

Supporting information

S1 File. INPLASY Protocol.

INPLASY Protocol Registration.

(DOCX)

pone.0330050.s001.docx (47.7KB, docx)
S2 File. Search Queries.

Ready-to-paste search queries with limit (2009–31st of Dec 2024).

(DOCX)

pone.0330050.s002.docx (16.1KB, docx)
S3 File. PRISMA-P Checklist.

PRISMA-P Checklist.

(DOCX)

pone.0330050.s003.docx (32.4KB, docx)

Data Availability

No datasets were generated or analysed during the current study. All relevant data from this study will be made available upon study completion.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1.Mena LJ, Gonzalez JA. Machine learning for imbalanced datasets: Application in medical diagnostic. FLAIRS. 2006. [Google Scholar]
  • 2.Li D-C, Liu C-W, Hu SC. A learning method for the class imbalance problem with medical data sets. Comput Biol Med. 2010;40(5):509–18. doi: 10.1016/j.compbiomed.2010.03.005 [DOI] [PubMed] [Google Scholar]
  • 3.Rahman MM, Davis DN. Addressing the Class Imbalance Problem in Medical Datasets. IJMLC. 2013;:224–8. doi: 10.7763/ijmlc.2013.v3.307 [DOI] [Google Scholar]
  • 4.Mienye ID, Sun Y. Performance analysis of cost-sensitive learning methods with application to imbalanced medical data. Informatics in Medicine Unlocked. 2021;25:100690. doi: 10.1016/j.imu.2021.100690 [DOI] [Google Scholar]
  • 5.Alahmari F. A Comparison of Resampling Techniques for Medical Data Using Machine Learning. J Info Know Mgmt. 2020;19(01):2040016. doi: 10.1142/s021964922040016x [DOI] [Google Scholar]
  • 6.Carvalho M, Pinho AJ, Brás S. Resampling approaches to handle class imbalance: a review from a data perspective. J Big Data. 2025;12(1). doi: 10.1186/s40537-025-01119-4 [DOI] [Google Scholar]
  • 7.Panjainam P, Kanjanawattana S. A Comparison of the Hybrid Resampling Techniques for Imbalanced Medical Data. In: Proceedings of the 2024 7th International Conference on Robot Systems and Applications. 2024. 46–50. 10.1145/3702468.3702477 [DOI]
  • 8.Jo T, Japkowicz N. Class imbalances versus small disjuncts. SIGKDD Explor Newsl. 2004;6(1):40–9. doi: 10.1145/1007730.1007737 [DOI] [Google Scholar]
  • 9.van den Goorbergh R, van Smeden M, Timmerman D, Van Calster B. The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression. J Am Med Inform Assoc. 2022;29(9):1525–34. doi: 10.1093/jamia/ocac093 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12–22. doi: 10.1016/j.jclinepi.2019.02.004 [DOI] [PubMed] [Google Scholar]
  • 11.Demidenko E. Sample size determination for logistic regression revisited. Stat Med. 2007;26(18):3385–97. doi: 10.1002/sim.2771 [DOI] [PubMed] [Google Scholar]
  • 12.Yenipınar A, Koç Ş, Çanga D, Kaya F. Determining sample size in logistic regression with G-Power. Black Sea J Eng Sci. 2019;2(1):16–22. [Google Scholar]
  • 13.Charan J, Kaur R, Bhardwaj P, Singh K, Ambwani SR, Misra S. Sample Size Calculation in Medical Research: A Primer. ANAMS. 2021;57:74–80. doi: 10.1055/s-0040-1722104 [DOI] [Google Scholar]
  • 14.Balki I, Amirabadi A, Levman J, Martel AL, Emersic Z, Meden B, et al. Sample-Size Determination Methodologies for Machine Learning in Medical Imaging Research: A Systematic Review. Can Assoc Radiol J. 2019;70(4):344–53. doi: 10.1016/j.carj.2019.06.002 [DOI] [PubMed] [Google Scholar]
  • 15.Vabalas A, Gowen E, Poliakoff E, Casson AJ. Machine learning algorithm validation with a limited sample size. PLoS One. 2019;14(11):e0224365. doi: 10.1371/journal.pone.0224365 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Figueroa RL, Zeng-Treitler Q, Kandula S, Ngo LH. Predicting sample size required for classification performance. BMC Medical Informatics and Decision Making. 2012;12:1–10.
  • 17.Araf I, Idri A, Chairi I. Cost-sensitive learning for imbalanced medical data: a review. Artif Intell Rev. 2024;57(4). doi: 10.1007/s10462-023-10652-8
  • 18.Cox DR. The Regression Analysis of Binary Sequences. J Royal Statistical Society Series B: Statistical Methodology. 1959;21(1):238–238. doi: 10.1111/j.2517-6161.1959.tb00334.x
  • 19.Block HD. The Perceptron: A Model for Brain Functioning. I. Rev Mod Phys. 1962;34(1):123–35. doi: 10.1103/revmodphys.34.123
  • 20.Stitson M, Weston J, Gammerman A, Vovk V, Vapnik V. Theory of support vector machines. University of London. 1996;117(827):188–91.
  • 21.Hastie T, Tibshirani R, Friedman J. Boosting and additive trees. 2009.
  • 22.Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic Minority Over-sampling Technique. J Artif Intell Res. 2002;16:321–57. doi: 10.1613/jair.953
  • 23.Branco P, Torgo L, Ribeiro RP. A Survey of Predictive Modeling on Imbalanced Domains. ACM Comput Surv. 2016;49(2):1–50. doi: 10.1145/2907070
  • 24.He H, Garcia EA. Learning from Imbalanced Data. IEEE Trans Knowl Data Eng. 2009;21(9):1263–84. doi: 10.1109/tkde.2008.239
  • 25.Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. doi: 10.1136/bmj.n71
  • 26.Tricco AC, Lillie E, Zarin W, O’Brien KK, Colquhoun H, Levac D, et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Ann Intern Med. 2018;169(7):467–73. doi: 10.7326/M18-0850
  • 27.Müller D, Soto-Rey I, Kramer F. Towards a guideline for evaluation metrics in medical image segmentation. BMC Res Notes. 2022;15(1):210. doi: 10.1186/s13104-022-06096-y
  • 28.Reinke A, Tizabi MD, Baumgartner M, Eisenmann M, Heckmann-Nötzel D, Kavur AE, et al. Understanding metric-related pitfalls in image analysis validation. Nat Methods. 2024;21(2):182–94. doi: 10.1038/s41592-023-02150-0
  • 29.Collins GS, Moons KGM, Dhiman P, Riley RD, Beam AL, Van Calster B, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024;385:e078378. doi: 10.1136/bmj-2023-078378
  • 30.Tejani AS, Klontzas ME, Gatti AA, Mongan JT, Moy L, Park SH, et al. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): 2024 Update. Radiol Artif Intell. 2024;6(4):e240300. doi: 10.1148/ryai.240300
  • 31.Gallifant J, Afshar M, Ameen S, Aphinyanaphongs Y, Chen S, Cacciamani G, et al. The TRIPOD-LLM reporting guideline for studies using large language models. Nat Med. 2025;31(1):60–9. doi: 10.1038/s41591-024-03425-5
  • 32.Zotero. 7.0.15 ed. Vienna, VA USA: Corporation for Digital Scholarship. 2025.
  • 33.van de Schoot R, de Bruin J, Schram R, Zahedi P, de Boer J, Weijdema F, et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat Mach Intell. 2021;3(2):125–33. doi: 10.1038/s42256-020-00287-7
  • 34.Wolff RF, Moons KGM, Riley RD, Whiting PF, Westwood M, Collins GS, et al. PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies. Ann Intern Med. 2019;170(1):51–8. doi: 10.7326/M18-1376
  • 35.Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432. doi: 10.1371/journal.pone.0118432
  • 36.Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6. doi: 10.1186/s12864-019-6413-7
  • 37.Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):230. doi: 10.1186/s12916-019-1466-7
  • 38.Binuya MAE, Engelhardt EG, Schats W, Schmidt MK, Steyerberg EW. Methodological guidance for the evaluation and updating of clinical prediction models: a systematic review. BMC Med Res Methodol. 2022;22(1):316. doi: 10.1186/s12874-022-01801-8
  • 39.Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26(6):565–74. doi: 10.1177/0272989X06295361
  • 40.Campbell M, McKenzie JE, Sowden A, Katikireddi SV, Brennan SE, Ellis S, et al. Synthesis without meta-analysis (SWiM) in systematic reviews: reporting guideline. BMJ. 2020;368:l6890. doi: 10.1136/bmj.l6890
  • 41.Cook RD. Detection of influential observation in linear regression. Technometrics. 1977;19(1):15–8.
  • 42.Harrer M, Cuijpers P, Furukawa T, Ebert D. Doing Meta-Analysis with R: A Hands-On Guide. Boca Raton (FL): Chapman & Hall/CRC. 2021.
  • 43.Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315(7109):629–34. doi: 10.1136/bmj.315.7109.629
  • 44.Vevea JL, Hedges LV. A General Linear Model for Estimating Effect Size in the Presence of Publication Bias. Psychometrika. 1995;60(3):419–35. doi: 10.1007/bf02294384
  • 45.Sterne JA, Egger M, Smith GD. Systematic reviews in health care: Investigating and dealing with publication and other biases in meta-analysis. BMJ. 2001;323(7304):101–5. doi: 10.1136/bmj.323.7304.101
  • 46.Page MJ, Moher D, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. BMJ. 2021;372:n160. doi: 10.1136/bmj.n160
  • 47.Elkan C. The foundations of cost-sensitive learning. Int Joint Conference Artificial Intelligence. Lawrence Erlbaum Associates Ltd. 2001.
  • 48.Zhou Z-H, Liu X-Y. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans Knowl Data Eng. 2006;18(1):63–77. doi: 10.1109/tkde.2006.17
  • 49.Hinton GE, Osindero S, Teh Y-W. A fast learning algorithm for deep belief nets. Neural Comput. 2006;18(7):1527–54. doi: 10.1162/neco.2006.18.7.1527
  • 50.Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. 2017.
  • 51.Shickel B, Tighe PJ, Bihorac A, Rashidi P. Deep EHR: a survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE J Biomed Health Inform. 2018;22(5):1589–604. doi: 10.1109/JBHI.2017.2767063

Decision Letter 0

Hamed Tavolinejad

14 Sep 2025

Dear Dr. Abdelhay,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Oct 29 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Hamed Tavolinejad

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have indicated that there are restrictions to data sharing for this study. For studies involving human research participant data or other sensitive data, we encourage authors to share de-identified or anonymized data. However, when data cannot be publicly shared for ethical reasons, we allow authors to make their data sets available upon request. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

Before we proceed with your manuscript, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., a Research Ethics Committee or Institutional Review Board, etc.). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of recommended repositories, please see https://journals.plos.org/plosone/s/recommended-repositories. You also have the option of uploading the data as Supporting Information files, but we would recommend depositing data directly to a data repository if possible.

Please update your Data Availability statement in the submission form accordingly.

3. When completing the data availability statement of the submission form, you indicated that you will make your data available on acceptance. We strongly recommend all authors decide on a data sharing plan before acceptance, as the process can be lengthy and hold up publication timelines. Please note that, though access restrictions are acceptable now, your entire data will need to be made freely accessible if your manuscript is accepted for publication. This policy applies to all data except where public deposition would breach compliance with the protocol approved by your research ethics board. If you are unable to adhere to our open data policy, please kindly revise your statement to explain your reasoning and we will seek the editor's input on an exemption. Please be assured that, once you have provided your new statement, the assessment of your exemption will not hold up the peer review process.

4. If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does the manuscript provide a valid rationale for the proposed study, with clearly identified and justified research questions?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Is the protocol technically sound and planned in a manner that will lead to a meaningful outcome and allow testing the stated hypotheses?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Is the methodology feasible and described in sufficient detail to allow the work to be replicable?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Have the authors described where all data underlying the findings will be made available when the study is complete?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

Please use the space provided to explain your answers to the questions above and, if applicable, provide comments about issues authors must address before this protocol can be accepted for publication. You may also include additional comments for the author, including concerns about research or publication ethics.

You may also provide optional suggestions and comments to authors that they might find helpful in planning their study.

Reviewer #1: This is an exemplary and timely study protocol that addresses the critical methodological challenge of class imbalance in clinical prediction models. The authors have proposed an exceptionally rigorous and transparent plan, adhering to the highest standards for systematic reviews (e.g., PRISMA-P/ScR, INPLASY pre-registration). The comprehensive search strategy and robust plan for a dual descriptive and quantitative synthesis are major strengths.

The protocol is nearly ready for publication. I have only a few minor suggestions to enhance its clarity and precision before it is finalized:

Clarity on Calibration Metrics: The protocol mentions collecting calibration slope and Brier score. To strengthen the pre-specified analysis plan, please consider explicitly stating how these metrics will be synthesized (e.g., will it be a descriptive summary only, or will they be pooled if sufficient homogeneity is found?).

Justification for Scope Exclusion: The exclusion of radiomics and image-segmentation tasks is a sensible scoping decision. A brief justification in the eligibility criteria section (e.g., clarifying that pixel-level imbalance is conceptually different from the patient-level focus of this review) would strengthen the manuscript.

Terminology Consistency: For consistency, I recommend standardizing terms. For example, "resampling class" and "resampling strategy" are both used; standardizing to one term would be ideal. Similarly, please clarify if "small-study effects" and "publication bias" will be reported as distinct concepts or if one term will be used to encompass the analysis.

This is an outstanding protocol for a review that will be a significant contribution to the field of medical AI. I commend the authors on their meticulous work and look forward to seeing the results of the completed study. With the minor clarifications outlined above, I believe this protocol will be ready for publication.

Reviewer #2: Overview:

The manuscript addresses a critical challenge in clinical prediction modeling: managing class imbalance in datasets. This is highly relevant, as rare outcomes are common in healthcare and can lead to biased or underperforming predictive models if not properly addressed. The manuscript provides a clear overview of strategies such as resampling, algorithm-level adjustments, and synthetic data generation.

Strengths of the clinical report:

The topic is timely and clinically important.

Provides a comprehensive review of methods for handling class imbalance.

Well-organized and clearly presented, making complex concepts accessible.

Major Concerns:

Evaluation Metrics: The manuscript should discuss metrics better suited for imbalanced data, such as precision-recall curves, F1 score, and Matthews correlation coefficient, as accuracy alone may be misleading.

Clinical Relevance: Including specific clinical examples or datasets would make the discussion more concrete and applicable to real-world scenarios.

Methodological Transparency: If novel methods or experimental comparisons are presented, more detail on preprocessing, parameter choices, and implementation is needed to allow reproducibility.

Ethical Considerations: Briefly addressing patient privacy, potential biases, and data-sharing limitations would strengthen the manuscript.

Suggestions for Improvement:

Include a summary table of methods with pros, cons, and clinical applicability.

Discuss trade-offs of oversampling, undersampling, and synthetic data approaches, including risks like overfitting.

Consider adding a brief section on emerging approaches such as ensemble learning or transfer learning for imbalanced clinical datasets.

Reviewer #3: Review Comments to the Author (Major Revisions Suggested):

(a) Clarify Handling of Calibration and Cost-Sensitive Metrics

While AUC is the chosen primary outcome, calibration and cost-sensitive performance are critical in imbalanced clinical datasets. At present, you acknowledge under-reporting but do not specify how such incomplete reporting will be addressed in synthesis. Please outline in more detail whether you plan descriptive-only mapping, imputation strategies, or sensitivity analyses when calibration/misclassification costs are missing.

(b) Address Heterogeneity in Meta-Regression

The anticipated heterogeneity across clinical domains, imbalance ratios, and study designs is very high. Although you plan random-effects meta-regression, the current description does not clarify how you will deal with extremely sparse subgroups or high I² values. Please expand on thresholds for deciding when quantitative synthesis is inappropriate, and what fallback narrative synthesis strategy you will adopt.

(c) Define Inclusion/Exclusion for Grey Literature

The search strategy includes medRxiv, arXiv, bioRxiv, and GitHub. These sources often contain incomplete or non–peer reviewed manuscripts. Please provide stricter criteria (e.g., methodological completeness, minimum reporting standards) for including such grey-literature studies in synthesis to ensure transparency and quality.

(d) Clarify Treatment of Duplicates and Overlapping Data

Given multiple publications may analyze overlapping datasets (e.g., the same hospital EHR in preprints and journals), please specify how you will identify and handle duplicate or overlapping cohorts to avoid double counting.

(e) Protocol Transparency and Amendments

Although registered in INPLASY, the manuscript should provide more detail on how protocol amendments (e.g., changing inclusion criteria, new statistical models) will be documented and justified. Please strengthen this section for reproducibility.

(f) Consider Broader Performance Metrics Beyond AUC

AUC is threshold-independent but often criticized in imbalanced data. Since your secondary outcomes include sensitivity, specificity, F1, and calibration, you should justify why AUC was prioritized and describe how threshold-dependent metrics will be weighted in interpretation.

(g) Improve Readability for Non-Methodological Readers

Some sections (especially the statistical analysis plan) are dense and jargon-heavy. Please consider simplifying key explanations or moving highly technical details (e.g., restricted cubic splines, influence diagnostics) to supplementary materials, while keeping the main text more accessible.

(h) Risk of Bias and Quality Assessment

The authors state that study-level bias tools like PROBAST are “optional” because this is a methodological review. However, lack of structured bias assessment is a significant weakness. Even methodological studies can vary in reporting quality, data leakage, or reproducibility. At minimum, they should commit to applying a tailored bias/quality checklist to ensure study validity.

(i) Unclear Hypothesis Formulation

Although the aims are well listed, the protocol does not clearly articulate testable hypotheses (e.g., “cost-sensitive learning performs better than oversampling under X imbalance ratios”). Instead, it states broad objectives. For a study planning meta-regression, more explicit hypotheses would improve methodological coherence.

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy

Reviewer #1: No

Reviewer #2: Yes:  Adekunle Adeoye

Reviewer #3: Yes:  Shake Ibna Abir

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

PLoS One. 2025 Nov 3;20(11):e0330050. doi: 10.1371/journal.pone.0330050.r002

Author response to Decision Letter 1


19 Sep 2025

Re: PONE-D-25-36127 — “Resampling Methods for Class Imbalance in Clinical Prediction Models: A Scoping Review Protocol”

Dear Dr. Tavolinejad and Reviewers,

Thank you for your careful evaluation of our protocol and for the constructive, detailed feedback. We appreciate the opportunity to revise and have addressed all editorial requirements and reviewer comments point-by-point in the sections that follow. We are submitting: (i) a clean revised manuscript, (ii) a tracked-changes version, and (iii) this rebuttal letter.

We aligned the submission with PLOS ONE formatting and file-naming guidelines (title/author/affiliation page and main-body templates) and ensured consistency with the journal’s submission instructions.

We also updated the Data Availability Statement to suit a protocol article (no data are reported at this stage), while reaffirming our plan to share de-identified materials openly when the results are published, in line with PLOS’s data policy.

Key revisions (overview). In response to the reviewers, we:

• Clarified the synthesis plan: meta-analyse ROC-AUC only (random-effects on logit-AUC) under pre-specified coherence/heterogeneity thresholds; when pooling is inappropriate, we follow SWiM for transparent narrative synthesis (with τ², I², and prediction intervals reported).

• Specified handling of calibration and decision-analytic metrics: calibration slope/intercept and Brier score summarised descriptively; no imputation of misclassification costs; decision-curve/net-benefit reported when available.

• Justified metric choices under imbalance: AUC chosen for pooling due to ubiquity/comparability; PR-AUC and MCC elevated in interpretation; threshold-dependent metrics are not pooled.

• Standardised terminology: use “resampling strategy” throughout; interpret funnel-plot asymmetry/Egger’s test as small-study effects rather than publication bias alone.

• Defined scope boundaries: explicitly exclude radiomics/image-segmentation tasks (pixel/voxel-level targets and metrics) to maintain focus on patient-level clinical prediction.

• Strengthened grey-literature policy and justification: include medRxiv/arXiv/bioRxiv/GitHub to reduce bias and map fast-moving ML methods, with minimum reporting standards, versioning, explicit labelling, and leave-out sensitivity analyses. We align reporting with PRISMA-P for protocols and will transparently document any amendments (INPLASY/OSF).

• Quality/bias assessment: commit to a tailored methodological quality checklist (TRIPOD+AI/PROBAST-informed) focusing on reproducibility, leakage safeguards, validation, and calibration reporting.

• Readability: simplified the main text for non-methodological readers and moved technical derivations to the Supplement.

We appreciate the reviewers’ insightful suggestions; we believe these revisions improve the protocol’s clarity, methodological rigour, and alignment with open science and reporting standards (PLOS policy, PRISMA-P, SWiM).

We hope the revised manuscript meets the journal’s expectations and sincerely thank you for your time and consideration.

With best regards,

Osama Abdelhay, on behalf of all authors

Point-by-point response

Reviewer #1

Comment 1: Clarity on Calibration Metrics: The protocol mentions collecting the calibration slope and the Brier score. To strengthen the pre-specified analysis plan, please consider explicitly stating how these metrics will be synthesised (e.g., will it be a descriptive summary only, or will they be pooled if sufficient homogeneity is found?).

Response: Thank you. We now state that calibration metrics will be treated descriptively only, using narrative and visual synthesis. Pooling is limited to ROC-AUC under pre-specified criteria, and narrative synthesis follows SWiM when pooling is inappropriate.[1]

Comment 2: Justification for Scope Exclusion: The exclusion of radiomics and image-segmentation tasks is a sensible scoping decision. A brief justification in the eligibility criteria section (e.g., clarifying that pixel-level imbalance is conceptually different from the patient-level focus of this review) would strengthen the manuscript.

Response: Thank you. We added an explicit justification in the Eligibility criteria, clarifying that segmentation and radiomics are pixel-level tasks evaluated with Dice/Jaccard/Hausdorff, which are not commensurable with our patient-level prediction focus; thus, they are out of scope.

Comment 3: Terminology Consistency: For consistency, I recommend standardising terms. For example, "resampling class" and "resampling strategy" are both used; standardising to one term would be ideal. Similarly, please clarify if "small-study effects" and "publication bias" will be reported as distinct concepts or if one term will be used to encompass the analysis.

Response: Thank you. We harmonised terminology: we now use “resampling strategy” throughout and adopt “small-study effects” as the umbrella term, clarifying its relation to publication bias and other mechanisms. We also specify that funnel-plot asymmetry/Egger’s test will be interpreted as small-study effects, not publication bias alone, per Cochrane/PRISMA guidance.
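For readers less familiar with Egger’s test, the regression behind it can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code planned for the review: the function name `egger_regression` and all numbers are hypothetical. Each study's standardised effect (effect divided by its standard error) is regressed on its precision (1/SE); an intercept far from zero signals funnel-plot asymmetry, which is read as a small-study effect rather than as publication bias alone.

```python
import math


def egger_regression(effects, std_errors):
    """Ordinary least squares of standardised effect (effect/SE) on precision (1/SE).

    Returns (intercept, slope, se_intercept). Needs at least 3 studies.
    """
    precision = [1.0 / se for se in std_errors]
    standardised = [e / se for e, se in zip(effects, std_errors)]
    n = len(effects)
    x_bar = sum(precision) / n
    y_bar = sum(standardised) / n
    sxx = sum((x - x_bar) ** 2 for x in precision)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(precision, standardised))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance with n - 2 degrees of freedom (two fitted parameters).
    residual_ss = sum(
        (y - intercept - slope * x) ** 2 for x, y in zip(precision, standardised)
    )
    sigma2 = residual_ss / (n - 2)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + x_bar ** 2 / sxx))
    return intercept, slope, se_intercept


# Hypothetical logit-AUC estimates and standard errors (illustrative only).
logit_aucs = [1.10, 1.35, 0.95, 1.60, 1.25, 1.80]
ses = [0.10, 0.18, 0.08, 0.30, 0.15, 0.40]
intercept, slope, se_intercept = egger_regression(logit_aucs, ses)
print(f"Egger intercept = {intercept:.3f} (SE {se_intercept:.3f})")
```

In practice the intercept would be compared against a t distribution with n − 2 degrees of freedom; that step is omitted here to keep the sketch dependency-free.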

Reviewer #2

Comment 1: Evaluation Metrics: The manuscript should discuss metrics better suited for imbalanced data, such as precision-recall curves, F1 score, and Matthews correlation coefficient, as accuracy alone may be misleading. Clinical Relevance: Including specific clinical examples or datasets would make the discussion more concrete and applicable to real-world scenarios.

Response: Thank you. We rewrote “Evaluation Metrics” to de-emphasise accuracy and foreground PR-AUC, F1, MCC, and calibration. We restrict pooling to ROC-AUC for comparability and treat other metrics descriptively; PR-AUC and MCC receive greater interpretive weight in imbalanced settings. We added citations (Saito & Rehmsmeier 2015; Chicco & Jurman 2020; Van Calster et al.; Vickers & Elkin).[2, 3]
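The reviewer's point that accuracy alone can mislead is easy to see with a small worked example. The sketch below is illustrative only: the cohort (1,000 patients, 5% prevalence) and both confusion matrices are hypothetical, and `confusion_metrics` is a helper name introduced here. A model that labels every patient negative scores higher on accuracy than a model that actually finds cases, while F1 and MCC rank them the other way round.

```python
import math


def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, F1, and Matthews correlation coefficient from a 2x2 table."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Hypothetical cohort: 1,000 patients, 50 true positives (5% prevalence).
always_negative = confusion_metrics(tp=0, fp=0, fn=50, tn=950)
useful_model = confusion_metrics(tp=30, fp=40, fn=20, tn=910)

for name, m in [("always negative", always_negative), ("useful model", useful_model)]:
    print(f"{name}: accuracy={m['accuracy']:.3f} F1={m['f1']:.3f} MCC={m['mcc']:.3f}")
```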

Comment 2: Methodological Transparency: If novel methods or experimental comparisons are presented, more detail on preprocessing, parameter choices, and implementation is needed to allow reproducibility. Ethical Considerations: Briefly addressing patient privacy, potential biases, and data-sharing limitations would strengthen the manuscript.

Response: Thank you. We added a brief ethics/data-sharing paragraph consistent with journal policy.

Comment 3: Suggestions for Improvement: Include a summary table of methods with pros, cons, and clinical applicability. Discuss trade-offs of oversampling, undersampling, and synthetic data approaches, including risks like overfitting. Consider adding a brief section on emerging approaches, such as ensemble learning or transfer learning, for imbalanced clinical datasets.

Response: Thank you. We appreciate the suggestion. Since this manuscript is a protocol, we avoid including an evaluative table at this stage. Instead, we added a concise Methods landscape paragraph that outlines resampling, cost-sensitive learning, and emerging ensemble/transfer learning approaches, and we clarified our synthesis stance (pool ROC-AUC only; interpret PR-AUC/MCC/calibration/DCA descriptively). A detailed pros and cons comparison will be provided in the results article.

Reviewer #3

Comment 1: Clarify Handling of Calibration and Cost-Sensitive Metrics. While AUC is the primary chosen outcome, calibration and cost-sensitive performance are also critical in imbalanced clinical datasets. At present, you acknowledge under-reporting but do not specify how such incomplete reporting will be addressed in synthesis. Please outline in more detail whether you plan descriptive-only mapping, imputation strategies, or sensitivity analyses when calibration/misclassification costs are missing.

Response: Thank you. We clarified that calibration (slope/intercept, Brier score) and decision-analytic metrics (net benefit/DCA, misclassification costs) will be mapped and summarised descriptively, with no imputation. Where authors report multiple cost scenarios, these will be presented as sensitivity analyses.[4, 5]
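As a reminder of what the decision-analytic metric captures, net benefit at a threshold probability p_t is TP/n − (FP/n) × p_t/(1 − p_t), following Vickers and Elkin. The sketch below is a minimal illustration with hypothetical counts and a hypothetical helper name (`net_benefit`); it is not part of the planned extraction or synthesis.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit at a given threshold probability (0 < threshold < 1)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    odds = threshold / (1.0 - threshold)
    return tp / n - (fp / n) * odds


# Hypothetical cohort of 1,000 patients with 50 events, evaluated at p_t = 0.10.
nb_model = net_benefit(tp=30, fp=40, n=1000, threshold=0.10)
# "Treat all" flags everyone: all 50 events are true positives, 950 are false positives.
nb_treat_all = net_benefit(tp=50, fp=950, n=1000, threshold=0.10)
# "Treat none" has a net benefit of 0 by definition.

print(f"model: {nb_model:.4f}, treat all: {nb_treat_all:.4f}, treat none: 0.0000")
```

A model is only worth using at a given threshold if its net benefit exceeds both the treat-all and treat-none strategies, which is why a single pooled value across studies with different thresholds would be hard to interpret.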

Comment 2: Address Heterogeneity in Meta-Regression. The anticipated heterogeneity across clinical domains, imbalance ratios, and study designs is very high. Although you plan a random-effects meta-regression, the current description does not clarify how you will deal with extremely sparse subgroups or high I² values. Please expand on thresholds for deciding when quantitative synthesis is inappropriate, and what fallback narrative synthesis strategy you will adopt.

Response: Thank you. We added explicit thresholds for deciding when quantitative synthesis is inappropriate, together with a SWiM-based narrative fallback; prediction intervals, τ², and I² will be reported.[1]
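To make the reported quantities concrete, the sketch below pools logit-transformed AUCs under a random-effects model and returns τ², I², a confidence interval, and a prediction interval. It rests on several assumptions that are not taken from the protocol: the DerSimonian-Laird estimator is used for τ² (the review may use another estimator such as REML), the variance of logit(AUC) is obtained by the delta method from a reported standard error, the prediction interval uses a normal quantile instead of the usual t quantile with k − 2 degrees of freedom, and all AUC values are hypothetical.

```python
import math
from statistics import NormalDist


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def pool_logit_auc(aucs, std_errors, level=0.95):
    """Random-effects pooling of AUCs on the logit scale (DerSimonian-Laird)."""
    y = [logit(a) for a in aucs]
    # Delta method: Var[logit(AUC)] ~= SE(AUC)^2 / (AUC * (1 - AUC))^2
    v = [(se / (a * (1.0 - a))) ** 2 for a, se in zip(aucs, std_errors)]
    k = len(y)
    w = [1.0 / vi for vi in v]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0
    w_re = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se_mu = math.sqrt(1.0 / sum(w_re))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    ci = (expit(mu - z * se_mu), expit(mu + z * se_mu))
    # Normal approximation; the conventional interval uses a t quantile (k - 2 df).
    half_width = z * math.sqrt(tau2 + se_mu ** 2)
    pi = (expit(mu - half_width), expit(mu + half_width))
    return {"pooled_auc": expit(mu), "tau2": tau2, "i2": i2, "ci": ci, "pi": pi}


# Hypothetical study-level AUCs and standard errors (illustrative only).
result = pool_logit_auc(
    aucs=[0.72, 0.81, 0.77, 0.85, 0.69],
    std_errors=[0.03, 0.02, 0.04, 0.025, 0.05],
)
print({key: result[key] for key in ("pooled_auc", "tau2", "i2")})
print("95% CI:", result["ci"], "95% PI:", result["pi"])
```

The prediction interval is always at least as wide as the confidence interval because it adds the between-study variance τ² to the uncertainty of the pooled mean.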

Comment 3: Define Inclusion/Exclusion for Grey Literature. The search strategy includes medRxiv, arXiv, bioRxiv, and GitHub. These sources often contain incomplete or non–peer–reviewed manuscripts. Please provide stricter criteria (e.g., methodological completeness, minimum reporting standards) for including such grey-literature studies in synthesis to ensure transparency and quality.

Response: Thank you. In addition to specifying minimum reporting criteria and version control, we now justify grey-literature inclusion as follows: (i) it reduces publication bias and enables a comprehensive map of methods, consistent with Cochrane/PRISMA guidance; (ii) it is particularly important for fast-moving ML where preprints/code are primary disclosures. We will label all grey-literature records, exclude those failing minimum standards from synthesis (kept in PRISMA flow), and perform leave-out sensitivity analyses to assess their impact. Our extraction is aligned with TRIPOD+AI, and TRIPOD-LLM will be consulted where relevant.

Comment 4: Clarify Treatment of Duplicates and Overlapping Data. Given that multiple publications may analyse overlapping datasets (e.g., the same hospital EHR in preprints and journals), please specify how you will identify and handle duplicate or overlapping cohorts to avoid double-counting.

Response: Thank you. We added a de-duplication procedure for overlapping cohorts. This procedure is documented in the study selection section.

Comment 5: Protocol Transparency and Amendments. Although registered in INPLASY, the manuscript should provide more detail on how protocol amendments (e.g., changing inclusion criteria, new statistical models) will be documented and justified. Please strengthen this section for reproducibility.

Response: Thank you. Added a date-stamped amendment policy in the methods section per PRISMA-P guidance.

Comment 6: Consider Broader Performance Metrics Beyond AUC. AUC is threshold-independent but often criticised in imbalanced data. Since your secondary outcomes include sensitivity, specificity, F1, and calibration, you should justify why AUC was prioritised and describe how threshold-dependent metrics will be weighted in interpretation.

Response: Thank you. Justified ROC-AUC as the sole pooled metric (ubiquity/comparability) while elevating PR-AUC and MCC in interpretation for imbalanced data; threshold-dependent metrics are not pooled due to non-commensurable thresholds/prevalence.
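
To make the point about threshold-dependent metrics concrete, here is a small sketch that derives accuracy, sensitivity, specificity, precision, and MCC from a single confusion matrix. The function name `confusion_metrics` and the counts are hypothetical and do not come from the protocol. The formulas are the standard definitions.

```python
import math


def confusion_metrics(tp, fn, fp, tn):
    """Return threshold-dependent metrics from one 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")

    # Matthews correlation coefficient; conventionally 0 when a margin is empty
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "mcc": mcc,
    }


# Hypothetical counts at 5% prevalence (50 positives among 1,000 patients)
metrics = confusion_metrics(tp=5, fn=45, fp=10, tn=940)
print(metrics)
```

In this hypothetical example accuracy is 0.945 although sensitivity is only 0.10, and MCC is about 0.16. All of these values shift with the chosen decision threshold and with prevalence, which is why such metrics are hard to pool across studies.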

Comment 7: Improve Readability for Non-Methodological Readers. Some sections (especially the statistical analysis plan) are dense and jargon-heavy. Please consider simplifying key explanations or moving highly technical details (e.g., restricted cubic splines, influence diagnostics) to supplementary materials, while keeping the main text more accessible.

Response: Thank you. We have simplified the main text, particularly the statistical analysis plan, as far as possible to accommodate non-methodological readers.

Comment 8: Risk of Bias and Quality Assessment. The authors state that study-level bias tools like PROBAST are “optional” because this is a methodological review. However, the lack of structured bias assessment is a significant weakness. Even methodological studies can vary in reporting quality, data leakage, or reproducibility. At a minimum, they should commit to applying a tailored bias/quality checklist to ensure study validity.

Response: Thank you. Committed to a structured quality/bias description using TRIPOD+AI and PROBAST(+AI). This commitment is documented in the Risk-of-bias section.

Comment 9: Unclear Hypothesis Formulation. Although the aims are well listed, the protocol does not clearly articulate testable hypotheses (e.g., “cost-sensitive learning performs better than oversampling under X imbalance ratios”). Instead, it states broad objectives. For a study planning meta-regression, more explicit hypotheses would improve methodological coherence.

Response: Thank you. Added explicit, testable hypotheses with planned covariates at the end of the objectives section.

References

1. Campbell M, McKenzie JE, Sowden A, Katikireddi SV, Brennan SE, Ellis S, et al. Synthesis without meta-analysis (SWiM) in systematic reviews: reporting guideline. BMJ. 2020;368:l6890.

2. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6.

3. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432.

4. Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Bossuyt P, et al. Calibration: the Achilles heel of predictive analytics. BMC Medicine. 2019;17(1):230.

5. Vickers AJ, van Calster B, Steyerberg EW. A simple, step-by-step guide to interpreting decision curve analysis. Diagnostic and Prognostic Research. 2019;3(1):18.

Attachment

Submitted filename: Response to Reviewers.docx

pone.0330050.s004.docx (29.2KB, docx)

Decision Letter 1

Hamed Tavolinejad

12 Oct 2025

Resampling Methods for Class Imbalance in Clinical Prediction Models: A Scoping Review Protocol

PONE-D-25-36127R1

Dear Dr. Abdelhay,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Hamed Tavolinejad, MD

Academic Editor

PLOS ONE

Acceptance letter

Hamed Tavolinejad

PONE-D-25-36127R1

PLOS ONE

Dear Dr. Abdelhay,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Hamed Tavolinejad

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File. INPLASY Protocol.

    INPLASY Protocol Registration.

    (DOCX)

    pone.0330050.s001.docx (47.7KB, docx)
    S2 File. Search Queries.

    Ready-to-paste search queries with date limits (2009–31 Dec 2024).

    (DOCX)

    pone.0330050.s002.docx (16.1KB, docx)
    S3 File. PRISMA-P Checklist.

    PRISMA-P Checklist.

    (DOCX)

    pone.0330050.s003.docx (32.4KB, docx)

    Data Availability Statement

    No datasets were generated or analysed during the current study. All relevant data from this study will be made available upon study completion.


    Articles from PLOS One are provided here courtesy of PLOS
