Abstract
K-fold cross-validation is a widely used technique for estimating the generalisation performance of supervised machine learning models. However, the effect of the number of folds (k) on bias–variance behaviour across models and datasets is not fully understood. This study examines how varying k, from 3 to 20, relates to estimates of bias and variance across four classification algorithms, evaluated on twelve datasets of varying sizes. These four algorithms are Support Vector Machine (SVM), Decision Tree (DT), Logistic Regression (LR), and k-Nearest Neighbours (KNN). We operationalise bias as the difference between the mean cross-validated training accuracy and the held-out test accuracy, and variance as the variability of accuracy across folds. Across all algorithms and datasets considered, variance increased as k grew, indicating that larger k values can yield less stable fold-to-fold estimates in our setting. Bias trends were algorithm- and dataset-dependent: KNN and SVM most frequently showed upward bias with increasing k, whereas DT was comparatively balanced, and LR showed mixed patterns. These findings, while limited to the models, metrics, and datasets studied, suggest that default choices of fixed k (e.g., 5 or 10) may not be universally optimal. We provide code and data preprocessing scripts to enable full replication and encourage further investigation into adaptive, model- and data-sensitive validation strategies.
Keywords: k-fold cross-validation, bias–variance trade-off, supervised machine learning
Subject terms: Applied mathematics, Computational science, Statistics
Introduction
Machine learning (ML) is a branch of computer science and a subset of artificial intelligence that aims to learn patterns from data to analyse, draw inferences, and improve performance at various tasks1. Since its inception, ML has revolutionised the health, finance, and manufacturing fields by enabling more efficient data analysis, predictive modelling, and automated decision-making2–5. A crucial step in building an ML model is ensuring a reliable and robust validation process. The validation process evaluates the performance of a model and helps estimate how well a trained model will generalise to unseen data. Several validation processes exist, such as bootstrapping, jackknifing, and cross-validation6–8, which are commonly used for non-time-series data. For time-series data, specialised validation methods like walk-forward or blocked cross-validation are typically employed9. Despite these validation processes, ML models still face challenges such as overfitting, underfitting, high computational cost, and high bias and variance.
Among the validation strategies, k-fold cross-validation is one of the most widely adopted methods for estimating predictive accuracy10. Despite its popularity, the influence of the number of folds (k) on bias and variance remains insufficiently characterised in practice. This gap is critical because bias–variance dynamics directly affect model selection and decision-making in high-stakes domains such as infrastructure risk assessment. Existing literature provides limited guidance on this issue11. While common guidance often recommends k values of 5 or 1012, these rules of thumb are largely empirical and may not account for dataset size or algorithmic behaviour. In applied settings, fixed choices of k are therefore sometimes used without assessing their impact on stability and generalisation. This work empirically examines how varying k relates to bias and variance across different algorithms and dataset scales.
This study focuses on the cross-validation problem within the broader context of model validation strategies, with particular emphasis on k-fold cross-validation. The primary objective is to investigate how varying the number of folds (k) influences bias and variance in supervised machine learning models. In particular, it aims to address two research questions: (a) How does varying the number of folds (k) affect bias and variance across different supervised learning algorithms? and (b) To what extent does dataset size moderate these effects? To address these questions, we conduct a systematic analysis across four widely used algorithms: Support Vector Machine (SVM), Decision Tree (DT), Logistic Regression (LR), and k-Nearest Neighbours (KNN).
This work makes three contributions. First, it provides a comprehensive empirical analysis of bias–variance behaviour under varying k across four supervised learning algorithms and datasets of diverse sizes. Second, it presents evidence that, in our experiments, increasing k was associated with higher variance, and for some algorithms, higher bias, indicating that larger k does not uniformly improve estimate stability. Third, it offers practical, cautionary guidance for selecting k in real-world workflows, motivating adaptive choices that consider model and data characteristics.
Background
Related work
The k-fold cross-validation technique is a crucial method in supervised ML for estimating performance. Despite its importance, only a few studies have comprehensively investigated how the choice of k affects k-fold cross-validation outcomes across various supervised ML models. A literature review indicates that the most commonly used values of k are either five13–15 or ten16–19. Recent studies have further examined the statistical properties and practical implications of k selection, highlighting its influence on bias–variance dynamics and model stability20–22. These two values are considered to yield the best test error estimates, exhibiting neither high bias nor high variance23. Nevertheless, no official guidelines exist on which value to use or how different values affect the bias and variance of various models and data sizes. This section reviews related work on the effects of the k value on model bias and variance.
Nti et al.23 investigated the optimal value of k in k-fold cross-validation and its impact on the validation performance of several ML models using six distinct k values (3, 5, 7, 10, 15 and 20). This study revealed that gradient boosting and k-nearest neighbour achieved optimal accuracy at k = 7, whereas the optimum for DT was k = 15. LR showed no significant change with varying k values, though k = 3 had slightly higher accuracy. Another study used seven levels of k (2, 5, 10, 20, n−5, n−2 and n−1) in k-fold cross-validation to evaluate the effects on six Bayesian Network models21. Their findings agreed with the consensus, where k = 5 is optimal for a large dataset (n = 5000) and k = 10 is optimal for small samples (n = 50), but uniquely highlighted k = 2 as sufficient for medium-sized datasets (n = 500). Rodriguez et al.24 analysed the statistical properties of bias and variance in error estimation for k-fold cross-validation using Naïve Bayes and k-nearest neighbour algorithms. They used k values of 2, 5, and 10, as well as n, where n represents the sample size, and recommended a k value of 5 or 10 if the aim is to measure prediction error, further reaffirming the consensus. Another study researched a modified approach to k-fold cross-validation, whose result challenged the consensus and demonstrated that the optimal k value for SVM generally lies between 3 and 425. Jung et al.26 further strengthened the consensus on k = 10 in a study assessing multiple artificial neural network models for predicting nitrate loads in river basins, using k values of 3, 5, 7, and 10 to gauge model performance. Their findings favoured k = 10, which gave the best performance and the least biased predictions.
Besides searching for the optimal k value, the current literature examines how k-fold cross-validation impacts bias and variance. Fushiki27 studied the estimation of prediction accuracy in predictive modelling. This study highlighted that while training error is easy to compute, it has a downward bias. In contrast, k-fold cross-validation introduces an upward bias in the error estimate, especially with k values of 5 and 10, and this bias is often too significant to be ignored, potentially leading to overestimation of the prediction error. Yanagihara et al.28 found that while cross-validation is asymptotically unbiased, it still carries a significant bias of order O(1/n), again indicating that k-fold cross-validation introduces an upward bias in prediction error estimation. The authors noted that smaller k values can lead to higher variance in the estimates, as the validation sets are larger and less data is used for training in each fold. They also report that larger k values reduce the variance but increase the bias because each training set is almost the entire dataset, causing the model to be trained on nearly the same data in each fold and underestimating the generalisation error. In another study, Burman29 investigated various cross-validation methods, including k-fold cross-validation, and found that k-fold cross-validation introduces an upward bias that becomes significant when k is small, such as two or three, leading to an overestimation of the prediction error.
In summary, the existing literature offers inconsistent insights into how different k values in k-fold cross-validation impact ML performance. Hence, there is a need for further research on the subject, particularly in understanding how model bias and variance respond when applied to datasets of varying sizes in relation to changes in k values. Doing so would help fill the gap in understanding and pave the way for more accurate and reliable supervised ML models.
k-Fold cross-validation, bias and variance
A crucial phase in building supervised ML models is establishing their validity, which helps assess their predictive capabilities. One of the most popular validation techniques is k-fold cross-validation, an ML approach that evaluates the predictive performance of a model on unseen data30. In this approach, the data sample is shuffled and separated into k equally sized subsets. This is followed by iterative training and validation, with the model being trained and evaluated k times. In each iteration, the model is trained on k−1 subsets and the remaining subset is used for validation, with a different subset serving as the validation set each time. The evaluation score (E), also known as the error rate, is recorded after each iteration. At the end of k iterations, the mean of the error rates is calculated. Figure 1 illustrates the process of k-fold cross-validation.
Fig. 1.
Illustration of the k-fold cross-validation and its underlying process. E stands for evaluation.
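The procedure described above (and in Fig. 1) can be sketched in a few lines with scikit-learn. The synthetic dataset, the logistic-regression model, and the choice of k = 5 are illustrative assumptions, not settings from this study:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Hypothetical data: 300 samples, 8 features (stands in for any tabular dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

k = 5
fold_scores = []
# Shuffle, split into k folds; each fold serves once as the validation set
for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))  # per-fold score (E)

# After k iterations, average the per-fold evaluation scores
mean_score = float(np.mean(fold_scores))
```

Each of the k iterations trains on k−1 folds and evaluates on the held-out fold, so every observation is validated exactly once.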
A key strength of k-fold cross-validation lies in its ability to address the risks of overfitting and underfitting simultaneously, two pervasive challenges in ML model development. Overfitting occurs when the model is excessively complex and fits the training data so well that it memorises the noise in the training data31, which negatively impacts its performance when tested on an unknown dataset. On the other hand, underfitting describes the scenario in which the model lacks sufficient capacity to capture the genuine relationship in the data, thus performing poorly on unseen datasets31.
Variance quantifies the fluctuation of a model’s predictions around the mean value. It is the amount by which the performance of a predictive model varies when trained on different subsets of the training data32. This fluctuation may result from excessive complexity in the learning algorithm or from characteristics of the dataset, such as noise or non-homogeneity. A low variance means the model can consistently provide accurate estimates across different subsets of data from the same distribution. In this context, we calculate variance from the spread of each fold’s accuracy around the mean accuracy across all folds during cross-validation. This quantifies the performance variability resulting from changes to the training data subset.
Bias occurs when an algorithm’s results consistently lean toward or against a specific idea, displaying a systematic inclination in one direction32. A high bias indicates the model’s inability to capture the underlying pattern in the dataset, often due to too simplistic assumptions leading to an underfitting model with a high error rate. Therefore, it is vital to keep the bias and variance as small as possible to minimise errors and improve the accuracy of the underlying algorithms. This can be achieved through techniques such as selecting an appropriate model, using regularisation to prevent overfitting, applying robust cross-validation strategies, and ensuring adequate data quality and preprocessing. Figure 2 illustrates the graphical representation of bias, variance, overfitting, and underfitting. In this study, we estimate bias by computing the difference between the mean training accuracy obtained through cross-validation and the test accuracy measured on unseen data. A significant difference between these values indicates a higher bias and reduced generalisation capability.
Fig. 2.
Graphical illustration of Bias, Variance, Overfitting and Underfitting. In this study, bias is computed as the signed difference between the mean cross-validated training accuracy and the held-out test accuracy, i.e., bias = (mean CV training accuracy) – (test accuracy). Positive values indicate that the cross-validated training performance exceeds the test performance (potential overfitting), whereas negative values indicate the converse. Variance is computed as the variability (across folds) of the validation accuracy within the training subset for a given k.
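The bias and variance estimators defined above can be made concrete in a short sketch; the accuracy values below are hypothetical placeholders, not results from this study:

```python
import statistics

# Hypothetical per-fold validation accuracies from cross-validation on the
# training subset, and a hypothetical accuracy on the held-out 30% test split
fold_accuracies = [0.91, 0.89, 0.93, 0.90, 0.92]
test_accuracy = 0.88

# bias = (mean CV training accuracy) - (test accuracy);
# positive => CV performance exceeds test performance (potential overfitting)
bias = statistics.mean(fold_accuracies) - test_accuracy

# variance = fold-to-fold variability of the validation accuracy
variance = statistics.pvariance(fold_accuracies)
```

With these illustrative numbers, the mean fold accuracy is 0.91, giving a bias of 0.03; the variance summarises how tightly the five fold accuracies cluster around that mean.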
Methodology
The datasets used for this study are open-sourced data downloaded from the UCI Machine Learning Repository33 and Kaggle34. This study analysed 12 datasets with different features, attributes, and size characteristics. The datasets have been further grouped into three sizes (large, medium, and small) to study the impact of data size on bias and variance. We define large datasets as those containing ≥ 40,000 instances, medium-sized datasets as those containing between 10,000 and < 40,000 instances, and small datasets as those containing fewer than 10,000 instances. Table 1 details the 12 datasets used in this study. We performed all preprocessing steps using Python scripts to ensure transparency and reproducibility. The data cleaning process involved removing redundant records (rows with identical values across all attributes) and duplicates (rows with partial overlaps in key features that could bias training). Blank entries refer to empty cells, while NaN values indicate missing numerical data points. Blank and NaN values were handled by either imputing with median values or discarding records when the missingness rate exceeded 20% of the attributes. Categorical variables were converted using RIDIT (relative to an identified distribution) scoring to maintain ordinal relationships35. These steps reduced the number of instances by approximately 3–9% across datasets, while the feature count remained unchanged. Each dataset was first split into 70% for training and 30% for independent testing. For each dataset, the training and test segments were normalised separately on a scale of zero to one to address the scale disparity among features and prevent data leakage, thereby ensuring that no feature disproportionately influences the classification outcome. The training set was then evaluated using k-fold cross-validation (k ranging from 3 to 20) to provide a robust estimate of performance.
This splitting was performed randomly using a fixed random seed to ensure reproducibility. All preprocessing scripts, Python code for modelling ML algorithms and drawing relevant figures are publicly available at https://github.com/ShahadatUddin/Bias-Variance-Trade-Off, enabling full replication of the experiments.
Table 1.
Details of the datasets used in this study.
| Category | ID | Dataset context | Features | Instances | Target variable to model | Source |
|---|---|---|---|---|---|---|
| Large | D1 | Income prediction of adults | 14 | 48,842 | Income | 36 |
| Large | D2 | Cover Type | 54 | 581,012 | Cover Type | 37 |
| Large | D3 | Sepsis Survival | 3 | 110,341 | Alive or dead | 38 |
| Large | D4 | IoT infrastructure | 83 | 123,117 | Attack Type | 39 |
| Medium | D5 | Chess Game Dataset | 16 | 20,058 | Winner | 40 |
| Medium | D6 | Letter Recognition | 16 | 20,000 | Letter | 41 |
| Medium | D7 | MAGIC Gamma Telescope | 10 | 19,020 | Class | 42 |
| Medium | D8 | Stellar Classification | 5 | 39,552 | Target Class | 43 |
| Small | D9 | Heart Disease | 16 | 4238 | Heart Stroke | 44 |
| Small | D10 | Heart Disease | 12 | 1190 | Disease Outcome | 45 |
| Small | D11 | League of Legends Diamond | 38 | 9878 | Blue Wins | 46 |
| Small | D12 | White Wine Quality | 10 | 4898 | Quality | 47 |
To determine the impact of varying k values within k-fold cross-validation, this study employed k values ranging from three to 20. The diagnostic capabilities and performance of the supervised ML models for different k values were assessed using the calculated variance and the bias. For each dataset and algorithm, graphs were constructed to capture the bias versus k values and the variance versus k values. Trend lines have been incorporated into each graph to illustrate the directional trend in bias and variance as the k value increases.
Bias and variance were computed using the chosen performance metrics for each experiment. Each dataset was split once into training (70%) and testing (30%) subsets using a fixed random seed to ensure reproducibility. All k-fold cross-validation procedures were conducted exclusively within the training subset. Variance was assessed as the variability of performance scores across the k validation folds, indicating the sensitivity of the model to different training partitions. Bias was calculated as the difference between the average performance observed during cross-validation and the performance on the independent testing subset, which was never used during training or validation. This approach ensures that variance reflects the stability of model estimates under fold changes, while bias captures the generalisation gap between training and unseen data.
This study assesses bias and variance for each machine learning model using the widely adopted accuracy metric48. Accuracy represents the proportion of correctly classified instances. Evaluating bias and variance with this metric reveals how variations in performance under different cross-validation folds reflect both the stability and the generalisation capability of a model. To preserve the natural distribution of target categories, no artificial resampling was applied; instead, stratified k-fold cross-validation was employed to maintain class proportions across all folds49. Plots that map trends across k use min–max normalisation within each dataset–algorithm configuration solely for visual comparability; all statistical computations were performed on the original (unnormalised) accuracy values.
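Putting the pieces together, the per-k experimental loop might look like the following sketch. The synthetic dataset and the Decision Tree classifier are illustrative assumptions standing in for the study’s datasets and four algorithms:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset; a fixed seed keeps the 70/30 split reproducible
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y)

results = {}
for k in range(3, 21):  # k = 3 to 20, as in the study
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    model = DecisionTreeClassifier(random_state=0)
    # Stratified k-fold CV runs exclusively within the training subset
    fold_acc = cross_val_score(model, X_tr, y_tr, cv=cv, scoring="accuracy")
    # Bias: mean CV accuracy minus accuracy on the untouched test subset
    test_acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    results[k] = {"bias": fold_acc.mean() - test_acc,
                  "variance": fold_acc.var()}  # fold-to-fold variability
```

Plotting `results[k]["bias"]` and `results[k]["variance"]` against k, with a fitted trend line, reproduces the kind of analysis shown in Figs. 3–8.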
Results
Figures 3, 4, 5, 6, 7 and 8 illustrate the relationship between the number of folds (k) and bias or variance across different dataset sizes. All plotted values are normalised to [0, 1] using min–max scaling. The correlation coefficient (r) indicates the direction of the trend: a positive value indicates an upward trend, and a negative value indicates a downward trend. Captions for individual figures focus only on dataset size and metric type. For non-linear patterns or cases with no clear trend, r should be interpreted cautiously, as it primarily captures linear association and may not fully represent complex relationships.
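The trend summary used in each figure can be reproduced in miniature; the variance curve below is a hypothetical illustration, not data from the study:

```python
import numpy as np

ks = np.arange(3, 21)
# Hypothetical metric-versus-k series: a rising trend with small fluctuations
variance_by_k = 0.001 * ks + 1e-4 * np.sin(ks)

# Min-max scaling to [0, 1] for visual comparability across configurations
scaled = (variance_by_k - variance_by_k.min()) / (
    variance_by_k.max() - variance_by_k.min())

# Pearson correlation with k summarises trend direction (r > 0 => upward);
# note r is computed on the original values, matching the study's protocol
r = np.corrcoef(ks, variance_by_k)[0, 1]
```

Because min–max scaling is an increasing linear transform, it leaves the Pearson correlation unchanged, so the plotted (scaled) and computed (unscaled) trends agree in direction.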
Fig. 3.
Trend in bias as the number of folds (k) increases for large datasets.
Fig. 4.
Trend in bias as the number of folds (k) increases for medium datasets.
Fig. 5.
Trend in bias as the number of folds (k) increases for small datasets.
Fig. 6.
Trend in variance as the number of folds (k) increases for large datasets.
Fig. 7.
Trend in variance as the number of folds (k) increases for medium-sized datasets.
Fig. 8.
Trend in variance as the number of folds (k) increases for small datasets.
Influence of k value and dataset size on bias
Large dataset
Figure 3 depicts the bias trends for large datasets as the number of folds (k) increases. A clear distinction emerges across algorithms. LR demonstrates a consistent upward bias across all four large datasets, indicating a heightened risk of overfitting with larger k values. SVM and DT lean upward (three of four datasets each), whereas KNN is evenly split between upward and downward trends, suggesting comparative stability at this scale. Overall, these results indicate that, for large datasets, LR is most prone to bias escalation, and higher k values also warrant caution for SVM and DT. Practitioners should therefore avoid assuming that increasing k uniformly improves generalisation, as the observed trends highlight the need for adaptive validation strategies tailored to model characteristics and dataset scale.
Medium dataset
Figure 4 illustrates bias trends for medium-sized datasets as the number of folds (k) increases. The upward bias pattern is more pronounced for KNN and SVM compared to large datasets, with both models consistently exhibiting upward trends across all medium datasets. This suggests that these algorithms become increasingly sensitive to fold size when the data volume is moderate, potentially amplifying the risks of overfitting. In contrast, DT and LR display more balanced behaviour, with upward trends in two datasets and downward trends in the other two, indicating greater stability under varying k values. These observations highlight that the algorithmic response to cross-validation structure is not uniform and may depend on dataset scale, reinforcing the need for model-specific validation strategies.
Small dataset
Figure 5 presents bias trends for small datasets as the number of folds (k) increases. KNN exhibits a consistent upward bias across all datasets, indicating heightened sensitivity to fold size and a strong tendency toward overfitting when k is large. SVM exhibits a similar pattern for datasets D9 through D11; however, this trend reverses for D12, where a downward bias emerges, indicating dataset-specific variability in its response to the cross-validation structure. In contrast, DT and LR display the opposite behaviour: both demonstrate downward trends for D9 to D11 and an upward trend for D12. These contrasting patterns highlight that bias dynamics under varying k values are strongly influenced by both algorithmic characteristics and dataset properties, underscoring the need for adaptive validation strategies rather than relying on fixed k values.
Influence of k value and dataset size on variance
The relationship between variance and the number of folds in k-fold cross-validation reveals a consistent and robust pattern. Across all datasets and algorithms, variance increases steadily as k grows (Figs. 6, 7 and 8). This escalation suggests that larger k values, although often assumed to improve reliability, can actually introduce greater instability in performance estimates. The underlying reason is that as k increases, validation sets become smaller and less representative, amplifying sensitivity to local data variations and producing more volatile outcomes. Consequently, although higher k values may reduce bias by enlarging training sets, this benefit is offset by the rising variance, which can distort perceptions of model stability. These findings underscore the importance of balancing bias reduction against variance inflation when selecting k values, rather than relying on conventional defaults such as k = 10. In practice, adaptive strategies that account for dataset size and algorithmic characteristics are crucial for avoiding misleading conclusions in model evaluation and selection.
The consistent increase in variance with larger k values has practical implications for model evaluation. While practitioners often select higher k values under the assumption of improved reliability, our findings indicate that this approach can lead to unstable estimates and misleading confidence in model performance. This effect is particularly critical in high-stakes domains, where inflated variance may obscure true generalisation capability. Therefore, rather than defaulting to conventional choices for k values, validation strategies should consider dataset size, algorithm sensitivity, and the trade-off between bias and variance to ensure robust and interpretable results.
Aggregate analysis of Bias–Variance trends
The patterns of bias and variance across SVM, DT, LR, and KNN (Figs. 3, 4, 5, 6, 7 and 8; Table 2) summarise how varying k relates to these quantities in our setting. For bias, KNN and SVM exhibit more frequent upward trends (10:2), LR shows a mixed pattern (7:5), and DT is balanced (6:6). For variance, all algorithms show an upward trend across datasets (12:0), consistent with the "Influence of k value and dataset size on variance" section. These summaries reflect the datasets, metrics, and algorithms considered and do not claim generality beyond this scope.
Table 2.
Summary of observed trends for bias and variance.
| ML model | Bias (Large) | Bias (Medium) | Bias (Small) | Bias (All) | Variance (Large) | Variance (Medium) | Variance (Small) | Variance (All) |
|---|---|---|---|---|---|---|---|---|
| DT | 3:1 | 2:2 | 1:3 | 6:6 | 4:0 | 4:0 | 4:0 | 12:0 |
| KNN | 2:2 | 4:0 | 4:0 | 10:2 | 4:0 | 4:0 | 4:0 | 12:0 |
| LR | 4:0 | 2:2 | 1:3 | 7:5 | 4:0 | 4:0 | 4:0 | 12:0 |
| SVM | 3:1 | 4:0 | 3:1 | 10:2 | 4:0 | 4:0 | 4:0 | 12:0 |

All trends are reported as up:down counts across the datasets in each size group.
These findings suggest that the choice of algorithm, in combination with cross-validation settings, can introduce systematic distortions in error estimation, potentially affecting model reliability and generalisability. Particularly for models like KNN and SVM, the dominance of upward bias under higher k values warrants caution, as it may lead to overly optimistic assessments of model performance. From a practical standpoint, this insight is crucial for both model selection and hyperparameter tuning workflows, particularly in domains that require high confidence in predictive robustness, such as healthcare, finance, or infrastructure risk assessment. Researchers and practitioners should therefore weigh the trade-offs between bias and variance more carefully, considering not just overall error rates but also the stability and directionality of evaluation metrics under different cross-validation regimes.
Discussion
This study presents a comprehensive empirical evaluation of the impact of varying k values in k-fold cross-validation on bias and variance across four widely used supervised learning algorithms (SVM, DT, LR and KNN) using datasets of varying sizes. Contrary to the commonly held assumption that increasing k leads to improved performance estimates, our findings reveal a consistent and significant increase in variance, and to a lesser extent in bias, with higher k values, raising essential questions about default cross-validation practices.
While some studies have explored extreme fold counts, including k = 2 or leave-one-out cross-validation (k = n, where n is the sample size), we deliberately excluded these values due to practical and statistical considerations. For our datasets, which range from 1,190 to 581,012 instances, k = 2 would result in disproportionately large validation sets and relatively small training sets, leading to unstable estimates. Leave-one-out cross-validation, although theoretically appealing, is computationally prohibitive for large datasets and is known to produce high-variance estimates because its training folds are nearly identical. Empirical evidence and best-practice guidelines consistently recommend k ≥ 3 for robust performance estimation, with k = 5 or 10 widely regarded as optimal. Our choice of starting at k = 3 aligns with these recommendations and ensures a balance between computational feasibility and statistical reliability50.
A notable finding is the monotonic increase in variance with larger k values, observed consistently across all datasets and algorithms (Figs. 6, 7 and 8; Table 2). This finding challenges the prevailing notion, well-documented in the literature24,29, that increasing k results in lower variance due to reduced partitioning bias and improved representativeness of the training set. Our results align more closely with recent critiques20, which suggest that beyond a certain point, increasing k can lead to model instability, especially when training folds become nearly identical, and validation sets are too small to yield generalisable estimates. This rising variance may stem from reduced diversity in the validation data and heightened sensitivity to local data variations, both of which diminish the robustness of performance metrics.
The bias behaviour exhibited greater algorithmic heterogeneity. KNN and SVM consistently displayed upward trends in bias across most datasets, especially those of medium size, indicating a tendency toward overfitting as k increases. This finding is somewhat unexpected, as larger k values theoretically reduce bias by enabling models to train on more comprehensive subsets of the data. However, our findings suggest that frequently reusing nearly identical training samples across folds can lead to over-specialisation, thereby inflating training performance without corresponding improvements on unseen test data. This result parallels earlier theoretical concerns noted by Fushiki27 and Burman29, who cautioned against assuming unbiased estimates even with widely adopted k values such as 5 or 10.
In contrast, LR and DT demonstrated a more balanced or dataset-dependent bias trend. DT, in particular, exhibited an almost even distribution of upward and downward bias trends, underscoring its relative robustness to cross-validation structure. LR showed moderate upward bias, though it was less sensitive to changes in k than KNN or SVM. These discrepancies highlight the algorithm-specific nature of bias–variance responses and underscore the need for differentiated cross-validation strategies.
Methodologically, these observations advise against uncritically adopting fixed k (e.g., 5 or 10) without sensitivity checks. In high-stakes applications, relying on a single k may lead to optimistic or unstable estimates. Pragmatically, we recommend reporting sensitivity analyses over a small range of k values (e.g., {3, 5, 10}), and, where feasible, complementing accuracy with additional metrics suited to class imbalance and decision costs (e.g., F1, AUC). Open code and fixed random seeds, as provided here, can facilitate reproducibility and independent stress-testing.
This study considered accuracy as the primary evaluation metric due to its interpretability and widespread acceptance in machine learning research. We acknowledge, however, that accuracy alone may obscure class-level errors in imbalanced datasets. To mitigate this limitation, we employed stratified k-fold cross-validation to preserve the natural distribution of target categories across folds, thereby reducing the impact of imbalance. While incorporating additional metrics such as precision, recall, F1-score, AUC, or Matthews Correlation Coefficient would provide a more comprehensive assessment, doing so would substantially increase the number of figures and complexity of the manuscript, potentially compromising clarity and structure. For these reasons, we retained accuracy as the primary metric and explicitly recognise this as a limitation, positioning a multi-metric evaluation as an important direction for future research.
In sum, this work challenges the default assumptions underlying k-fold cross-validation and highlights the importance of empirically grounded, context-sensitive validation strategies. Rather than assuming that increasing k uniformly improves performance estimation, practitioners should consider the bias–variance dynamics unique to their model–data combination. These findings suggest the need for developing more adaptive cross-validation protocols, potentially informed by pre-validation diagnostics or meta-learning frameworks, to ensure reliable and generalisable performance assessment in machine learning workflows.
This study has several limitations that should be considered when interpreting the findings. We used accuracy as the sole evaluation metric, which may overlook class-level errors in imbalanced datasets. Additionally, our analysis was restricted to four classification algorithms, meaning that results may differ for other model families or regression tasks. The investigation focused on k values between 3 and 20, excluding extreme cases such as k = 2 or leave-one-out, due to practical constraints. Each dataset was evaluated using a single 70/30 train–test split with a fixed seed, which may introduce split-specific effects. Variance estimates and trend patterns were descriptive rather than inferential, and although stratified folds were used, dataset-specific issues and preprocessing choices may have influenced the outcomes. Finally, while trend plots were normalised for visual clarity, all statistical calculations used original accuracy values. These limitations underscore the importance of treating k-selection as a sensitivity parameter and encourage future work that incorporates multiple metrics, resampling schemes, and broader model classes.
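The limitations above refer to a specific protocol: a single seeded 70/30 split, stratified folds on the training portion, and descriptive bias and variance per k. A compact sketch of that per-k computation follows, on synthetic data with an arbitrary classifier (the reading of "cross-validated training accuracy" as the mean fold accuracy on the training portion is our interpretation of the operationalisation stated earlier; the study's actual scripts are in the linked repository).

```python
# Simplified sketch of the per-k bias/variance computation (illustrative;
# dataset, model, and settings are assumptions, not the study's pipeline).
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=7)
# Single 70/30 train-test split with a fixed seed, as in the study.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=7)

for k in range(3, 21):  # k from 3 to 20, the range investigated
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=7)
    fold_acc = cross_val_score(DecisionTreeClassifier(random_state=7),
                               X_tr, y_tr, cv=cv, scoring="accuracy")
    test_acc = DecisionTreeClassifier(random_state=7).fit(X_tr, y_tr).score(X_te, y_te)
    bias = fold_acc.mean() - test_acc  # mean CV accuracy minus held-out accuracy
    variance = fold_acc.std()          # fold-to-fold variability
    print(f"k={k:2d}  bias={bias:+.3f}  variance={variance:.3f}")
```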
Conclusion
This study investigated how varying the number of folds (k) in k-fold cross-validation relates to bias and variance across four standard classifiers and twelve datasets. In our experiments, variance increased as k grew, and bias often increased for SVM and KNN, whereas DT and LR showed more balanced patterns. These results, limited to the studied models, datasets, and the accuracy metric, suggest that fixed choices of k (e.g., 5 or 10) may not be universally optimal. We therefore recommend treating k as a sensitivity parameter and reporting stability across a small range of k values, complemented by metrics beyond accuracy where appropriate. Our open code and preprocessing scripts enable full replication and extension to other models, datasets, and evaluation criteria.
Author contributions
TA: Data analysis and interpretation, Software coding, and Writing. HX: Data analysis and interpretation, Software coding, and Writing. SU: Conception, Data analysis and interpretation, Supervision, Software coding, and Writing.
Funding
This research did not receive any funding from internal or external sources.
Data availability
The datasets analysed during the current study are available in the Kaggle (https://www.kaggle.com/) and UCI Machine Learning (https://archive.ics.uci.edu/) repositories. All preprocessing scripts and the Python code for modelling the ML algorithms and generating the figures are publicly available at https://github.com/ShahadatUddin/Bias-Variance-Trade-Off.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Mitchell, T. M. Machine Learning Vol. 1 (McGraw-Hill, 1997).
- 2. Saeed, A., Husnain, A., Rasool, S., Gill, A. Y. & Amelia, A. Healthcare revolution: how AI and machine learning are changing medicine. Res. Social Sci. Econ. Manage. 3 (3), 824–840 (2023).
- 3. Rajpoot, N. K., Singh, P. D., Pant, B. & Tripathi, V. The future of healthcare: a machine learning revolution. In 2023 International Conference on Artificial Intelligence for Innovations in Healthcare Industries (ICAIIHI). IEEE (2023).
- 4. Jáuregui-Velarde, R., Andrade-Arenas, L., Molina-Velarde, P. & Yactayo-Arias, C. Financial revolution: a systemic analysis of artificial intelligence and machine learning in the banking sector. Int. J. Electr. Comput. Eng. 14 (1) (2024).
- 5. Shang, C. & You, F. Data analytics and machine learning for smart process manufacturing: recent advances and perspectives in the big data era. Engineering 5 (6), 1010–1016 (2019).
- 6. Lillegård, M., Engen, S. & Sæther, B. E. Bootstrap methods for estimating spatial synchrony of fluctuating populations. Oikos 109 (2), 342–350 (2005).
- 7. Shcheglovitova, M. & Anderson, R. P. Estimating optimal complexity for ecological niche models: a jackknife approach for species with small sample sizes. Ecol. Model. 269, 9–17 (2013).
- 8. Arlot, S. & Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010).
- 9. Vamsikrishna, A. & Gijo, E. New techniques to perform cross-validation for time series models. In Operations Research Forum (Springer, 2024).
- 10. Lumumba, V. W., Kiprotich, D., Lemasulani Mpaine, M., Grace Makena, N. & Daniel Kavita, M. Comparative analysis of cross-validation techniques: LOOCV, k-folds cross-validation, and repeated k-folds cross-validation in machine learning models (2024).
- 11. Vasilopoulos, A. & Matthews, G. Cross-validation optimal fold-number for model selection. Am. J. Undergrad. Res. 21, 15–29 (2024).
- 12. Verma, V. K., Saxena, K. & Banodha, U. Analysis effect of k values used in k fold cross validation for enhancing performance of machine learning model with decision tree. In International Advanced Computing Conference (Springer, 2023).
- 13. Tamilarasi, P. & Rani, R. U. Diagnosis of crime rate against women using k-fold cross validation through machine learning. In 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC). IEEE (2020).
- 14. Karal, Ö. Performance comparison of different kernel functions in SVM for different k value in k-fold cross-validation. In 2020 Innovations in Intelligent Systems and Applications Conference (ASYU). IEEE (2020).
- 15. Ghorbani, R. & Ghousi, R. Comparing different resampling methods in predicting students' performance using machine learning techniques. IEEE Access 8, 67899–67911 (2020).
- 16. Nti, I. K., Adekoya, A. F. & Weyori, B. A. A comprehensive evaluation of ensemble learning for stock-market prediction. J. Big Data 7 (1), 20 (2020).
- 17. Tuarob, S., Tucker, C. S., Salathe, M. & Ram, N. An ensemble heterogeneous classification methodology for discovering health-related knowledge in social media messages. J. Biomed. Inform. 49, 255–268 (2014).
- 18. Barstugan, M., Ozkaya, U. & Ozturk, S. Coronavirus (COVID-19) classification using CT images by machine learning methods. arXiv preprint arXiv:2003.09424 (2020).
- 19. Oztekin, A., Kizilaslan, R., Freund, S. & Iseri, A. A data analytic approach to forecasting daily stock returns in an emerging market. Eur. J. Oper. Res. 253 (3), 697–710 (2016).
- 20. Jiang, G. & Wang, W. Error estimation based on variance analysis of k-fold cross-validation. Pattern Recogn. 69, 94–106 (2017).
- 21. Marcot, B. G. & Hanea, A. M. What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis? Comput. Stat. 36 (3), 2009–2031 (2021).
- 22. Teodorescu, V. & Obreja Brașoveanu, L. Assessing the validity of k-fold cross-validation for model selection: evidence from bankruptcy prediction using random forest and XGBoost. Computation 13 (5), 127 (2025).
- 23. Nti, I. K., Nyarko-Boateng, O. & Aning, J. Performance of machine learning algorithms with different K values in K-fold cross-validation. J. Inf. Technol. Comput. Sci. 6, 61–71 (2021).
- 24. Rodriguez, J. D., Perez, A. & Lozano, J. A. Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE Trans. Pattern Anal. Mach. Intell. 32 (3), 569–575 (2009).
- 25. Anguita, D., Ghelardoni, L., Ghio, A., Oneto, L. & Ridella, S. The 'K' in K-fold cross-validation. In ESANN (2012).
- 26. Jung, K. et al. Evaluation of nitrate load estimations using neural networks and canonical correlation analysis with k-fold cross-validation. Sustainability 12 (1), 400 (2020).
- 27. Fushiki, T. Estimation of prediction error by using K-fold cross-validation. Stat. Comput. 21, 137–146 (2011).
- 28. Yanagihara, H., Tonda, T. & Matsumoto, C. Bias correction of cross-validation criterion based on Kullback–Leibler information under a general condition. J. Multivar. Anal. 97 (9), 1965–1975 (2006).
- 29. Burman, P. A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika 76 (3), 503–514 (1989).
- 30. Stone, M. Cross-validatory choice and assessment of statistical predictions. J. Roy. Stat. Soc. Ser. B (Methodol.) 36 (2), 111–133 (1974).
- 31. Bashir, D., Montañez, G. D., Sehra, S., Segura, P. S. & Lauw, J. An information-theoretic perspective on overfitting and underfitting. In AI 2020: Advances in Artificial Intelligence, 33rd Australasian Joint Conference, Canberra, ACT, Australia, November 29–30, 2020 (Springer, 2020).
- 32. Belkin, M., Hsu, D., Ma, S. & Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. Natl. Acad. Sci. 116 (32), 15849–15854 (2019).
- 33. UCI Machine Learning Repository [cited 3 November 2023]. Available from: https://archive.ics.uci.edu/
- 34. Kaggle. Available from: https://www.kaggle.com/datasets
- 35. Bross, I. D. How to use ridit analysis. Biometrics 14, 18–38 (1958).
- 36. Becker, B. & Kohavi, R. Adult (1996). 10.24432/C5XW20
- 37. Blackard, J. Covertype (1998). 10.24432/C50K5N
- 38. Chicco, D. & Jurman, G. Sepsis Survival Minimal Clinical Records (2020). 10.24432/C53C8N
- 39. Sharmila, B. & Nagapadma, R. RT-IoT2022 (2023). 10.24432/C5P338
- 40. Mitchell, J. Chess Game Dataset (2017). https://www.kaggle.com/datasets/datasnaek/chess
- 41. Slate, D. Letter Recognition (1991). 10.24432/C5ZP40
- 42. Bock, R. MAGIC Gamma Telescope (2004). 10.24432/C52C8B
- 43. Soriano, F. Stellar Classification Dataset (2020). https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17/data
- 44. Hasnine, M. Heart Disease Dataset (2023). 10.34740/kaggle/dsv/5146672
- 45. Siddhartha, M. Heart Disease Dataset (2020). https://www.kaggle.com/datasets/sid321axn/heart-statlog-cleveland-hungary-final/code
- 46. Ma, Y. L. League of Legends Diamond Ranked Games (2020). https://www.kaggle.com/datasets/bobbyscience/league-of-legends-diamond-ranked-games-10-min
- 47. Cortez, P., Cerdeira, A. & Almeida, F. White Wine Quality. https://www.kaggle.com/datasets/piyushagni5/white-wine-quality
- 48. Makridakis, S. Accuracy measures: theoretical and practical concerns. Int. J. Forecast. 9 (4), 527–529 (1993).
- 49. Wieczorek, J., Guerin, C. & McMahon, T. K-fold cross-validation for complex sample surveys. Stat 11 (1), e454 (2022).
- 50. Nti, I. K., Nyarko-Boateng, O. & Aning, J. Performance of machine learning algorithms with different K values in K-fold cross-validation. Int. J. Inform. Technol. Comput. Sci. 13 (6), 61–71 (2021).








