Scientific Reports. 2026 Jan 30;16:6766. doi: 10.1038/s41598-026-37957-2

Position prediction from performance and anthropometric indicators in young footballers: a machine learning approach

Zhomart Izhanov 1, Yerlan Seisenbekov 1, Ulbossyn Marchibayeva 2, Zhandos Yessirkepov 3, Sayagul Bakhtiyarova 4, Baglan Yermakhanov 5, Sayat Ryskaliyev 6, Ahmet Kurtoğlu 7, Monira I Aldhahi 8
PMCID: PMC12913652  PMID: 41618009

Abstract

The early and accurate identification of playing position-specific skills in young footballers is of critical importance for both performance development and long-term player planning. In this context, the evaluation of quantitative data obtained from technical tests using analytical methods provides a more objective approach that supports the coach’s intuition. This study aims to predict the playing positions of young footballers by using data obtained from anthropometric and technical performance tests with machine learning (ML) algorithms. This study involved 200 male footballers aged 15–17 who played in different positions (defence = 66, midfield = 67, forward = 67) and were recorded according to the primary tactical role assigned to them by the coach. The participants’ football-specific technical skills (ball control, shooting, dribbling) and anthropometric characteristics (height, weight, age, BMI) were recorded. Their technical and anthropometric characteristics were compared according to their playing positions using an ANOVA test, and the Bonferroni post-hoc test was performed to test for differences between groups. After data pre-processing and standardisation, the model created with the obtained technical and anthropometric parameters was analysed using Support Vector Machines (SVM, RBF kernel), K-Nearest Neighbour (KNN), Logistic Regression (LR) and Gaussian Naive Bayes algorithms. Model performances were compared based on accuracy, precision, sensitivity, and macro F1 scores. ROC curves and confusion matrices were analysed for the model with the highest performance. Furthermore, the technical and anthropometric parameters affecting the highest performance were analysed using the permutation importance method. The results of the one-way ANOVA showed significant differences between playing positions in terms of age, height, BMI, heading, and dribbling performance (p < 0.05). Post-hoc analyses revealed that midfielders were older than defenders. 
Forwards, on the other hand, were both taller and had lower BMI values. Furthermore, forwards demonstrated higher heading performance and achieved better results in dribbling skills compared to defenders. Among the ML models, the highest classification success was achieved with the SVM (RBF kernel) model (accuracy = 86%); the model correctly classified forwards at a rate of 100%, midfielders at 85%, and defenders at 75%. ROC analysis revealed high discriminative power for all playing positions, with AUC values of 1.00 for Forwards, 0.96 for Defenders, and 0.94 for Midfielders. Feature importance analysis revealed that the most influential variables in playing position classification were 20 m dribbling, shooting, body weight, and dribbling; while the head juggling and mixed juggling variables contributed the least to the model. These findings demonstrate that playing position-specific physical and technical characteristics in footballers can be reliably distinguished using both statistical methods and machine learning models, and that performance variables based on speed, finishing ability and physical capacity are particularly decisive in playing position classification.

Keywords: Football, Position prediction, Performance, Machine learning, Artificial intelligence

Subject terms: Computational biology and bioinformatics, Health care, Mathematics and computing

Introduction

Football has the largest audience of any sport in the world and is an economic arena in its own right. For this reason, one of the most important research topics in football is how to achieve a high level of performance and maintain it throughout the season1. Performance analysis in football is therefore a critical process that covers the physical, technical, tactical and psychological performance of players2. Machine learning (ML) and artificial intelligence (AI) can predict player performance trends and analyse training, performance and match schedules by modelling the complex relationships within these comprehensive data.

The increasing use of ML and AI methods in sports analysis has fundamentally transformed how player performance is evaluated, training sessions are planned, and team strategies are developed3–6. These technologies offer a unique capacity to process large data sets, uncover complex patterns and make predictions that would be impossible to obtain using traditional methods7. In football, the ability to analyse player performance8, predict injuries9,10, and optimise team strategies has been significantly enhanced by these technological advances11. In this context, ML functions as a key analytical tool that produces informed predictions by examining relationships within large and complex datasets through various algorithms, thereby enabling coaches and analysts to make evidence-based decisions more effectively.

In recent years, the application of machine learning to football has become a rapidly developing area of research in academic literature. Significant findings have been obtained using various machine learning algorithms in many areas, from predicting match results to objectively evaluating player performance12. For example, Berrar et al. demonstrated the effectiveness of machine learning models in predicting match results using historical match data13, while Memmert et al. used machine learning techniques to identify key factors affecting player performance14. These studies demonstrate that data-driven approaches are playing an increasingly critical role in football and clearly highlight the transformation that machine learning technologies are bringing to the sport. One of the prominent areas of research in recent times is the determination of players’ playing positions on the pitch using machine learning methods. Traditionally, playing position determination relied largely on the experience of coaches and subjective assessments based on observation; however, with the advancement of ML, it has become possible to objectively analyse players’ technical, physical, and performance indicators to predict their most suitable playing positions15. Studies such as Manish et al. and Link et al. have applied machine learning algorithms to categorize players into specific playing positions based on various performance metrics, demonstrating the potential of these methods to improve player development and team-building strategies16,17. Early prediction models, especially for younger age groups, can contribute to athletes gaining more experience in their current playing positions, thereby accelerating their progress to higher levels of performance.

Each playing position in football is known to demand particular anthropometric characteristics. Characteristics such as height, weight, body fat and muscle mass are therefore expected to differ between athletes playing in different positions18. In addition, because the demands of the game differ, the technical, tactical, and physical characteristics of individuals playing in different positions are also expected to differ. For example, strikers are expected to have high agility and speed, while defenders are expected to have good endurance and strength19. However, with increasing competition in football, no significant performance differences between positions have been found in professional teams: running distances, sprint levels and cardiovascular endurance levels are not affected by playing position in elite athletes20. In young age groups, by contrast, assigning a playing position according to the individual’s own characteristics may contribute to professionalisation in that position.

Based on this, the primary objective of the study is to objectively measure playing position-specific technical performance outputs in footballers and to evaluate the extent to which these variables are decisive in playing position prediction. The increasing specificity of player roles in football necessitates that performance be supported not only by intuitive assessments but also by measurable technical and physical indicators. Therefore, the study investigated the extent to which technical performance components closely related to playing position, such as dribbling, 20-metre dribbling, and shooting, are effective in distinguishing players’ roles on the pitch. Furthermore, the aim was to understand how considering anthropometric characteristics and technical skills together could contribute to positional guidance processes in early age groups. Within this scope, the hypotheses tested in the study were determined as follows:

H1a

Differences in anthropometric characteristics among footballers playing in different positions also affect technical characteristics.

H1b

Footballer playing positions can be reliably predicted using ML models based on these variables.

H1c

Certain technical indicators (dribbling, 20 m dribbling, shooting) have higher predictive power than other variables in playing position classification.

Methods

Participants

This study included 200 young male footballers aged between 15 and 17. Taking into account the footballers’ sporting history and training experience, the average training experience of the participants was recorded as 3.56 ± 1.43 years. The participants were selected from young footballers at various football clubs in Kazakhstan. Participants were selected based on the criteria of being within a specific age range (15–17) and having at least 1 year of football history. In addition, all participants were healthy individuals with no chronic injuries or serious health problems. In this context, volunteers aged 15–17 who had been licensed football competitors for at least 1 year were included in the study. Participants with orthopaedic problems that could affect their performance in the last year, heart (tachycardia, hypertension, etc.) and lung problems (COPD, asthma, etc.), and players with any current musculoskeletal injury or with a musculoskeletal injury in the previous six months that resulted in time-loss from training or matches or caused persistent pain/functional limitation were excluded from the study. Furthermore, participants who did not heed the principal investigator’s warnings and did not take the tests seriously were also excluded from the study.

The minimum sample size for this study was calculated using G*Power software 3.1.9.7 (University of Düsseldorf, Düsseldorf, Germany)21. The power analysis used the F-test family (ANOVA: fixed effects, omnibus, one-way). With α error probability = 0.05, minimum effect size = 0.23 and power (1 − β error probability) = 0.80, at least 186 participants were required, giving an actual power of 80.1%21. Participants were randomly selected; the targeted sample size was therefore reached and statistical power was ensured.

Within the scope of this research, the necessary ethics approval was obtained from the Faculty of Physical Education and Basic Military Sciences Ethics Committee (decision number 2024/11). In addition, all participants and their families were informed about the purpose, rationale, and possible contributions of the study to the literature, and consent forms were signed by both participants and their families. The study was conducted in accordance with the principles set out in the Declaration of Helsinki.

Data collection tools

Head and foot juggling test

In this study, foot and head juggling test was applied to evaluate the ball control, balance and coordination skills of the participants. This test is a widely used performance test to measure the ability of footballers to control the ball with their feet and head. During the test, the number of times participants could bounce the ball without dropping it with their feet or head was recorded. Before the test, the purpose and procedure were explained, and participants bounced a standard football on a flat surface wearing appropriate sports clothing. The test ended when the ball touched the ground or the participant lost control. Each participant took the test twice, and the highest number of repetitions was used in the analysis. The foot and head juggling test is recognized in the literature as a valid and reliable method for assessing the ball control and balance skills of footballers. This test is frequently used especially to evaluate the technical skills of young footballers and to determine their suitability for their playing positions on the field22.

Mor-Christian general football skill test

The Mor-Christian football test (dribbling and shooting) was applied to measure technical football skills. In the validity and reliability study of the test, validity coefficients of 0.73 for dribbling and 0.91 for shooting were reported, and reliability coefficients of 0.80 for dribbling and 0.98 for shooting were obtained using the test–retest approach. A station was prepared for the dribbling test: a circle 18 m in diameter was measured and marked, and 12 cones (45 cm high) were arranged around it at 4.5 m intervals. A 1 m starting line was marked outside the circle, perpendicular to it. Participants started from the starting line, slalomed between the cones, and finished the test when they returned to the starting line. After the familiarisation phase, each participant performed two repetitions and the best time was recorded in seconds. For the shooting test, four circular targets with a diameter of 1.21 m were placed in the corners of the football goal, and four shots were taken at each target in turn. Shots that hit the intended target scored 10 points and shots that hit a different target scored 4 points23,24.

20-metre dribbling test

The 20 m dribbling test was performed on a flat track 20 m long and 3 m wide. Five cones were placed along the track at 5 m intervals from the starting line, arranged so that players dribbled the ball in a slalom through them. After the familiarisation phase, participants performed two repetitions and the best time was recorded in seconds25.

Machine learning approaches

To enhance the model’s reliability and prediction performance, several machine learning algorithms were used in this study. The variables used in the classification models were chosen in line with the technical requirements of football playing positions and consist of performance indicators such as dribbling, 20-metre dribbling, shooting and ball control, together with anthropometric measurements. Algorithmic feature selection methods such as RFE or LASSO were not used; variables were selected based on expert opinion, playing position-specific performance requirements, and content validity.

The data modelling process was carried out in Python. During model creation, the data were first standardised and then split, with stratification to preserve the class distribution, into a 70% training set and a 30% test set. Stratified 10-fold cross-validation was applied during training to increase model generalisability and limit overfitting. By keeping the class ratios constant in each fold, this method evaluated the stability of model performance across different subsamples.
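The stratified 70/30 split described above can be illustrated with a minimal standard-library sketch. The authors' code is not available, so the function name `stratified_split`, the seed, and the per-class rounding rule are assumptions made for illustration; only the class sizes (66/67/67) and the 30% test fraction come from the text.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.30, seed=0):
    """Return (train_idx, test_idx), sampling the test set class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumed rule: round each class's share of the test set independently.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Class sizes reported in the study: defence = 66, midfield = 67, forward = 67.
labels = ["defence"] * 66 + ["midfield"] * 67 + ["forward"] * 67
train_idx, test_idx = stratified_split(labels)

test_counts = Counter(labels[i] for i in test_idx)
train_counts = Counter(labels[i] for i in train_idx)
print("test:", dict(test_counts))
print("train:", dict(train_counts))
```

Under these assumptions the test set holds 60 players, 20 per position, and the training set holds the remaining 140. In a scikit-learn workflow the same effect is usually obtained by passing `stratify=y` to `train_test_split`.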

The ML process utilised SVM (RBF kernel), K-Nearest Neighbour (KNN), Logistic Regression, Decision Tree, and Gaussian Naive Bayes algorithms. The models were optimised through hyperparameter tuning; the kernel type and regularisation coefficient for SVM, the number of neighbours for KNN, and the splitting criterion and tree depth for Decision Tree were improved using the grid search method. The models were compared based on accuracy, precision, sensitivity, and macro F1-score. For the SVM (RBF kernel) model, which showed the best performance, confusion matrices and ROC curves were obtained to examine class-based performance in more detail, and AUC values were calculated for each class.

Machine learning algorithms

Support vector machines (SVM): RBF kernel

SVM is a powerful supervised learning method that aims to find a separating hyperplane that maximises the margin between classes by transforming the data into a high-dimensional feature space. The fundamental approach relies on defining the optimal decision surface through “support vectors”, which are the examples closest to the decision boundary. Due to its ability to successfully model non-linear relationships, the RBF kernel is frequently preferred for classifying complex performance data such as football26. In this study, the SVM model was employed using the one-vs-rest strategy due to the multi-class structure of the playing positions (Defence, Midfield, Forward). Hyperparameters such as C (regularisation coefficient) and gamma (kernel width) were tuned using grid search and cross-validation. SVM was selected because it provides high generalisability, particularly in situations where class boundaries are sharp and non-linear27.
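As a small illustration of the kernel named above, the sketch below evaluates the standard RBF similarity K(x, y) = exp(−γ‖x − y‖²) for two points. The function name `rbf_kernel` and the example points are illustrative assumptions and are not taken from the study.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Two illustrative (already standardised) feature vectors.
a = [0.0, 0.0]
b = [1.0, 1.0]

for gamma in (0.1, 1.0, 10.0):
    print(f"gamma={gamma}: K(a, b) = {rbf_kernel(a, b, gamma):.6f}")
```

Larger gamma values make the similarity decay faster with distance, so each training example influences a smaller neighbourhood; this is why gamma is tuned together with C.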

K-nearest neighbours (KNN)

KNN is a non-parametric, sample-based classification algorithm. In this method, a new observation to be classified is assigned based on the class label of the k nearest neighbours in the training data (majority vote). Operating on a distance metric (usually Euclidean distance), KNN does not produce a closed-form function for the decision boundary; instead, it uses the local neighbourhood structure by keeping the entire training set in memory. In this study, the K value (number of neighbours) was determined through experimental testing and cross-validation, and standardisation was applied using StandardScaler to prevent differences in feature scales from distorting the distance calculation28,29.
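The majority-vote rule and the role of standardisation can be sketched as follows. This is a toy illustration using only the standard library: the helper names and the six example players (height in cm, dribbling time in s) are invented for demonstration and are not study data.

```python
import math
from collections import Counter
from statistics import mean, pstdev

def fit_scaler(rows):
    """Per-feature (mean, standard deviation), as a z-score scaler would learn."""
    return [(mean(col), pstdev(col)) for col in zip(*rows)]

def transform(rows, params):
    """Apply z-score scaling so that no single unit dominates the distance."""
    return [[(v - m) / s if s else 0.0 for v, (m, s) in zip(row, params)]
            for row in rows]

def knn_predict(train_X, train_y, query, k=3):
    """Assign the majority label among the k nearest training points."""
    nearest = sorted((math.dist(x, query), label)
                     for x, label in zip(train_X, train_y))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical players: [height_cm, dribbling_s]. Values are illustrative only.
train_raw = [[170, 15.8], [171, 15.5], [169, 15.9],
             [176, 14.6], [175, 14.8], [177, 14.5]]
train_y = ["defence"] * 3 + ["forward"] * 3

params = fit_scaler(train_raw)            # fitted on training data only
train_X = transform(train_raw, params)
query = transform([[174, 14.9]], params)[0]

prediction = knn_predict(train_X, train_y, query, k=3)
print(prediction)
```

Without scaling, the height column (differences of several centimetres) would outweigh the dribbling column (differences of tenths of a second) in the Euclidean distance.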

Logistic regression

Logistic regression is a classical linear model capable of estimating probabilities in classification problems. The probability of the dependent variable belonging to a specific class is modelled through a linear combination via the logistic (sigmoid) function. In this study, logistic regression was used as the baseline model for comparison purposes in multi-class playing position estimation. Features were scaled using StandardScaler, and the model was stabilised against overfitting risk using L2 penalty (ridge regularization). Although not as powerful as SVM in capturing more complex non-linear boundaries, logistic regression has the advantage of interpretability, and the low accuracy/F1 scores obtained serve as a baseline to highlight the additional contribution of the more advanced algorithms (SVM, KNN) used in the study30,31.

Gaussian Naive Bayes (GNB)

Naive Bayes classifiers are probabilistic models based on Bayes’ theorem that assume the features are conditionally independent of one another given the class. In the Gaussian Naive Bayes version used in this study, the conditional distribution of each continuous feature is assumed to be normal (Gaussian). Although the feature-independence assumption is rarely fully satisfied in practice, GNB is fast and can produce surprisingly effective results even with small samples and high-dimensional data. In this study, GNB was used as a ‘simple but useful’ comparison model (baseline) because it is computationally light and its parameters can be estimated with closed-form solutions. The low accuracy and F1 scores obtained are important in demonstrating the performance improvement provided by more complex methods (particularly SVM and KNN).

Performance metrics and model selection

The performance of the models was compared with performance metrics calculated on the test data set. These metrics include accuracy, precision, recall and F1 score. Using the Confusion Matrix, the correct and incorrect classification rates of each model were analysed in detail. ROC (Receiver Operating Characteristic) analysis and metrics obtained from the confusion matrix were used to evaluate the performance of the machine learning models. The ROC curve evaluates the ability of the model to discriminate the positive class at various thresholds. The AUC (Area Under the Curve) value on the ROC curve refers to the area under the ROC curve and is a single score summarising the classification performance of the model. As the AUC value approaches 0.5, it means that the model makes a random prediction, and as it approaches 1, it means that the model makes a perfect discrimination. According to the data obtained from the Confusion Matrix, True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN) values were calculated (Fig. 1).

Fig. 1.

A true positive (TP) is a positive-class (class 1) case that the classifier correctly predicts as positive; a false positive (FP) is a negative-class (class 2) case that the classifier incorrectly predicts as positive; a false negative (FN) is a positive-class case that the classifier incorrectly predicts as negative; and a true negative (TN) is a negative-class case that the classifier correctly predicts as negative. The equations given below are used to calculate the performance metrics.

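$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$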

These metrics, obtained with the help of ROC analysis and confusion matrix, evaluate the accuracy, sensitivity and overall performance of each machine learning model. While the ROC curve and AUC value show how well the model distinguishes the positive class, metrics such as accuracy, sensitivity and F1 score obtained from the confusion matrix evaluate the accuracy, precision and error rates of the model in detail.
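To make the AUC interpretation concrete, the sketch below computes a one-vs-rest AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties count as one half). This is the standard rank-based definition of AUC rather than the authors' code; the function name and the six example scores are illustrative assumptions.

```python
def roc_auc(scores, is_positive):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for the "Forward vs rest" problem.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
is_forward = [True, True, False, True, False, False]

auc = roc_auc(scores, is_forward)
print(f"AUC = {auc:.3f}")
```

A model whose scores carry no information ranks about half of the pairs correctly (AUC near 0.5), whereas perfect separation of the two groups gives an AUC of 1.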

Statistical analysis

Statistical analyses were performed in the Python environment, and a significance level of p < 0.05 was accepted for all evaluations. One-way analysis of variance (ANOVA) was applied to the technical performance variables (ball juggling, shooting, dribbling, 20 m dribbling) and the anthropometric variables (age, height, weight, BMI). To assess the assumptions of ANOVA, normality was tested using the Shapiro–Wilk test, and homogeneity of variance was tested using the Levene test. Classical ANOVA was used for variables with homogeneous variances, and Welch ANOVA was used for those without. For variables found to be significant in ANOVA, Bonferroni-corrected post-hoc tests were applied to determine the differences between groups.

Results

Table 1 presents the results of the ANOVA test for participants according to the playing position they played. One-way ANOVA results demonstrated significant group differences for Age (F(2,197) = 4.583, p = 0.011), Height (F(2,197)=29.341, p < 0.001), BMI (F(2,197) = 22.452, p < 0.001), Head Ball Bounce (HBB) (F(2,197) = 3.428, p = 0.034), and Dribbling (F(2,197) = 4.624, p = 0.011). Post-hoc Bonferroni tests indicated that Midfield players were older than Defence players (M > D; p = 0.009); Forward players were significantly taller than both Midfield (F > M; p < 0.001) and Defence players (F > D; p < 0.001); Forward players had lower BMI values compared with Defence (F < D; p < 0.001) and Midfield players (F < M; p < 0.001); Forward players showed higher head juggling performance than Defence players (F > D; p = 0.034); and Forward players demonstrated better Dribbling performance (lower time) than Defence players (F < D; p = 0.009). No significant differences were observed for Weight, Foot Ball Bounce (FBB), Shot, or 20-m Dribbling (p > 0.05).

Table 1.

Comparison of demographic and performance parameters of players according to playing position.

Parameters Group N Mean ± SD F p-value Bonferroni
Age (year) Defence 66 15.87 ± 0.54 4.583 0.011 M-D (p = 0.009)
Midfield 67 16.20 ± 0.50
Forward 67 16.07 ± 0.80
Weight (kg) Defence 66 56.81 ± 3.32 0.821 0.442
Midfield 67 56.20 ± 2.63
Forward 67 56.56 ± 2.20
Height (cm) Defence 66 170.69 ± 5.16 29.341 < 0.001 F-M (p < 0.001); F-D (p < 0.001)
Midfield 67 169.91 ± 3.38
Forward 67 175.35 ± 4.61
BMI (kg/m²) Defence 66 19.54 ± 1.56 22.452 < 0.001 D-F (p < 0.001); M-F (p < 0.001)
Midfield 67 19.46 ± 0.61
Forward 67 18.41 ± 0.86
Mixed juggling (rep) Defence 66 95.12 ± 53.69 1.801 0.168
Midfield 67 106.13 ± 58.15
Forward 67 74.47 ± 28.56
Head juggling (rep) Defence 66 27.07 ± 9.64 3.428 0.034 F-D (p = 0.034)
Midfield 67 32.05 ± 15.28
Forward 67 34.22 ± 21.25
Shoot (rep) Defence 66 25.09 ± 7.03 0.992 0.373
Midfield 67 26.68 ± 6.65
Forward 67 25.07 ± 8.90
Dribbling (sec) Defence 66 15.62 ± 1.91 4.624 0.011 F-D (p = 0.009)
Midfield 67 15.06 ± 1.23
Forward 67 14.76 ± 1.76
20 m Dribbling (sec) Defence 66 2.95 ± 0.41 2.286 0.104
Midfield 67 2.82 ± 0.26
Forward 67 2.92 ± 0.40

FBB Foot Ball Bounce, HBB Head Ball Bounce.
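As a worked check of how a one-way ANOVA F value follows from group summaries, the sketch below recomputes F for Height from the group sizes, means, and standard deviations in Table 1. It assumes the classical (equal-variance) ANOVA and that the reported SDs are sample standard deviations; the function name is illustrative, and small deviations from the published value are expected because the table entries are rounded.

```python
def anova_f_from_summary(ns, means, sds):
    """One-way ANOVA F statistic from per-group n, mean, and sample SD."""
    n_total = sum(ns)
    k = len(ns)
    grand_mean = sum(n * m for n, m in zip(ns, means)) / n_total

    # Between-group and within-group sums of squares.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    ss_within = sum((n - 1) * sd ** 2 for n, sd in zip(ns, sds))

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Height (cm) rows of Table 1: Defence, Midfield, Forward.
f_height, df1, df2 = anova_f_from_summary(
    ns=[66, 67, 67],
    means=[170.69, 169.91, 175.35],
    sds=[5.16, 3.38, 4.61],
)
print(f"F({df1},{df2}) = {f_height:.2f}")
```

With these inputs the recomputed statistic falls close to the reported F(2,197) = 29.341 for Height.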

Figure 2 presents the confusion matrix results for the SVM (RBF kernel) classification model. The model classified players in the Forward playing position with 100% accuracy, while correctly classifying 85% of Midfield players and 75% of Defence players. The majority of misclassifications occurred between the Defence and Midfield playing positions. This distribution indicates that playing position-specific performance variables are more pronounced in Forward players, but there is some overlap between the Defence and Midfield groups.

Fig. 2.

SVM confusion matrix for predicting playing position from participants’ performance.

The curves in the graph in Fig. 3 present the relationship between the True Positive Rate and False Positive Rate of the model for each class at different thresholds. Accordingly, the model’s discriminative power was found to be high across all playing positions. The AUC value for the Forward class is 1.00, indicating perfect discrimination. High classification performance was also achieved for the Defence (AUC = 0.96) and Midfield (AUC = 0.94) classes. The fact that all curves are above the 0.50 reference line demonstrates that the model performs significantly better than random classification for all playing positions (Table 2).

Fig. 3.

ROC curves of the SVM (RBF kernel) model for each playing position.

Table 2.

Classification performance with SVM (RBF Kernel).

Class name Recall Specificity Precision F1-Score
Defence 0.75 1.00 1.00 0.85
Forward 1.00 0.92 0.86 0.93
Midfield 0.85 0.87 0.77 0.80
Accuracy 0.86
Misclassification Rate 0.13

The classification performance metrics of the SVM (RBF kernel) model differed across playing positions. The forward playing position was the group classified with the highest accuracy, with a recall of 1.00 and an F1-score of 0.93. For the defence playing position, the model showed high specificity (specificity = 1.00) and high precision (precision = 1.00), but the recall remained moderate (recall = 0.75). For the Midfield playing position, the classification performance was moderate, with a recall of 0.85 and an F1-score of 0.80. The overall accuracy of the model was 86% (accuracy = 0.86), while the misclassification rate was 13% (misclassification rate = 0.13). These findings indicate that the model performed strongly in distinguishing the Forward playing position, but showed limited performance in distinguishing between Defence and Midfield.

Table 3 presents a comparative analysis of the classification performance metrics of different machine learning models. The SVM (RBF kernel) model demonstrated the highest performance across all metrics (accuracy = 0.87; F1-macro = 0.87). The KNN model showed moderate performance (accuracy = 0.80), but lagged behind the SVM. The low accuracy and F1-macro values of the Logistic Regression and Gaussian Naive Bayes models (accuracy = 0.52) indicate that these models are inadequate for playing position classification. Overall, the SVM model offered the strongest classification performance, while KNN was the second most successful model; Logistic Regression and Gaussian NB were considered baseline reference models.

Table 3.

Playing position classification performance of different ML algorithms.

Model Accuracy Precision Recall F1-Macro
SVM (RBF Kernel) 0.866667 0.880764 0.866667 0.865633
KNN 0.800000 0.806017 0.800000 0.795792
Logistic regression 0.516667 0.513889 0.516667 0.512759
Gaussian NB 0.516667 0.498016 0.516667 0.503590
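The per-class values in Table 2 and the SVM row of Table 3 can be related to a single confusion matrix. The matrix itself appears only in Fig. 2 and is not reproduced in the text, so the counts below are an assumption: they are one matrix that is consistent with the reported figures, under the further assumption that the 30% test set contained 60 players (20 per position). The function name is illustrative.

```python
labels = ["Defence", "Midfield", "Forward"]

# ASSUMED confusion matrix (rows = true class, columns = predicted class),
# inferred from the reported metrics; not taken from the article itself.
cm = [
    [15, 5, 0],   # Defence:  15 correct, 5 predicted as Midfield
    [0, 17, 3],   # Midfield: 17 correct, 3 predicted as Forward
    [0, 0, 20],   # Forward:  all 20 correct
]

def one_vs_rest_metrics(cm, k):
    """Recall, specificity, precision, and F1 for class k against the rest."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    tn = total - tp - fn - fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "recall": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }

per_class = {name: one_vs_rest_metrics(cm, i) for i, name in enumerate(labels)}
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
macro_precision = sum(m["precision"] for m in per_class.values()) / len(labels)
macro_f1 = sum(m["f1"] for m in per_class.values()) / len(labels)

for name, m in per_class.items():
    print(name, {key: round(value, 4) for key, value in m.items()})
print("accuracy", round(accuracy, 6))
print("macro precision", round(macro_precision, 6))
print("macro F1", round(macro_f1, 6))
```

Under this assumed matrix, the accuracy, macro precision, and macro F1 agree with the SVM row of Table 3 to six decimals, and the per-class values agree with Table 2 if the latter are read as truncated to two decimals (for example, a Forward precision of 20/23 ≈ 0.8696 shown as 0.86).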

Figure 4 shows the feature importance values obtained for the SVM (RBF kernel) model. The variables that contributed most to playing position classification were, in order, 20-metre dribbling, shooting performance, body weight, and dribbling. Permutation of these variables caused the greatest decline in model accuracy. In contrast, the juggling_head and juggling_mixed variables had a rather limited effect on the model. The standard deviation bars in the graph reflect the variability between permutation repetitions for each variable. These findings indicate that playing position classification is strongly determined by performance components that particularly require speed, finishing ability, and physical strength.

Fig. 4.

Feature importance values for the SVM (RBF kernel) model (permutation importance method).
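The permutation importance idea (shuffle one feature, measure the drop in accuracy, repeat) can be sketched with the standard library alone. The toy model, data, and function names below are illustrative assumptions; they stand in for the fitted SVM and the study data, which are not available.

```python
import random
from statistics import mean, stdev

def accuracy(predict, X, y):
    """Share of rows for which the model's prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean and SD of the accuracy drop when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    results = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        results.append((mean(drops), stdev(drops)))
    return results

# Toy stand-in model: uses only feature 0 (a sprint-dribble time in seconds).
def toy_predict(row):
    return "forward" if row[0] < 2.9 else "defence"

# Hypothetical rows: [dribble_20m_s, juggling_reps]. Values are illustrative.
X = [[2.6, 10], [2.7, 30], [2.8, 20], [3.0, 40], [3.1, 15], [3.2, 25]]
y = ["forward"] * 3 + ["defence"] * 3

importances = permutation_importance(toy_predict, X, y)
for name, (drop, spread) in zip(["dribble_20m_s", "juggling_reps"], importances):
    print(f"{name}: mean drop = {drop:.3f}, sd = {spread:.3f}")
```

A feature the model ignores shows no accuracy drop when shuffled, while a feature the model relies on shows a positive drop; the spread across repeats corresponds to the standard deviation bars described for Fig. 4.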

Discussion

In this study, the playing positions of footballers were predicted from their anthropometric and performance parameters using ML algorithms. Conventional statistical analysis showed significant differences between playing positions in age, height, BMI, head juggling, and dribbling. Among the ML models that included all parameters, SVM (RBF kernel) predicted the players’ playing positions with the highest accuracy. Furthermore, the permutation-based analysis of the parameters influencing ML performance indicated that 20 m dribbling, shooting, body weight, and dribbling were the most important parameters in predicting playing positions. In this context, the hypotheses identified in our study were confirmed.

ML and AI methods are used in football, as in all sports, particularly for analysing complex data, and are frequently used to predict the performance of athletes at all levels. They are often used in analysing match and training performance, predicting match outcomes, predicting injury risk, and predicting certain physiological needs. Indeed, Herolds et al. emphasised that machine learning techniques have achieved high success in determining the factors that affect player performance and revealed that these methods increase the importance of data-driven approaches in sports science32. Similarly, Berrar et al. also reported that ML methods can be successfully applied to predict match results using past match data13. In some studies, ML methods have also been used to predict athletes’ playing positions and injury risks15. In addition, Schuth et al. stated that ML algorithms make significant contributions to the development of football strategies12. Ćwiklinski et al. highlighted the role of this technology in optimizing team performance and supporting individual player development33. ML algorithms extract meaningful patterns from large data sets thanks to their different algorithms, and the results obtained provide a significant advantage in optimising decision-making processes. Therefore, the wide range of such research and the results obtained from algorithms emphasise the need to use new features.

Research using ML methods in sports science is producing significant outputs for many disciplines. ML algorithms are increasingly being used, particularly in sports where decisions are often based on subjective assessments. A recent literature review emphasised that algorithms for evaluating human movement should provide personalised, informative results that can be computed in real time or during exercise, and that they should be explainable and interpretable. The same review concluded that data imbalance, the trade-off between accuracy and speed, appropriate algorithm selection, and evaluation approaches are also fundamental considerations34. One important issue is therefore the choice of algorithm. Decision trees, K-Nearest Neighbour (KNN), and SVM are widely used in classification problems35, and in our study the SVM model achieved the highest accuracy. One of the main reasons for choosing SVM in our research is that it seeks the maximum-margin hyperplane for class separation and is suitable for small datasets. However, Dai suggests that the criticisms of SVM can be addressed by reconstructing the algorithm’s features and examining its limitations on different examples36.

In our study, the SVM model classified forwards with 100% accuracy but was less successful for the defensive and midfield playing positions. This may stem from playing position-specific technical characteristics not always being clearly distinguishable in younger age groups. In particular, defenders and midfielders exhibit similar performance profiles in tests such as dribbling, short-distance ball control and shooting, which made it difficult for the model to distinguish between these two groups37,38. In contrast, forwards show more distinct and consistent characteristics in key variables such as 20-metre dribbling time, shooting performance and weight, enabling the model to recognise this playing position more easily. Positional roles are not yet fully established in young players (aged 15–17), which increases the uncertainty, particularly in the defence–midfield distinction. Because positional requirements, such as the physical endurance specific to defenders or the finishing skills specific to forwards, are not yet fully developed, the capacity of technical tests to reflect these differences is limited. Therefore, the imbalance in model performance can be explained by both the scope of the measured technical parameters and the natural homogeneity of the developmental period.
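
The uneven per-position performance described above is easiest to see when precision, recall, and F1 are computed separately for each class from the confusion matrix. The following Python sketch does this with the standard library only. The label lists are hypothetical and chosen purely for illustration (they are not the study's predictions), and the helper names `confusion_matrix` and `per_class_report` are assumptions of this sketch rather than part of the authors' analysis code.

```python
from collections import Counter

POSITIONS = ["defence", "midfield", "forward"]


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_report(matrix, labels):
    """Return {label: (precision, recall, f1)} from a confusion matrix."""
    report = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = (precision, recall, f1)
    return report


# Hypothetical test-set labels (NOT the study's data): 6 players per position.
y_true = ["defence"] * 6 + ["midfield"] * 6 + ["forward"] * 6
y_pred = (["defence"] * 4 + ["midfield"] * 2    # 2 defenders labelled midfield
          + ["midfield"] * 4 + ["defence"] * 2  # 2 midfielders labelled defence
          + ["forward"] * 6)                    # every forward recovered

matrix = confusion_matrix(y_true, y_pred, POSITIONS)
report = per_class_report(matrix, POSITIONS)
macro_f1 = sum(f1 for _, _, f1 in report.values()) / len(report)
accuracy = sum(matrix[i][i] for i in range(len(POSITIONS))) / len(y_true)

for label in POSITIONS:
    precision, recall, f1 = report[label]
    print(f"{label:9s} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
```

In this made-up example the forward class reaches a recall of 1.0 while defence and midfield are confused with each other, so the overall accuracy and macro F1 (both 0.78 here) hide a pattern that only the per-class rows reveal.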

Another likely reason for the lower prediction performance for playing positions other than forward may be the performance parameters selected in our study. Other ML studies have achieved higher performance levels using different performance parameters39. For example, Utomo and Wiradinata40 focused on the performance parameters required by the playing position, grouped as Passing Capabilities (Average Passes), Offensive Capabilities (Possession, etc.), Defensive Capabilities (Blocks, Through Balls, Tackles, etc.), and Summary (Playtime, Goals, Assists, Passing Percentage, etc.), and reached an accuracy of 69% to 76%. Gonçalves et al.41 investigated the relationship between GPS parameters and some physiological parameters of football players in different playing positions. Their results showed that the displacements of all players (defenders, midfielders and forwards) were closer and more coordinated to their own playing position-specific centroids than to other centroids; this effect was stronger for midfielders and weaker for forwards. Since athletes in different playing positions have different physiological and technical needs, we believe that additional parameters should be included in the ML algorithm alongside the performance parameters used in our research. Supporting this view, Beato et al. recently compared external load parameters (distance travelled, high-speed running, accelerations, decelerations, and metabolic load distance) of footballers in different playing positions42. They observed that physical performance in footballers varies according to the position played, but that physical performance in official matches is influenced not only by position but also by match dynamics and tactical requirements. This suggests that, in the fast and variable structure of today’s football, individual differences are decreasing and the physical and physiological demands of the playing positions are becoming increasingly similar.

Our study included fundamental football-specific skills and physical parameters. Although different results were obtained when different features were included in the model, the literature emphasises that subjective selection processes, particularly in high-dimensional or complex datasets, may lead to informative features being overlooked or certain types of variables being systematically favoured43. Therefore, in future studies, complementing expert judgement with algorithmic feature selection methods (such as Recursive Feature Elimination, or embedded methods such as LASSO and tree-based models) could enhance both model performance and the transparency of variable selection. Indeed, Saeys et al. emphasise that the combined use of filter, wrapper, and embedded feature selection strategies can strengthen both generalisability and model interpretability44. Similarly, recent reviews indicate that integrating domain knowledge with data-driven methods is the most appropriate approach for both reducing bias and increasing model interpretability45. Combining the expert-based variable selection followed in the current study with algorithmic feature selection procedures is therefore an area for development that would further strengthen methodological robustness and transparency.
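
As a minimal illustration of the filter family of feature selection methods mentioned above, the following Python sketch ranks features by their one-way ANOVA F statistic across playing positions. This is a deliberately simpler, dependency-free stand-in and not Recursive Feature Elimination or LASSO, both of which require fitting a model. The measurements, the feature names, and the functions `anova_f` and `rank_features` are hypothetical and do not come from the study.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for a list of samples (one list per class)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(rows, labels, feature_names):
    """Return (name, F) pairs sorted from most to least class-separating."""
    classes = sorted(set(labels))
    scores = []
    for j, name in enumerate(feature_names):
        groups = [[row[j] for row, lab in zip(rows, labels) if lab == c]
                  for c in classes]
        scores.append((name, anova_f(groups)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical measurements (NOT the study's data): three players per position.
feature_names = ["dribbling_20m_s", "height_cm"]
rows = [
    (3.9, 176), (4.0, 181), (4.1, 171),  # defence
    (3.8, 175), (3.9, 170), (4.0, 180),  # midfield
    (3.4, 172), (3.5, 177), (3.6, 182),  # forward
]
labels = ["defence"] * 3 + ["midfield"] * 3 + ["forward"] * 3

ranking = rank_features(rows, labels, feature_names)
for name, f_value in ranking:
    print(f"{name:16s} F={f_value:.2f}")
```

A filter score of this kind evaluates each feature in isolation, so in practice it would normally be combined with a wrapper or embedded method and computed inside the cross-validation loop, so that feature selection never sees the test data.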

This study has some important limitations. First, it was conducted on a specific age group, a specific race, and only on males. Therefore, to increase the generalisability of our findings, studies with a wider range of participants, covering athletes of different genders, ages, and races, are needed. Second, the variables used in the study are largely based on technical performance tests; however, playing position-specific characteristics are not limited to technical skills alone. The fact that performance components such as decision-making, game reading, positioning, acceleration, agility, and aerobic/anaerobic capacity were not measured may have limited the ability of the classification models to distinguish between certain playing positions. Furthermore, the greater overlap in the technical performance profiles of defenders and midfielders made it difficult for the model to distinguish between these two groups. Conversely, the more distinct profile of forwards in variables such as 20 m dribbling, shooting, and weight contributed to the model’s higher accuracy in recognising this playing position. This may be attributed to playing position-specific skills not yet being fully established in younger age groups (15–17 years) and to some playing positions naturally producing more similar technical outputs. Finally, feature selection in the study was based on conceptual evaluation rather than algorithmic methods; therefore, studies employing more advanced feature engineering, broader test batteries, and larger samples would provide a more robust foundation for the findings. These limitations indicate that caution should be exercised when generalising the results to different samples.

Conclusion

This study has demonstrated that playing position-specific technical performance indicators in young footballers can be successfully analysed using machine learning models. The SVM model showed particularly high accuracy for the forward playing position; however, technical similarities between defenders and midfielders made it difficult to classify these two groups. The findings suggest that playing position-specific performance differences may not be fully differentiated at a young age and that technical tests reflect certain positional characteristics only to a limited extent. Nevertheless, the study indicates that data-driven models have significant potential to support coaching decisions and can provide an objective contribution to playing position determination processes at an early stage. Future research utilising broader and more diverse datasets may enhance model performance.

Acknowledgements

We would like to thank Princess Nourah bint Abdulrahman University for supporting this project through Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R286), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Author contributions

Conceptualization: Z.I., Y.S., U.M., B.Y., A.K., and M.I.A.; data curation: Z.I., Y.S., U.M., Z.Y., S.B., B.Y., and A.K.; formal analysis: Z.I., S.R., and A.K.; methodology: Z.I., Y.S., U.M., A.K., and M.I.A.; writing of the original draft: Z.I., Y.S., U.M., Z.Y., S.B., B.Y., S.R., A.K., and M.I.A.; review and editing: Z.I., Y.S., U.M., Z.Y., S.B., B.Y., S.R., A.K., and M.I.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R286), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. The study sponsor had no role in the data analysis or collection, writing of the report, or decision to submit the manuscript for publication.

Data availability

Individual de-identified participant data, statistical code, and additional materials supporting the findings of this study are available for research purposes from the corresponding author upon reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Ethics approval and consent to participate

The study was approved by the Faculty of Physical Education and Basic Military Sciences Ethics Committee (decision number 2024/11). In addition, all participants and their families were informed about the purpose, rationale, and possible contributions of the study to the literature, and consent forms were signed by both participants and their families. The study was conducted in accordance with the principles set out in the Declaration of Helsinki.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Ahmet Kurtoğlu, Email: kurtogluahmet18@gmail.com.

Monira I. Aldhahi, Email: mialdhahi@pnu.edu.sa

References

  • 1. Mou, C. The attention mechanism performance analysis for football players using the internet of things and deep learning. IEEE Access 12, 4948–4957 (2024).
  • 2. Martín-Castellanos, A. et al. How do the football teams play in LaLiga? Analysis and comparison of playing styles according to the outcome. Int. J. Perform. Anal. Sport 24, 18–30 (2024).
  • 3. Richter, C., O’Reilly, M. & Delahunt, E. Machine learning in sports science: challenges and opportunities. Sports Biomech. 23, 961–967 (2024).
  • 4. Ross, G. B., Clouthier, A. L., Boyle, A., Fischer, S. L. & Graham, R. B. Comparison of machine learning classifiers for differentiating level and sport using movement data. J. Sports Sci. 40, 2166–2172 (2022).
  • 5. Kurtoğlu, A. et al. The role of morphometric characteristics in predicting 20-meter sprint performance through machine learning. Sci. Rep. 14, 16593 (2024).
  • 6. Silacci, A., Taiar, R. & Caon, M. Towards an AI-Based tailored training planning for road cyclists: A case study. Appl. Sci. 11, 313 (2020).
  • 7. Lin, L. S., Kao, C. H., Li, Y. J., Chen, H. H. & Chen, H. Y. Improved support vector machine classification for imbalanced medical datasets by novel hybrid sampling combining modified mega-trend-diffusion and bagging extreme learning machine model. Math. Biosci. Eng. 20, 17672–17701 (2023).
  • 8. Dijkhuis, T. B., Kempe, M. & Lemmink, K. A. P. M. Early prediction of physical performance in elite soccer matches: a machine learning approach to support substitutions. Entropy 23, 952 (2021).
  • 9. Nassis, G., Verhagen, E., Brito, J., Figueiredo, P. & Krustrup, P. A review of machine learning applications in soccer with an emphasis on injury risk. Biol. Sport 40, 233–239 (2023).
  • 10. Pillitteri, G. et al. Relationship between external and internal load indicators and injury using machine learning in professional soccer: a systematic review and meta-analysis. Res. Sports Med. 1–37. 10.1080/15438627.2023.2297190 (2023).
  • 11. Rico-González, M., Pino-Ortega, J., Méndez, A., Clemente, F. & Baca, A. Machine learning application in soccer: a systematic review. Biol. Sport 40, 249–263 (2023).
  • 12. Schuth, G. et al. Football movement Profile–Based Creatine-Kinase prediction performs similarly to global positioning System–Derived machine learning models in National-Team soccer players. Int. J. Sports Physiol. Perform. 1–8. 10.1123/ijspp.2024-0077 (2024).
  • 13. Berrar, D., Lopes, P. & Dubitzky, W. Incorporating domain knowledge in machine learning for soccer outcome prediction. Mach. Learn. 108, 97–126 (2019).
  • 14. Memmert, D., Lemmink, K. A. P. M. & Sampaio, J. Current approaches to tactical performance analyses in soccer using position data. Sports Med. 47, 1–10 (2017).
  • 15. Hewitt, J. H. & Karakuş, O. A machine learning approach for player and position adjusted expected goals in football (soccer). Frankl. Open 4, 100034 (2023).
  • 16. Manish, S., Bhagat, V. & Pramila, R. Prediction of Football Players Performance using Machine Learning and Deep Learning Algorithms. In 2nd International Conference for Emerging Technology (INCET) 1–5 (IEEE, 2021). 10.1109/INCET51464.2021.9456424.
  • 17. Cortez, A., Trigo, A. & Loureiro, N. Football match Line-Up prediction based on physiological variables: A machine learning approach. Computers 11, 40 (2022).
  • 18. Bongiovanni, T. et al. How do football playing positions differ in body composition? A first insight into white Italian Serie A and Serie B players. J. Funct. Morphol. Kinesiol. 8, 80 (2023).
  • 19. Teixeira, J. E. et al. Effects of match Location, quality of opposition and match outcome on match running performance in a Portuguese professional football team. Entropy 23, 973 (2021).
  • 20. Michailidis, Y. The relationship between aerobic Capacity, anthropometric Characteristics, and performance in the Yo-Yo intermittent recovery test among elite young football players: differences between playing positions. Appl. Sci. 14, 3413 (2024).
  • 21. Faul, F., Erdfelder, E., Lang, A. G. & Buchner, A. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav. Res. Methods 39, 175–191 (2007).
  • 22. Köksal, M., Gül, G. K., Doğanay, M. & Álvarez-Garcia, C. Effects of coordination training on the technical development in 10-/13-year-old football players. J. Sports Med. Phys. Fitness 61 (2021).
  • 23. Ibrahim, C., Kuan, G., Muhamad, A. S., Kueh, Y. C. & Chin, N. S. The effect of virtual reality imagery on motivation and football kicking skill performance among youth football players in Sarawak. 57–70. 10.1007/978-981-19-8159-3_5 (2023).
  • 24. Arslan, Y. & Ermiş, E. The effects of life kinetic exercises on technical skills and motor skills performance in young football players. Eur. J. Phys. Educ. Sport Sci. 9 (2023).
  • 25. Russell, M., Benton, D. & Kingsley, M. Reliability and construct validity of soccer skills tests that measure passing, shooting, and dribbling. J. Sports Sci. 28, 1399–1408 (2010).
  • 26. Cui, C. Player detection based on support vector machine in football videos. Int. J. Perform. Eng. 10.23940/ijpe.18.02.p12.309319 (2018).
  • 27. Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297 (1995).
  • 28. Halder, R. K., Uddin, M. N., Uddin, M. A., Aryal, S. & Khraisat, A. Enhancing K-nearest neighbor algorithm: a comprehensive review and performance analysis of modifications. J. Big Data 11, 113 (2024).
  • 29. Adem, K. Diagnosis of breast cancer with stacked autoencoder and subspace kNN. Phys. A: Stat. Mech. Its Appl. 551, 124591 (2020).
  • 30. Nusinovici, S. et al. Logistic regression was as good as machine learning for predicting major chronic diseases. J. Clin. Epidemiol. 122, 56–69 (2020).
  • 31. Levy, J. J. & O’Malley, A. J. Don’t dismiss logistic regression: the case for sensible extraction of interactions in the era of machine learning. BMC Med. Res. Methodol. 20, 171 (2020).
  • 32. Herold, M. et al. Machine learning in men’s professional football: current applications and future directions for improving attacking play. Int. J. Sports Sci. Coach. 14, 798–817 (2019).
  • 33. Ćwiklinski, B., Giełczyk, A. & Choraś, M. Who will score? A machine learning approach to supporting football team Building and transfers. Entropy 23, 90 (2021).
  • 34. Frangoudes, F., Matsangidou, M., Schiza, E. C., Neokleous, K. & Pattichis, C. S. Assessing human motion during exercise using machine learning: A literature review. IEEE Access 10, 86874–86903 (2022).
  • 35. Chen, R. C., Dewi, C., Huang, S. W. & Caraka, R. E. Selecting critical features for data classification based on machine learning methods. J. Big Data 7, 52 (2020).
  • 36. Dai, H. Research on SVM improved algorithm for large data classification. In IEEE 3rd International Conference on Big Data Analysis (ICBDA) 181–185 (IEEE, 2018). 10.1109/ICBDA.2018.8367673.
  • 37. Pedretti, A., Pedretti, A., De Oliveira Fernandes, J. B., Rebelo, C. A. N. & Teixeira Seabra, A. F. The relative age effects in young soccer players and it relations with the competitive level, specific position, morphological characteristics, physical fitness and technical skills. Pensar a Prática 19 (2016).
  • 38. Joo, C. H. & Seo, D. I. Analysis of physical fitness and technical skills of youth soccer players according to playing position. J. Exerc. Rehabil. 12, 548–552 (2016).
  • 39. Razali, N., Mustapha, A., Yatim, F. A. & Ab Aziz, R. Predicting player position for talent identification in association football. IOP Conf. Ser. Mater. Sci. Eng. 226, 012087 (2017).
  • 40. Sander Utomo, K. & Wiradinata, T. Optimal playing position prediction in football matches: A machine learning approach. Int. J. Inform. Eng. Electron. Bus. 15, 30–47 (2023).
  • 41. Gonçalves, B. V., Figueira, B. E., Maçãs, V. & Sampaio, J. Effect of player position on movement behaviour, physical and physiological performances during an 11-a-side football game. J. Sports Sci. 32, 191–199 (2014).
  • 42. Beato, M., Youngs, A. & Costin, A. J. The analysis of physical performance during official competitions in professional English football: do Positions, game Locations, and results influence players’ game demands? J. Strength Cond. Res. 38, e226–e234 (2024).
  • 43. Łukaszuk, T. & Krawczuk, J. Importance of feature selection stability in the classifier evaluation on high-dimensional genetic data. PeerJ 12, e18405 (2024).
  • 44. Saeys, Y., Inza, I. & Larrañaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507–2517 (2007).
  • 45. Pudjihartono, N., Fadason, T., Kempa-Liehr, A. W. & O’Sullivan, J. M. A review of feature selection methods for machine Learning-Based disease risk prediction. Front. Bioinf. 2 (2022).

Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
