Abstract
The Korea Baseball Organization (KBO) introduced the foreign player system in 1998 to enhance league competitiveness. In 2014, the number of foreign players increased to three per team. Since then, the ten KBO teams have routinely included two foreign pitchers and one batter. While the performance of foreign players significantly impacts the postseason qualification of the team, the contract renewal rates for pitchers and batters are only 34 % and 36 %, respectively. Therefore, a method that can aid in the contract renewal decision can help teams recruit high-caliber foreign players, improve their performance, raise the level of the KBO league, and provide an enjoyable experience for baseball fans. In this study, we use machine learning methods to predict the contract renewal decision and compare the prediction performances of various models. We use data on foreign player performances in Minor League Baseball and Major League Baseball immediately prior to joining KBO, in KBO upon joining, and player image data. By comparing the accuracy, area under the receiver-operating characteristic curve, and precision of prediction results based on performance in each league, we find that performance in KBO plays a significant role in improving the prediction. Additionally, a post-hoc analysis of batters reveals a gradual decline in the performance level of foreign batters who succeeded in KBO, which is found to be related to the results of international baseball tournaments. In conclusion, the proposed approach to player performance evaluation and contract renewal decisions can contribute to the long-term success of the teams and league.
Keywords: Contract renewal, Foreign players, International baseball tournaments, Korea baseball organization, Machine learning
1. Introduction
In May 2020, during the COVID-19 pandemic, ESPN started broadcasting games from the Korea Baseball Organization (KBO) live. This move sparked a surge of interest in KBO among fans and the American sports industry. Korean players like Chan-ho Park, Hyun-jin Ryu, and Jung-ho Kang, who had previously demonstrated their skills in Major League Baseball (MLB), drew significant attention from the American sports community and fans. Their performances fueled discussions on American media and social platforms. In June 2020, the Asia Society, an international academic research organization, hosted an online event themed “The Fascinating World of Korean Professional Baseball.” During this event, Matt Schiavenza, the content director of the organization, expressed his admiration for Korean baseball. He stated, “Baseball fans around the world now have the opportunity to witness one of the best professional baseball leagues in the world, the KBO” [1].
KBO has a 10-team structure. Similarly to MLB, KBO implements the designated hitter system, allowing a total of ten player positions in the game. To promote league competitiveness and provide fresh entertainment for baseball fans, KBO introduced a foreign player system in 1998, enabling each team to have up to two foreign players. Initially, most teams recruited two foreign pitchers. However, since 2014, the limit for foreign players has been increased to three per team, allowing the recruitment of foreign batters. Currently, each team commonly adopts a configuration of two pitchers and one batter, which is the prevailing practice in the league.
Each KBO team strives to attract audiences and generate profits by targeting postseason qualification, regular season championships, and the Korean Series championships. An examination of the Wins Above Replacement (WAR) records for the past eight years, maintained by Statiz, for each position reveals that teams with the top five pitchers have a 77.5 % probability of making the postseason, whereas teams with top-ranking shortstops have a 72.5 % probability, indicating the significant impact of pitchers [2]. Because KBO operates with a five-man starting rotation, with two spots typically filled by foreign pitchers, the importance of scouting for talented players becomes even more crucial.
Foreign KBO players do not sign multi-year contracts; instead, their contracts are renewed every season. If a player's performance is deemed unsatisfactory, the team must still pay the full contract amount upon their release, resulting in financial losses. Therefore, acquiring talented players is crucial for teams to avoid losses and increase their probability of postseason qualification. A team's decision to renew a player's contract can be interpreted as a sign of satisfaction with their performance and can be considered a successful scouting effort. In fact, over the past eight years, the probability of a team making the postseason when a player who debuted in KBO renews their contract is approximately 64 % for batters and 50.85 % for pitchers [2,3]. If at least one of the three foreign players secures contract renewal, the probability of postseason qualification exceeds 50 %.
Every year, each KBO team dispatches scouts to recruit foreign players who can perform well in the league. However, the contract renewal rates for foreign pitchers and batters were 34 % and 36 %, respectively, for the 2014–2022 seasons, highlighting the challenges faced by scouts in finding foreign players who could succeed in KBO [3]. This suggests that scouts from KBO teams are uncertain about which evaluation metrics from MLB and Minor League Baseball (MiLB) are particularly important for scouting. The importance of sabermetrics was highlighted in the movie “Moneyball” and Michael Lewis's book, “Moneyball: The Art of Winning an Unfair Game,” emphasizing the significance of data analysis in the field of sports [4]. Furthermore, with the development of the sports gaming industry, the use of analytical tools to predict the outcomes of sports games has become increasingly important [5]. Recent studies have used data from sources such as Baseball-Reference, FanGraphs, Baseball Savant, and the Professional Baseball Transactions Archive [6–9] to conduct research using machine learning and deep learning techniques. Huang and Li [10] proposed the use of the artificial neural network (ANN), support vector machine (SVM), and one-dimensional convolutional neural network for win–loss prediction. Yaseen et al. [11] used logistic regression (LR) and SVM to predict playoff qualifications. Valero [12] used the ANN, SVM, Decision Tree, and K-NN models to predict the results of MLB games. Elfrink [13] constructed models using random forest (RF), XGBoost, linear models, and boosted LR to predict game outcomes.
As is evident from the aforementioned studies, research in the field of baseball has primarily focused on predicting game outcomes. Similar research has been performed in other sports disciplines, again predominantly focusing on predicting game outcomes [14,15]. However, this study focuses on predicting the performance of foreign players, which plays a vital role in determining team rankings, postseason qualifications, and contract renewals in KBO. Moreover, with the increase in the popularity of KBO among foreign players, inferring the performance required for foreign players to secure contract renewals is also necessary. This study focuses on the performance levels of foreign players who have undergone contract renewals and their correlation with the international achievements of the South Korean baseball teams. Experimental results suggest that player performance in KBO plays a significant role in improving the prediction.
The aim of this study is to elucidate the impact of foreign players' performance in KBO on contract renewal decisions, with the goal of deriving valuable insights for scouting and team management. Through this research, we aim to address the following questions: Firstly, which performance metrics are particularly crucial in scouting? Secondly, do KBO performances of foreign players influence scouting and contract renewal decisions? Lastly, can the findings of this study be applied to overseas baseball leagues, including but not limited to the United States, Japan, and Australia, which recruit foreign players? The implementation of these findings can provide a robust framework for player recruitment, reducing the risk of financial losses associated with scouting failures and fostering a consistent performance in international competitions.
2. Materials and methods
2.1. Machine learning models
Five machine learning models were used in this study. Among them, LR applies linear regression to binary classification problems. It applies a linear combination of input variables and weights to a sigmoid function, producing probability values between 0 and 1. LR is used to classify data based on a decision boundary [16]. RF is an ensemble learning method that constructs multiple decision trees to make predictions. Each tree is trained on randomly selected data, and the predictions are combined by ensembling their results. RF is effective for reducing overfitting and improving prediction performance [17]. XGBoost is an ensemble learning model based on gradient-boosting trees. It iteratively trains trees to minimize the loss function and utilizes the prediction results in the training of the next tree to improve prediction performance. It is known for its excellent prediction performance and scalability [18]. LightGBM is a lightweight machine learning framework that uses efficient methods to grow trees by considering the characteristics and distribution of data. It provides fast training and prediction speeds and demonstrates excellent performance even on large-scale datasets [19]. CatBoost is a gradient-boosting algorithm that handles categorical variables. It performs automatic transformations of categorical variables and balances the tree structure to improve prediction performance. It provides unique methods for handling categorical variables, reduces data bias, and prevents overfitting [20].
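As a minimal illustration of the LR description above (a linear combination of inputs and weights passed through a sigmoid, followed by a decision boundary), the following sketch uses only the Python standard library. The feature values, weights, bias, and the 0.5 threshold are hypothetical and are not taken from the study's fitted models.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1), in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_renewal_probability(features, weights, bias):
    """Linear combination of inputs and weights, passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(probability, threshold=0.5):
    """Apply the decision boundary: 1 = renewal predicted, 0 = no renewal predicted."""
    return 1 if probability >= threshold else 0


# Hypothetical standardized inputs and made-up weights, for illustration only.
p = predict_renewal_probability([1.0, -0.5], weights=[0.8, -0.4], bias=0.0)
print(round(p, 3), classify(p))
```

In practice the weights are learned from the training folds; the sketch only shows how a fitted LR model turns a feature vector into a renewal probability.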
2.2. VGGFace
The objective of VGGFace, proposed by Parkhi et al. [21], is to achieve human-level performance in the task of face recognition. VGGFace has demonstrated excellent performance on the Labeled Faces in the Wild dataset, which consists of diverse face images collected from around the world [22], and the YouTube Faces Database, which is composed of various face images extracted from YouTube videos [23]. It is a deep learning-based model used for face image classification and face feature vector extraction. This model is based on the VGGNet architecture and learns diverse facial features using deep neural networks and convolutional layers with filters. It outperforms other face-recognition algorithms and has proven its potential for practical applications. The VGGFace model can be configured using one of the following model parameters: VGG16, ResNet50, or SENet50. This choice influences the performance and characteristics of the model. The model extracts features by passing input images of size 224 × 224 through multiple convolutional layers. The output of the last convolutional layer is fed to fully connected layers to generate features based on the chosen model parameter. In this study, optimization was performed over these parameter values.
2.3. Data sources and collection
The MLB, MiLB, and KBO data used in this study come from Korea Baseball, Statiz, Baseball-Reference, FanGraphs, and Baseball Savant, which are publicly available, validated sources that have been used in related research [3,4,6–8]. We used Korea Baseball and Statiz for data on KBO player performances for the season immediately prior to entry, team rankings upon entry, and injury history. Injuries were encoded and considered only in cases where players were listed as injured for more than four weeks. FanGraphs and Baseball Savant were used for MLB and MiLB performance data, as well as tracking data, for the season immediately prior to entry to KBO. If no MLB or MiLB records were available for the season immediately prior to entry to KBO, records from the past three years were used. Players with no records within the past three years were excluded from the analysis. Finally, the data of 123 pitchers and 77 hitters were selected for this study. Among them, 43 pitchers (35 %) and 28 hitters (37 %) secured contract renewals.
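The record-selection rule described above can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes that "records from the past three years" means the most recent available season among the three seasons preceding KBO entry, and the `records` layout (season year mapped to a stats dictionary) is hypothetical.

```python
def select_prior_season(records, kbo_entry_year, lookback=3):
    """Return (season, stats) for the most recent pre-KBO season within `lookback` years.

    `records` maps a season year to that season's stats (hypothetical layout).
    Returns None when no season in the window exists, meaning the player is excluded.
    """
    first = kbo_entry_year - 1          # season immediately prior to KBO entry
    last = kbo_entry_year - lookback    # oldest season still considered
    for season in range(first, last - 1, -1):
        if season in records:
            return season, records[season]
    return None


# Hypothetical player with MiLB seasons in 2015 and 2017 only.
example_records = {2015: {"FIP": 3.9}, 2017: {"FIP": 3.4}}
print(select_prior_season(example_records, 2018))  # uses the 2017 season
print(select_prior_season(example_records, 2021))  # no season in 2018-2020
```

If the study instead pooled all seasons inside the three-year window, the loop would collect every match rather than return the first one.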
According to Girdhar et al. [24], merging image and numerical data improves prediction performance. Therefore, we collected player profile pictures from Baseball-Reference, optimized the image embedding, and compared models that use only numerical data with models that merge image and numerical data.
Two virtual environments were used in this study. The first environment was utilized to embed image data, and the second was designated for preprocessing and modeling. In the first environment, Python 3.7.16, Keras 2.3.1, and TensorFlow 1.14.0 were employed. The second environment incorporated XGBoost 1.5.0, LightGBM 3.2.1, CatBoost 1.0.6, Python 3.9.13, scikit-learn 1.1.0 (LR, RF), and SHAP 0.41.0. The model architectures can also be examined on our GitHub repository at https://github.com/Ptaeshin/Prediction-of-Re-signing-Foreign-Players/tree/master.
2.4. Research framework
The objective of this study was to predict the contract renewals of foreign players in KBO. The framework followed is shown in Fig. 1 and is implemented for both pitchers and hitters. During the data collection phase, three main components were collected: MLB or MiLB data prior to KBO entry, KBO entry season data, and face image data. In the preprocessing phase, standardization was performed using season-specific data from FanGraphs for both MLB and MiLB. For instance, an FIP of 3.0 in the 2015 season and an FIP of 3.0 in the 2016 season are not equivalent, because the mean and standard deviation of the data differ between seasons. Therefore, separate standardization was conducted for both pitchers and hitters, utilizing the complete player data for each season in MLB and MiLB. While MLB and MiLB provide data for all players per season, certain evaluation metrics in KBO are only available for the top 30 players, making standardization difficult. Therefore, this step was omitted for the KBO data. Image data were embedded using the VGGFace model. The modeling phase involved the use of LR and four tree-based models.
Fig. 1.
Research framework.
Based on modeling results, performance improvement techniques were applied by merging numerical and image data. A comparison was made between the original embedding results of the VGGFace model and the dimensionality reduction of image data using principal component analysis (PCA) and kernel PCA.
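The season-wise standardization used in the preprocessing phase can be sketched as a per-season z-score. The sketch below is an assumption-laden illustration: the row layout, the `_z` suffix, and the use of the population standard deviation are choices made here, and the FIP values are invented to show that the same raw value maps to different standardized values in different seasons.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_by_season(rows, metric):
    """Add a z-score of `metric`, computed separately within each season.

    `rows` is a list of dicts with a 'season' key and a numeric `metric` key
    (hypothetical layout). The population standard deviation is used here.
    """
    values_by_season = defaultdict(list)
    for row in rows:
        values_by_season[row["season"]].append(row[metric])
    stats = {s: (mean(v), pstdev(v)) for s, v in values_by_season.items()}

    standardized = []
    for row in rows:
        mu, sigma = stats[row["season"]]
        z = (row[metric] - mu) / sigma if sigma > 0 else 0.0
        standardized.append({**row, metric + "_z": z})
    return standardized


# Invented league-wide FIP values for two seasons.
league_rows = [
    {"season": 2015, "FIP": 3.0}, {"season": 2015, "FIP": 4.0}, {"season": 2015, "FIP": 5.0},
    {"season": 2016, "FIP": 2.0}, {"season": 2016, "FIP": 3.0}, {"season": 2016, "FIP": 4.0},
]
standardized = standardize_by_season(league_rows, "FIP")
# An FIP of 3.0 is below the 2015 mean but exactly at the 2016 mean.
print(round(standardized[0]["FIP_z"], 3), round(standardized[4]["FIP_z"], 3))
```

The season statistics must come from the complete league data for that season, as described above, rather than from the foreign-player sample alone.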
3. Results
In this study, Python was used to perform fivefold cross-validation with fixed seed numbers, using MLB records, MiLB records, and image data as independent variables and contract renewal status as the dependent variable. Five classification models, namely, LR, RF, XGBoost, LightGBM, and CatBoost, were used for the analyses. For all models, the default hyperparameter settings consistently yielded the best performance, outperforming model-specific hyperparameter tuning. Accuracy, area under the receiver-operating characteristic curve (AUC), and precision were used as evaluation metrics. Furthermore, to improve performance, injury history was label-encoded according to whether a player had been injured for more than four weeks during their tenure in KBO. Additionally, team performance was included as an independent variable because teams may seek to recruit higher-level players based on their team rankings. While the injury history prior to joining KBO is an important factor in scouting [25,26], collecting the MiLB injury records at the level of KBO scouting is challenging because the Pro Sports Transactions Archives only provide MLB injury records. Based on the results of the analysis, Shapley Additive Explanations (SHAP) was used to examine the important features. SHAP is an algorithm used to explain the predictions of machine learning models by evaluating the contribution of each feature to the prediction. Trends at the KBO level were then inferred by analyzing the features of high importance. Furthermore, because the average and standard deviation of records vary across seasons, data were standardized by season before making predictions.
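For reference, the three evaluation metrics can be computed for a single validation fold as in the sketch below. It uses only the standard library rather than the scikit-learn functions the study presumably relied on, and the labels and scores are invented. Precision is returned as 0 when a model predicts no renewals at all, which matches scikit-learn's default behaviour and is consistent with the zero entries in the tables.

```python
def accuracy(y_true, y_pred):
    """Share of players whose renewal status was predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred):
    """Share of predicted renewals that were actual renewals (0.0 if none predicted)."""
    predicted_positive = [t for t, p in zip(y_true, y_pred) if p == 1]
    if not predicted_positive:
        return 0.0
    return sum(predicted_positive) / len(predicted_positive)


def roc_auc(y_true, scores):
    """AUC as the probability that a renewed player outscores a non-renewed one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented fold: 1 = contract renewed, 0 = not renewed.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(accuracy(y_true, y_pred), precision(y_true, y_pred), round(roc_auc(y_true, scores), 3))
```

Because each fold contains only about 9 to 21 players, single-fold values of these metrics move in large steps, which helps explain the fold-to-fold spread in the tables.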
3.1. Re-contracting prediction based on MiLB records
For the MiLB pitcher data, modeling was performed with 29 independent variables collected from FanGraphs and contract renewal status obtained from Statiz as the dependent variable. Among the 123 pitchers, those who had no MiLB (AAA) records in the three seasons preceding their KBO entry season or who pitched fewer than 30 innings in MiLB or KBO were excluded. Therefore, the data of only 103 players were analyzed. For MiLB hitter data, modeling was performed with 22 independent variables obtained from FanGraphs and contract renewal status obtained from Statiz as the dependent variable. Among the 77 players, those with fewer than 30 plate appearances in KBO were excluded. Consequently, the data of only 76 players were used for modeling.
3.1.1. Nonstandardization
The results of modeling without season-wise standardization preprocessing are presented in Table 1. For both pitchers and batters, the LR model outperformed the other models in terms of all three metrics. However, neither the accuracy nor the AUC values indicate good model performance. Nevertheless, the precision values for pitchers and batters were 0.438 and 0.507, respectively, which were higher than the contract renewal rates of 0.35 for pitchers and 0.37 for batters. We discuss these findings in more detail in Section 4.
Table 1.
Prediction Results based on Nonstandardized MiLB Records (Accuracy: ACC; Precision: PRE).
Position | CV fold | LR ACC | LR AUC | LR PRE | RF ACC | RF AUC | RF PRE | XGBoost ACC | XGBoost AUC | XGBoost PRE | LightGBM ACC | LightGBM AUC | LightGBM PRE | CatBoost ACC | CatBoost AUC | CatBoost PRE
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Pitcher | 1 | 0.667 | 0.529 | 0.571 | 0.429 | 0.25 | 0 | 0.381 | 0.135 | 0 | 0.524 | 0.240 | 0.25 | 0.429 | 0.240 | 0
Pitcher | 2 | 0.476 | 0.529 | 0.2 | 0.524 | 0.254 | 0.333 | 0.476 | 0.413 | 0.333 | 0.476 | 0.404 | 0.256 | 0.524 | 0.529 | 0.333
Pitcher | 3 | 0.619 | 0.558 | 0.5 | 0.714 | 0.514 | 0.4 | 0.571 | 0.510 | 0.4 | 0.571 | 0.413 | 0.429 | 0.571 | 0.490 | 0.4
Pitcher | 4 | 0.7 | 0.615 | 0.667 | 0.6 | 0.505 | 0 | 0.65 | 0.538 | 0.5 | 0.55 | 0.440 | 0.333 | 0.7 | 0.549 | 0.667
Pitcher | 5 | 0.5 | 0.438 | 0.25 | 0.55 | 0.323 | 0.5 | 0.5 | 0.592 | 0.333 | 0.4 | 0.271 | 0.167 | 0.55 | 0.344 | 0.333
Pitcher | Average | 0.592 | 0.533 | 0.438 | 0.563 | 0.423 | 0.247 | 0.516 | 0.378 | 0.313 | 0.504 | 0.354 | 0.293 | 0.554 | 0.431 | 0.347
Batter | 1 | 0.625 | 0.6 | 0.5 | 0.563 | 0.45 | 0.25 | 0.5 | 0.417 | 0.375 | 0.625 | 0.583 | 0.5 | 0.438 | 0.467 | 0.2
Batter | 2 | 0.667 | 0.611 | 0.667 | 0.533 | 0.491 | 0.333 | 0.467 | 0.463 | 0.25 | 0.467 | 0.472 | 0 | 0.533 | 0.5 | 0.333
Batter | 3 | 0.6 | 0.611 | 0.5 | 0.6 | 0.556 | 0.5 | 0.533 | 0.593 | 0.333 | 0.6 | 0.630 | 0.5 | 0.467 | 0.630 | 0
Batter | 4 | 0.733 | 0.48 | 0.667 | 0.667 | 0.98 | 1 | 0.733 | 0.82 | 0.667 | 0.733 | 0.8 | 1 | 0.8 | 0.96 | 1
Batter | 5 | 0.467 | 0.47 | 0.2 | 0.467 | 0.61 | 0.2 | 0.533 | 0.63 | 0.333 | 0.533 | 0.61 | 0.333 | 0.533 | 0.65 | 0.333
Batter | Average | 0.618 | 0.554 | 0.507 | 0.566 | 0.617 | 0.457 | 0.553 | 0.584 | 0.392 | 0.592 | 0.619 | 0.467 | 0.554 | 0.641 | 0.373
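The Average rows appear to be the arithmetic mean of the five fold values. This is an inference from the numbers, not something the text states. The short check below recomputes the averages for the LR pitcher model from the per-fold entries of Table 1 and compares them with the reported averages, allowing for the rounding of the per-fold values. It also compares the averaged precision with the pitcher renewal rate of 0.35 quoted in the text.

```python
from statistics import mean

# Per-fold values for the LR pitcher model, copied from Table 1.
lr_pitcher_folds = {
    "ACC": [0.667, 0.476, 0.619, 0.7, 0.5],
    "AUC": [0.529, 0.529, 0.558, 0.615, 0.438],
    "PRE": [0.571, 0.2, 0.5, 0.667, 0.25],
}
# Averages as printed in the Average row of Table 1.
reported_average = {"ACC": 0.592, "AUC": 0.533, "PRE": 0.438}

recomputed = {name: mean(values) for name, values in lr_pitcher_folds.items()}
for name, value in recomputed.items():
    gap = abs(value - reported_average[name])
    print(f"{name}: recomputed {value:.4f}, reported {reported_average[name]:.3f}, gap {gap:.4f}")

# Precision relative to the share of pitchers who renewed (0.35 in the text).
renewal_rate = 0.35
print("precision above renewal rate:", recomputed["PRE"] > renewal_rate)
```

The small gaps are what one would expect if the averages were computed from unrounded fold results.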
3.1.2. Preprocessing—standardization
The results of modeling with standardized MiLB data by season are presented in Table 2. Generally, performance decreased for pitchers, except in the LR model. Similarly, the performance of batters declined as well. However, the precision value for batters was 0.477, which was higher than the contract renewal rate of 0.37 for batters.
Table 2.
Prediction Results based on Standardized MiLB Records (Accuracy: ACC; Precision: PRE).
Position | CV fold | LR ACC | LR AUC | LR PRE | RF ACC | RF AUC | RF PRE | XGBoost ACC | XGBoost AUC | XGBoost PRE | LightGBM ACC | LightGBM AUC | LightGBM PRE | CatBoost ACC | CatBoost AUC | CatBoost PRE
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Pitcher | 1 | 0.333 | 0.337 | 0.2 | 0.429 | 0.361 | 0.333 | 0.476 | 0.337 | 0.333 | 0.381 | 0.25 | 0.273 | 0.429 | 0.337 | 0.3
Pitcher | 2 | 0.524 | 0.5 | 0.25 | 0.619 | 0.471 | 0.4 | 0.667 | 0.548 | 0.571 | 0.762 | 0.587 | 0.714 | 0.667 | 0.510 | 0.6
Pitcher | 3 | 0.333 | 0.356 | 0.2 | 0.571 | 0.312 | 0.333 | 0.524 | 0.404 | 0.25 | 0.476 | 0.462 | 0 | 0.571 | 0.385 | 0.333
Pitcher | 4 | 0.5 | 0.297 | 0.286 | 0.55 | 0.269 | 0.333 | 0.55 | 0.395 | 0.333 | 0.65 | 0.505 | 0.5 | 0.6 | 0.495 | 0
Pitcher | 5 | 0.5 | 0.5 | 0.333 | 0.55 | 0.417 | 0 | 0.6 | 0.458 | 0.5 | 0.45 | 0.406 | 0.286 | 0.55 | 0.512 | 0.4
Pitcher | Average | 0.438 | 0.398 | 0.253 | 0.544 | 0.367 | 0.280 | 0.563 | 0.428 | 0.400 | 0.544 | 0.442 | 0.355 | 0.563 | 0.453 | 0.327
Batter | 1 | 0.688 | 0.667 | 0.6 | 0.438 | 0.408 | 0 | 0.375 | 0.383 | 0.167 | 0.5 | 0.517 | 0.333 | 0.5 | 0.333 | 0
Batter | 2 | 0.6 | 0.611 | 0.5 | 0.533 | 0.472 | 0.333 | 0.533 | 0.519 | 0.333 | 0.4 | 0.546 | 0 | 0.467 | 0.519 | 0
Batter | 3 | 0.533 | 0.593 | 0.333 | 0.533 | 0.481 | 0.333 | 0.467 | 0.556 | 0.333 | 0.6 | 0.5 | 0.5 | 0.533 | 0.556 | 0.333
Batter | 4 | 0.8 | 0.72 | 0.75 | 0.8 | 0.97 | 1 | 0.733 | 0.8 | 0.667 | 0.8 | 0.7 | 1 | 0.8 | 0.94 | 1
Batter | 5 | 0.467 | 0.54 | 0.2 | 0.467 | 0.59 | 0.25 | 0.6 | 0.7 | 0.429 | 0.533 | 0.6 | 0.25 | 0.467 | 0.56 | 0.2
Batter | Average | 0.618 | 0.627 | 0.477 | 0.554 | 0.584 | 0.383 | 0.542 | 0.591 | 0.386 | 0.567 | 0.573 | 0.417 | 0.553 | 0.581 | 0.307
3.2. Re-contracting prediction based on MLB records
For the MLB pitcher data, we used 20 independent variables obtained from Baseball Savant. Of the 123 players, MLB records for only 66 players were available and were included in the analysis. For the batter data, we used 18 independent variables obtained from Baseball Savant. Of the 77 players, 45 had MLB records and were included in the analysis.
3.2.1. Nonstandardization
The results of modeling using the MLB data without standardization preprocessing are listed in Table 3. For pitchers, the results did not differ significantly from those obtained with the MiLB data. However, for batters, the performance was lower than that obtained with the MiLB data as independent variables.
Table 3.
Prediction Results based on Nonstandardized MLB Records (Accuracy: ACC; Precision: PRE).
Position | CV fold | LR ACC | LR AUC | LR PRE | RF ACC | RF AUC | RF PRE | XGBoost ACC | XGBoost AUC | XGBoost PRE | LightGBM ACC | LightGBM AUC | LightGBM PRE | CatBoost ACC | CatBoost AUC | CatBoost PRE
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Pitcher | 1 | 0.286 | 0.311 | 0 | 0.714 | 0.311 | 0 | 0.714 | 0.622 | 0.667 | 0.643 | 0.578 | 0 | 0.5 | 0.556 | 0
Pitcher | 2 | 0.462 | 0.425 | 0.25 | 0.538 | 0.5 | 0.25 | 0.462 | 0.425 | 0.25 | 0.692 | 0.575 | 0.667 | 0.462 | 0.425 | 0.25
Pitcher | 3 | 0.231 | 0.15 | 0 | 0.538 | 0.275 | 0 | 0.538 | 0.275 | 0.333 | 0.462 | 0.525 | 0.25 | 0.615 | 0.4 | 0.5
Pitcher | 4 | 0.462 | 0.375 | 0 | 0.615 | 0.575 | 0 | 0.692 | 0.5 | 0.667 | 0.462 | 0.425 | 0 | 0.692 | 0.625 | 1
Pitcher | 5 | 0.615 | 0.556 | 0.333 | 0.462 | 0.389 | 0 | 0.615 | 0.417 | 0.4 | 0.538 | 0.556 | 0.25 | 0.385 | 0.333 | 0
Pitcher | Average | 0.411 | 0.363 | 0.117 | 0.574 | 0.410 | 0.05 | 0.604 | 0.448 | 0.463 | 0.559 | 0.532 | 0.233 | 0.531 | 0.468 | 0.35
Batter | 1 | 0.667 | 0.25 | 0 | 0.556 | 0.139 | 0 | 0.444 | 0.25 | 0 | 0.667 | 0.5 | 0 | 0.556 | 0.25 | 0
Batter | 2 | 0.444 | 0.556 | 0.25 | 0.333 | 0.25 | 0.25 | 0.333 | 0.333 | 0 | 0.667 | 0.5 | 0 | 0.444 | 0.167 | 0.25
Batter | 3 | 0.556 | 0.722 | 0.333 | 0.556 | 0.556 | 0 | 0.667 | 0.5 | 0.5 | 0.667 | 0.5 | 0 | 0.556 | 0.5 | 0
Batter | 4 | 0.444 | 0.278 | 0 | 0.556 | 0.333 | 0 | 0.667 | 0.444 | 0.5 | 0.667 | 0.5 | 0 | 0.556 | 0.444 | 0
Batter | 5 | 0.667 | 0.611 | 0.5 | 0.667 | 0.611 | 0.5 | 0.444 | 0.556 | 0.25 | 0.667 | 0.5 | 0 | 0.667 | 0.5 | 0.5
Batter | Average | 0.556 | 0.483 | 0.217 | 0.533 | 0.378 | 0.15 | 0.511 | 0.417 | 0.25 | 0.667 | 0.5 | 0 | 0.556 | 0.372 | 0.15
3.2.2. Preprocessing—standardization
The results of modeling using MLB data standardized by season are presented in Table 4. Even after standardization, the results did not differ significantly from those obtained with the MiLB data. However, for batters, the LR model achieved a precision of 0.6, which was higher than in the other cases.
Table 4.
Prediction Results based on Standardized MLB Records (Accuracy: ACC; Precision: PRE).
Position | CV fold | LR ACC | LR AUC | LR PRE | RF ACC | RF AUC | RF PRE | XGBoost ACC | XGBoost AUC | XGBoost PRE | LightGBM ACC | LightGBM AUC | LightGBM PRE | CatBoost ACC | CatBoost AUC | CatBoost PRE
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Pitcher | 1 | 0.5 | 0.467 | 0.333 | 0.643 | 0.689 | 0 | 0.786 | 0.711 | 1 | 0.643 | 0.422 | 0 | 0.756 | 0.711 | 1
Pitcher | 2 | 0.538 | 0.375 | 0.4 | 0.615 | 0.475 | 0.333 | 0.615 | 0.525 | 0.5 | 0.462 | 0.4 | 0.25 | 0.615 | 0.525 | 0.5
Pitcher | 3 | 0.538 | 0.7 | 0 | 0.692 | 0.725 | 1 | 0.615 | 0.55 | 0.5 | 0.615 | 0.7 | 0.5 | 0.615 | 0.65 | 0.5
Pitcher | 4 | 0.462 | 0.2 | 0 | 0.462 | 0.138 | 0.25 | 0.308 | 0.15 | 0 | 0.308 | 0.025 | 0 | 0.385 | 0.2 | 0
Pitcher | 5 | 0.385 | 0.528 | 0.167 | 0.538 | 0.514 | 0 | 0.538 | 0.417 | 0 | 0.615 | 0.528 | 0.333 | 0.538 | 0.417 | 0
Pitcher | Average | 0.485 | 0.454 | 0.18 | 0.590 | 0.508 | 0.317 | 0.573 | 0.471 | 0.4 | 0.529 | 0.415 | 0.217 | 0.588 | 0.501 | 0.4
Batter | 1 | 0.444 | 0.361 | 0 | 0.556 | 0.361 | 0 | 0.556 | 0.194 | 0 | 0.667 | 0.5 | 0 | 0.556 | 0.25 | 0
Batter | 2 | 0.889 | 0.722 | 1 | 0.667 | 0.5 | 0.5 | 0.556 | 0.667 | 0.4 | 0.677 | 0.5 | 0 | 0.667 | 0.556 | 0.5
Batter | 3 | 0.778 | 0.778 | 1 | 0.667 | 0.556 | 0 | 0.667 | 0.389 | 0 | 0.667 | 0.5 | 0 | 0.555 | 0.389 | 0
Batter | 4 | 0.778 | 0.333 | 1 | 0.556 | 0.556 | 0.5 | 0.444 | 0.5 | 0 | 0.667 | 0.5 | 0 | 0.667 | 0.444 | 0
Batter | 5 | 0.333 | 0.278 | 0 | 0.556 | 0.417 | 0 | 0.444 | 0.444 | 0 | 0.667 | 0.5 | 0 | 0.556 | 0.444 | 0
Batter | Average | 0.644 | 0.494 | 0.6 | 0.6 | 0.478 | 0.2 | 0.533 | 0.439 | 0.08 | 0.667 | 0.5 | 0 | 0.6 | 0.417 | 0.1
3.3. Re-contracting prediction based on face image optimization
The VGGFace model supports various embedding structures depending on the model parameter. Table 5 lists the modeling results for pitchers and batters for each model parameter value. Using ResNet50 or SENet50 as the model parameter generated 2048 features, whereas using VGG16 generated 512 features. Analyses were performed for all 123 pitchers and 77 batters. For both pitchers and batters, no significant differences in accuracy or AUC were found across the model parameters. However, for pitchers, ResNet50 achieved higher performance on all three evaluation metrics than the models using only MiLB or MLB data. Similarly, for batters, ResNet50 outperformed the other parameter values.
Table 5.
Prediction Results based on Face Image (Accuracy: ACC; Precision: PRE).
Position | Model parameter | LR ACC | LR AUC | LR PRE | RF ACC | RF AUC | RF PRE | XGBoost ACC | XGBoost AUC | XGBoost PRE | LightGBM ACC | LightGBM AUC | LightGBM PRE | CatBoost ACC | CatBoost AUC | CatBoost PRE
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Pitcher | ResNet50 (2048 features) | 0.659 | 0.556 | 0.2 | 0.642 | 0.582 | 0.667 | 0.586 | 0.585 | 0.35 | 0.634 | 0.603 | 0.476 | 0.667 | 0.621 | 0.58
Pitcher | SENet50 (2048 features) | 0.659 | 0.443 | 0.2 | 0.610 | 0.561 | 0.433 | 0.552 | 0.465 | 0.347 | 0.576 | 0.478 | 0.317 | 0.618 | 0.556 | 0.467
Pitcher | VGG16 (512 features) | 0.651 | 0.538 | 0 | 0.634 | 0.579 | 0 | 0.634 | 0.526 | 0.538 | 0.578 | 0.507 | 0.358 | 0.659 | 0.516 | 0.2
Batter | ResNet50 (2048 features) | 0.650 | 0.569 | 0.1 | 0.650 | 0.489 | 0.1 | 0.651 | 0.567 | 0.32 | 0.586 | 0.619 | 0.409 | 0.571 | 0.536 | 0.233
Batter | SENet50 (2048 features) | 0.624 | 0.565 | 0 | 0.624 | 0.585 | 0 | 0.546 | 0.443 | 0.301 | 0.634 | 0.517 | 0.481 | 0.624 | 0.590 | 0.2
Batter | VGG16 (512 features) | 0.650 | 0.5 | 0 | 0.650 | 0.484 | 0.3 | 0.610 | 0.5 | 0.38 | 0.557 | 0.463 | 0.280 | 0.611 | 0.467 | 0
3.4. Performance improvement techniques—image data fusion
Following the research by Girdhar et al. [24], we attempted to enhance prediction performance by merging image and numerical data. Based on the results listed in Table 5, ResNet50 exhibited better performance than the other model parameters; therefore, ResNet50 was selected. The analysis included 66 pitchers and 45 batters with MLB data. The variable subsets are defined as follows:
• Face + Major: the 2048 features obtained from the embedding results were combined with the MLB (major league) data variables (pitchers: 20; batters: 18).
• Face PCA + Major: the 2048 features obtained from the embedding results were dimensionally reduced using PCA. For pitchers, the first and second principal components cumulatively accounted for 84 % of the variance; for batters, they accounted for 72 %. Therefore, the two principal components derived from the facial data and the MLB data variables were merged as independent variables.
• Face kernel PCA + Major: as with PCA, both pitcher and batter data underwent dimension reduction using kernel PCA with two principal components, which were then combined with the MLB data variables as independent variables.
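The two steps described in the list above, choosing the number of principal components from their cumulative contribution and then appending the retained components to the numerical variables, can be sketched as follows. The eigenvalues are hypothetical and were chosen only so that the first two components reach 84 %, mirroring the pitcher case. The helper names are illustrative, and the PCA fit itself is not reproduced here.

```python
def cumulative_explained_variance(eigenvalues):
    """Cumulative share of variance explained by components, largest first."""
    total = sum(eigenvalues)
    running, shares = 0.0, []
    for value in sorted(eigenvalues, reverse=True):
        running += value
        shares.append(running / total)
    return shares


def merge_features(face_components, numeric_features):
    """Concatenate reduced face features with a player's numerical variables."""
    return list(face_components) + list(numeric_features)


# Hypothetical eigenvalues of the face-embedding covariance matrix.
shares = cumulative_explained_variance([6.0, 2.4, 1.0, 0.6])
print([round(s, 2) for s in shares])

# Two retained face components plus 20 MLB pitcher variables (placeholder zeros).
pitcher_row = merge_features([0.31, -1.12], [0.0] * 20)
print(len(pitcher_row))
```

With two components, the Face PCA + Major subset has 22 columns for pitchers and 20 for batters, compared with 2068 and 2066 columns when the raw 2048-dimensional embedding is used.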
The modeling results are listed in Table 6. For both pitchers and batters, no significant difference between the PCA and kernel PCA was found. For pitchers, no major changes in accuracy or AUC were found when image data were integrated; however, precision improved in the RF and XGBoost models. For batters, when the image and numerical data were directly merged without dimension reduction, significant performance improvements were observed in the accuracy, AUC, and precision metrics in the XGBoost model.
Table 6.
Prediction results with combined image and numerical data (accuracy: ACC; precision: PRE).
Position | Model | Face + Major ACC | Face + Major AUC | Face + Major PRE | Face PCA + Major ACC | Face PCA + Major AUC | Face PCA + Major PRE | Face kernel PCA + Major ACC | Face kernel PCA + Major AUC | Face kernel PCA + Major PRE
---|---|---|---|---|---|---|---|---|---|---
Pitcher | LR | 0.560 | 0.466 | 0.239 | 0.484 | 0.409 | 0.187 | 0.485 | 0.409 | 0.187
Pitcher | RF | 0.635 | 0.487 | 0.207 | 0.559 | 0.415 | 0.407 | 0.544 | 0.563 | 0.567
Pitcher | XGBoost | 0.634 | 0.478 | 0.5 | 0.499 | 0.478 | 0.267 | 0.499 | 0.478 | 0.267
Pitcher | LightGBM | 0.530 | 0.311 | 0 | 0.499 | 0.388 | 0.133 | 0.499 | 0.388 | 0.133
Pitcher | CatBoost | 0.604 | 0.514 | 0.25 | 0.573 | 0.501 | 0.35 | 0.573 | 0.501 | 0.35
Batter | LR | 0.633 | 0.553 | 0.467 | 0.633 | 0.58 | 0.467 | 0.633 | 0.58 | 0.467
Batter | RF | 0.633 | 0.559 | 0.267 | 0.703 | 0.568 | 0.52 | 0.658 | 0.548 | 0.5
Batter | XGBoost | 0.722 | 0.742 | 0.533 | 0.544 | 0.538 | 0.367 | 0.544 | 0.538 | 0.367
Batter | LightGBM | 0.658 | 0.5 | 0 | 0.658 | 0.5 | 0 | 0.658 | 0.5 | 0
Batter | CatBoost | 0.678 | 0.624 | 0.5 | 0.658 | 0.500 | 0.3 | 0.658 | 0.500 | 0.3
3.5. Re-contracting prediction based on KBO records
Apart from teams scouting for players, foreign players who aspire to play in KBO may also need to know what level of performance is expected to secure contract renewal. To address this, we merged the records provided by the official KBO website and Statiz to predict contract renewal. For pitchers, we analyzed 66 players with MLB records. Considering that pitchers are heavily influenced by both the offensive and defensive capabilities of the team, we used four independent variables to assess their individual abilities: ERA, WHIP, FIP, and WAR (see Tables 10 and 11 for descriptions of these and the other metrics used in this study) [27]. For batters, we analyzed 45 players with MLB records and used 27 independent variables obtained from Statiz. Because KBO, unlike MLB and MiLB, does not provide season-by-season records for individual players, we were unable to perform season-based standardization. Therefore, we performed predictions without standardization. Additionally, we experimented with incorporating image data into KBO data to evaluate whether it would enhance the performance. The modeling results are presented in Table 7. The analysis revealed that the performance of pitchers on all three metrics declined when image data were added to KBO entry-season data. For batters, although accuracy and precision decreased, AUC exhibited a slight increase in some models.
Table 7.
Prediction results with combined image and KBO numerical data (accuracy: ACC; precision: PRE).
Position | Model | ACC (KBO) | AUC (KBO) | PRE (KBO) | ACC (KBO + Face) | AUC (KBO + Face) | PRE (KBO + Face) | ACC (KBO + Face PCA) | AUC (KBO + Face PCA) | PRE (KBO + Face PCA) |
---|---|---|---|---|---|---|---|---|---|---|
Pitcher | LR | 0.773 | 0.869 | 0.783 | 0.700 | 0.763 | 0.6 | 0.711 | 0.789 | 0.613 |
Pitcher | RF | 0.758 | 0.865 | 0.704 | 0.652 | 0.674 | 0.267 | 0.756 | 0.769 | 0.613 |
Pitcher | XGBoost | 0.744 | 0.754 | 0.683 | 0.727 | 0.700 | 0.67 | 0.710 | 0.751 | 0.590 |
Pitcher | LightGBM | 0.713 | 0.869 | 0.679 | 0.621 | 0.684 | 0.467 | 0.711 | 0.757 | 0.567 |
Pitcher | CatBoost | 0.744 | 0.874 | 0.690 | 0.696 | 0.746 | 0.533 | 0.726 | 0.777 | 0.583 |
Batter | LR | 0.911 | 0.933 | 0.933 | 0.888 | 0.978 | 0.72 | 0.889 | 0.967 | 0.72 |
Batter | RF | 0.844 | 0.933 | 0.77 | 0.822 | 0.928 | 0.683 | 0.867 | 0.956 | 0.733 |
Batter | XGBoost | 0.800 | 0.894 | 0.783 | 0.822 | 0.944 | 0.75 | 0.822 | 0.939 | 0.75 |
Batter | LightGBM | 0.667 | 0.5 | 0 | 0.667 | 0.5 | 0 | 0.667 | 0.5 | 0 |
Batter | CatBoost | 0.800 | 0.922 | 0.75 | 0.844 | 0.933 | 0.783 | 0.844 | 0.956 | 0.783 |
3.6. Performance improvement techniques—combining KBO data, injury data, and KBO team rankings for comprehensive analysis
For pitchers, considering their higher risk of injuries [28], we included injury history and team ranking as additional variables for predictions (Table 6). Because batters have a lower risk of injury than pitchers, we added only the team ranking as an independent variable for their analysis. The modeling results are presented in Tables 8 and 9. Team ranking did not have a significant impact on predictions for either pitchers or batters. However, when injury history data were added for pitchers, all three metrics (accuracy, AUC, and precision) improved for the LR, RF, XGBoost, and CatBoost models, whereas LightGBM changed little. In particular, the precision of the XGBoost model rose substantially, from 0.683 to 0.96, the highest precision among the models.
Table 8.
Prediction results with combined injury history of pitchers and KBO team ranking (accuracy: ACC; precision: PRE).
Position | Model | ACC (KBO) | AUC (KBO) | PRE (KBO) | ACC (KBO + Team ranking) | AUC (KBO + Team ranking) | PRE (KBO + Team ranking) | ACC (KBO + injury) | AUC (KBO + injury) | PRE (KBO + injury) |
---|---|---|---|---|---|---|---|---|---|---|
Pitcher | LR | 0.773 | 0.869 | 0.783 | 0.788 | 0.849 | 0.783 | 0.850 | 0.924 | 0.827 |
Pitcher | RF | 0.758 | 0.865 | 0.704 | 0.759 | 0.855 | 0.763 | 0.879 | 0.942 | 0.870 |
Pitcher | XGBoost | 0.744 | 0.754 | 0.683 | 0.698 | 0.758 | 0.640 | 0.924 | 0.869 | 0.96 |
Pitcher | LightGBM | 0.713 | 0.869 | 0.679 | 0.713 | 0.828 | 0.667 | 0.713 | 0.885 | 0.680 |
Pitcher | CatBoost | 0.744 | 0.874 | 0.690 | 0.729 | 0.837 | 0.677 | 0.924 | 0.959 | 0.920 |
Table 9.
Prediction results with combined batters’ KBO team ranking and image (accuracy: ACC; precision: PRE).
Position | Model | ACC (KBO) | AUC (KBO) | PRE (KBO) | ACC (KBO + Team ranking) | AUC (KBO + Team ranking) | PRE (KBO + Team ranking) | ACC (KBO + Face) | AUC (KBO + Face) | PRE (KBO + Face) |
---|---|---|---|---|---|---|---|---|---|---|
Batter | LR | 0.911 | 0.933 | 0.933 | 0.911 | 0.933 | 0.933 | 0.888 | 0.978 | 0.72 |
Batter | RF | 0.844 | 0.933 | 0.77 | 0.822 | 0.922 | 0.77 | 0.822 | 0.928 | 0.683 |
Batter | XGBoost | 0.800 | 0.894 | 0.783 | 0.822 | 0.901 | 0.833 | 0.822 | 0.944 | 0.75 |
Batter | LightGBM | 0.667 | 0.5 | 0 | 0.667 | 0.5 | 0 | 0.667 | 0.5 | 0 |
Batter | CatBoost | 0.800 | 0.922 | 0.75 | 0.800 | 0.922 | 0.75 | 0.844 | 0.933 | 0.783 |
3.7. Feature importance
Performance analysts and scouts at KBO rely on MLB and MiLB performances and their own insights to scout foreign players. When contract renewals were predicted from MLB and MiLB data, the MLB data yielded better results. The major difference between the MiLB and MLB data lies in the availability of tracking data. Tracking data are records of player and ball movements during games, collected primarily using technologies such as video, radar, and optical image sensors. These data are used to analyze player movements and strategies and to evaluate game performance [29]. Online platforms such as Baseball Savant have provided advanced statistics, analyses, and visualization tools for MLB games since 2015, including access to tracking data. Such tracking data provide the latest performance metrics for players and serve as a valuable source for evaluating injury risk [30]. When contract renewals for pitchers were predicted using MLB data, the XGBoost model exhibited higher precision than the other models, whereas the CatBoost model exhibited a higher AUC than the other models. SHAP was used to explore the important features of these two models, as shown in Fig. 2.
Fig. 2.
Feature importance using SHAP: (a) XGBoost model and (b) CatBoost model.
For the XGBoost model, features such as exit velocity, ERA, launch angle, and XWOBA were highly important (Fig. 2(a)). For the CatBoost model, exit velocity, WOBA, ERA, and balance were highly important (Fig. 2(b)). In both models, exit velocity emerged as the most important feature. Exit velocity is the speed, in miles per hour, at which the ball leaves the bat after the batter makes contact with a pitch. The metric is used primarily to evaluate the hitting power and skill of a batter: the harder the hit, the higher the exit velocity, and a high exit velocity typically means that the ball travels faster and farther. Therefore, for pitchers, a lower exit velocity allowed is desirable.
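A feature ranking of this kind is commonly obtained by averaging the absolute SHAP values of each feature over all players. The sketch below assumes that Fig. 2 uses this usual global summary, which the text does not state explicitly; the SHAP values and feature names are hypothetical, and the explainer that would produce them is not shown.

```python
def mean_abs_importance(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value, largest first."""
    n = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            # The sign shows the direction of the effect; importance uses magnitude only.
            totals[j] += abs(value)
    importance = {name: total / n for name, total in zip(feature_names, totals)}
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-player SHAP values (one row per player, one column per feature).
features = ["exit_velocity", "era", "launch_angle"]
shap_rows = [
    [0.30, -0.10, 0.05],
    [-0.20, 0.15, -0.05],
    [0.40, -0.05, 0.02],
]

for name, score in mean_abs_importance(shap_rows, features):
    print(f"{name}: {score:.3f}")
```

Because the magnitudes are averaged, a feature can rank first even when it pushes some predictions toward renewal and others away from it.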
Histograms (a)–(g) in Fig. 3 show the MLB exit velocity metric, by season, for foreign pitchers who were initially contracted in KBO. The white line in the middle of each histogram represents the season average, whereas the red lines mark the values of the re-contracted foreign players. Numbers within parentheses indicate rankings. Histogram (h) shows the 2022 MLB season records of the foreign players playing in KBO during the 2023 season.
Fig. 3.
(a)–(g) Distributions of MLB exit velocity metric for foreign pitchers re-contracted during the 2015–2021 KBO acquisition seasons and (h) the 2022 season MLB exit velocity metric of foreign players playing in the KBO in 2023.
Although the performances of some foreign pitchers (red lines in histograms) were higher than the MLB average (white lines in histograms) in certain seasons, no significant difference in season-wise performances was found. Overall, it is evident from the histograms that even when the exit velocity was lower than average, foreign pitchers were still able to secure contract renewals in KBO.
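The comparison read off the histograms can be sketched as follows: compute the season average, rank pitchers by exit velocity allowed, and flag whether each re-contracted pitcher sits below the average. The pitcher names and values are hypothetical, and the ranking rule (rank 1 is the lowest exit velocity within the group) is an assumption about how the numbers in parentheses were assigned.

```python
from statistics import mean


def summarize_recontracted(exit_velocity, recontracted):
    """Compare re-contracted pitchers with the season average exit velocity.

    exit_velocity: dict mapping pitcher name -> average exit velocity allowed (mph).
    recontracted: names of pitchers whose KBO contracts were renewed.
    """
    season_avg = mean(list(exit_velocity.values()))
    # Rank 1 is the lowest exit velocity allowed (the best outcome for a pitcher).
    ordered = sorted(exit_velocity, key=exit_velocity.get)
    rank = {name: i + 1 for i, name in enumerate(ordered)}
    summary = {
        name: {
            "exit_velocity": exit_velocity[name],
            "rank": rank[name],
            "below_average": exit_velocity[name] < season_avg,
        }
        for name in recontracted
    }
    return season_avg, summary


# Hypothetical pitchers and values, for illustration only.
season = {"Pitcher A": 86.0, "Pitcher B": 88.0, "Pitcher C": 90.0, "Pitcher D": 88.0}
avg, summary = summarize_recontracted(season, ["Pitcher A", "Pitcher C"])
print(avg)
print(summary)
```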
4. Discussion
In this study, we used accuracy, AUC, and precision as performance evaluation metrics. Precision is an evaluation metric commonly used in machine learning and statistics. It represents the proportion of samples predicted as positive by a classification model that are actually positive. Precision is calculated as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
where TP is the number of true positives (cases correctly predicted as positive) and FP is the number of false positives (cases incorrectly predicted as positive). In this study, precision represents the proportion of players predicted to have their contracts renewed whose contracts were actually renewed. The actual contract renewal rates, which serve as the baseline for comparison, were 35 % for pitchers and 37 % for batters.
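The three metrics can be written out in a few lines of plain Python to make their definitions concrete. This is an illustrative sketch rather than the evaluation code used in the study: the labels and scores are hypothetical, the 0.5 decision threshold is assumed, and precision is taken to be 0 when no player is predicted positive (a common convention).

```python
def accuracy(y_true, y_pred):
    """Share of players whose renewal outcome was predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred):
    """TP / (TP + FP); returns 0.0 when nothing is predicted positive (assumed convention)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = contract renewed) and model scores for six players.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.6, 0.7, 0.2, 0.4, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(accuracy(y_true, y_pred), precision(y_true, y_pred), auc(y_true, scores))

# A model that gives every player the same score and never predicts renewal.
flat_true = [1] * 15 + [0] * 30          # hypothetical class counts
flat_scores = [0.0] * 45
flat_pred = [0] * 45
print(accuracy(flat_true, flat_pred), precision(flat_true, flat_pred), auc(flat_true, flat_scores))
```

The second case yields an accuracy equal to the share of non-renewed players, a precision of 0, and an AUC of 0.5. The LightGBM batter rows in Table 7 show the same pattern (0.667, 0.5, 0), which would be consistent with a model that never predicts renewal; this reading is an inference from the table and is not stated in the study.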
When using MiLB data, a prediction precision of 0.463 was achieved for pitchers using the XGBoost model; for batters, a precision of 0.507 was achieved using the LR model. These figures are approximately 11 and 13 percentage points higher, respectively, than the actual contract renewal rates. When the MLB data were used to predict contract renewal, the XGBoost model yielded a precision of 0.463 for pitchers, whereas the LR model yielded a precision of 0.6 for batters, which is 23 percentage points higher than the actual contract renewal rate for batters. When only image data were used, the precision was 0.580 for pitchers with the CatBoost model and 0.430 for batters with the XGBoost model. When the image and numerical data were merged, the XGBoost model achieved precisions of 0.5 and 0.533 for pitchers and batters, respectively. For pitchers, this is approximately 4 percentage points higher than the precision obtained using only numerical data and 15 percentage points higher than the actual contract renewal rate. Overall, the CatBoost, XGBoost, and LR models performed well, and the LR model performed best with KBO data. CatBoost trained considerably more slowly than LR, whereas LightGBM trained quickly but performed worse. Considering both performance and training speed, the XGBoost model was generally satisfactory.
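The comparisons with the actual renewal rates are simple differences, which the following sketch reproduces from the figures quoted in the text. It assumes that the reported increases are differences in percentage points; the dictionary layout and the function name are illustrative.

```python
# Actual contract renewal rates quoted in the text.
BASELINE = {"pitcher": 0.35, "batter": 0.37}

# Precision values quoted in the text, keyed by (data source, position).
REPORTED_PRECISION = {
    ("MiLB", "pitcher"): 0.463,
    ("MiLB", "batter"): 0.507,
    ("MLB", "pitcher"): 0.463,
    ("MLB", "batter"): 0.6,
    ("MLB + image", "pitcher"): 0.5,
}


def lift_in_points(model_precision, baseline):
    """Difference between model precision and baseline rate, in percentage points."""
    return round((model_precision - baseline) * 100, 1)


for (source, position), value in REPORTED_PRECISION.items():
    lift = lift_in_points(value, BASELINE[position])
    print(f"{source:12s} {position:8s} precision={value:.3f} lift={lift:+.1f} points")
```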
In KBO, the salary of newly recruited players is capped at approximately $780,000 [31]. When a team releases a foreign player during a season, it incurs a loss of roughly $2 million. Therefore, increasing the prediction precision has significant economic benefits for the team.
Using Fig. 3, we examined the distribution of MLB exit velocity, the feature that SHAP identified as most important, among the re-contracted players. In the 2015 season, with one exception, the three re-contracted players recorded lower (better) than average values (Fig. 3(a)). This indicates that the foreign players who secured contract renewals in KBO performed at a high level, which in turn suggests that the level of KBO was high. South Korea won the Premier 12 tournament in 2015. Pat Dean, whose record in the 2016 MLB season was higher than the average (Fig. 3(b)), performed well in KBO in 2017 and secured contract renewal. In 2017, South Korea, which had players without outstanding records, was eliminated in the qualifying round of the World Baseball Classic, finishing 11th. In 2018, scouting was based on player performance in the 2017 MLB season. The 2017 season records of the foreign players who were re-contracted in KBO in 2018 were not significantly lower than or different from the average (Fig. 3(c)). Except for Seth Frankoff (1), who is difficult to evaluate because he threw only 37 pitches in the 2017 season, this suggests a high level of performance in KBO. In 2018, South Korea won the Asian Games. From 2018 onward, a significant number of foreign pitchers with below-average MLB performance were among those re-contracted (Fig. 3(d)–(g)). In 2020, most players re-contracted in KBO had values higher than the 2019 MLB average (Fig. 3(e)), which suggests that the level of KBO was relatively low in this season. Indeed, at the 2020 Tokyo Olympics, South Korea finished fourth among the six participating countries. The record of Felix Pena (Fig. 3(g)) cannot be considered favorable even though his value was lower than average, because he threw only 65 pitches in the 2021 season.
Fig. 3(h) depicts the 2022 MLB performance of foreign pitchers recruited in KBO for the 2023 season. As of July 2023, at the end of the first half of the KBO season, Erick Fedde (4) held the top position in wins and ERA and the second position in WHIP [3]. The fact that lower-ranked MLB players hold top positions in KBO suggests that the level of KBO is declining. This is consistent with South Korea finishing 12th in the World Baseball Classic held in March 2023 and failing to advance for the third consecutive time. Therefore, this post hoc analysis suggests that the important features used in predicting contract renewal decisions are also related to performance in international competitions, such as whether the national team advances to the main tournament.
As seen in Figs. 2 and 3, the exit velocity metric has a significant impact on contract renewal decisions and can also be used to infer the baseball level of the respective country. Exit velocity evaluates the pure ability of a pitcher, independent of the team's influence. Therefore, it can also be applied by other baseball leagues when recruiting foreign players. In MLB, exit velocity is considered an indicator of how well a pitcher limits hard contact and is regarded as an excellent pitching evaluation metric [32]. The results of this study are consistent with the view that such tracking data mark the beginning of a new era in baseball fandom [33].
5. Conclusions
In this study, we used MLB, MiLB, KBO, and image data to predict the contract renewal of foreign players. We used accuracy, AUC, and precision as evaluation metrics, focusing on precision during the analysis. Because seasonal means and variances differ, we preprocessed the MLB and MiLB data by standardizing them by season. We used the LR, RF, XGBoost, LightGBM, and CatBoost classification models with a fixed random seed and performed fivefold cross-validation. Using only MiLB data resulted in low accuracy and AUC; however, the precision for both pitchers and batters exceeded the actual contract renewal rates by more than 11 percentage points. When using MLB data, although accuracy and AUC were still low, a precision of 0.6 was achieved for batters after season-based standardization, which was 0.23 higher than the actual contract renewal rate. For batters, when the numerical and image data were merged, the XGBoost model achieved an accuracy, AUC, and precision of 0.722, 0.742, and 0.533, respectively, an improvement of 0.15, 0.27, and 0.13, respectively, over XGBoost using only numerical data. Exploratory data analysis using SHAP identified exit velocity as an important feature for predicting contract renewal. This metric can also be used to predict the performance of players in international baseball competitions. In addition, a pitcher's injury history had a significant influence on prediction performance. Although the small amount of available data is a limitation, the insights obtained from this study can help teams scouting for foreign players reduce scouting failures and financial losses.
Ethics declarations
Informed consent was not required for this study because no human or animal subjects or samples were involved in the study.
Funding
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1A2C1092808). This study was also supported by the Soonchunhyang University Research Fund.
Data availability statement
Has data associated with your study been deposited into a publicly available repository?
Yes. All data from this study are in the article, code, supplementary material, or referenced. Datasets are available on GitHub at https://github.com/Ptaeshin/Prediction-of-Re-signing-Foreign-Players/tree/master or upon request from the author.
CRediT authorship contribution statement
Taeshin Park: Writing – review & editing, Writing – original draft, Visualization, Software, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Jaeyun Kim: Writing – review & editing, Validation, Supervision, Resources, Project administration, Funding acquisition, Conceptualization.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Appendix A.
Table 10 lists the MiLB and MLB data used in this study to evaluate the performance of pitchers for the prediction of their contract renewal [6], [7], [8].
Table 10.
Description of MiLB and MLB pitcher data used for performance evaluation.
Metric | Abbreviation | Description | |
---|---|---|---|
MiLB |
Wins | W | Number of wins |
Losses | L | Number of losses | |
Saves | SV | Number of saves | |
Games | G | Number of games in which the pitcher appeared | |
Games Started | GS | Number of games the pitcher started | |
Innings Pitched | IP | Number of total innings pitched (.1 represents 1/3 of an inning, .2 represents 2/3 of an inning) | |
Strikeouts per 9 innings | K/9 | Average number of strikeouts per 9 innings | |
Walks per 9 innings | BB/9 | Average number of walks per 9 innings | |
Home Runs per 9 innings | HR/9 | Average number of home runs allowed per 9 innings | |
Batting Average on Balls in Play | BABIP | The rate at which the pitcher allows a hit when the ball is put in play, calculated as (H-HR)/(AB-K-HR + SF) | |
Left On-Base Percentage | LOB % | Percentage of pitcher's own base runners that they strand over the course of a season. Not equal to the LOB column in the box score | |
Ground Ball Percentage | GB % | Percentage of a pitcher's balls in play that are ground balls, calculated as GB/BIP | |
Home Run to Fly Ball Rate | HR/FB | Percentage of a pitcher's fly balls that go for home runs, calculated as HR/FB (even though some HR are line drives) | |
Earned Run Average | ERA | Average number of earned runs a pitcher allows per 9 innings: ((ER*9)/IP) | |
Fielding Independent Pitching | FIP | Estimate of a pitcher's ERA based on strikeouts, walks/HBP, and home runs allowed, assuming league average results on balls in play | |
Expected Fielding Independent Pitching | xFIP | Estimate of a pitcher's ERA based on strikeouts, walks/HBP, and fly balls allowed, assuming league average results on balls in play and home run to fly ball ratio | |
Strikeout to Walk Ratio | K/BB | Strikeouts divided by walks | |
Strikeout Percentage | K% | Frequency with which the pitcher has struck out a batter, calculated as strikeouts divided by total batters faced | |
Walk Percentage | BB% | Frequency with which the pitcher has issued a walk, calculated as walks divided by total batters faced | |
Strikeout Percentage minus Walk Percentage | K-BB% | Percentage differential between K% and BB%, often a better indicator of performance than K/BB, which can be skewed by very low walk rates | |
Batting Average Against | AVG | Rate of hits allowed per at bat, calculated as H/AB | |
Walks Plus Hits per Inning Pitched | WHIP | Average number of base runners allowed via hit or walk per inning | |
Ground Ball to Fly Ball Ratio | GB/FB | Ratio of ground balls a pitcher allows to fly balls, calculated as GB/FB | |
Line Drive Percentage | LD% | Percentage of a pitcher's balls in play that are line drives, calculated as LD/BIP | |
Fly Ball Percentage | FB% | Percentage of a pitcher's balls in play that are fly balls, calculated as FB/BIP | |
Infield Fly Ball Percentage | IFFB% | Percentage of a pitcher's fly balls that were infield fly balls, calculated as IFFB/FB | |
Pull Percentage | Pull% | Percentage of batted balls hit to the pull field | |
Center Percentage | Cent% | Percentage of batted balls hit to the middle of the field | |
Opposite Field Percentage | Oppo% | Percentage of batted balls hit to the opposite field | |
MLB |
Age | Age | Player's age at a specific point in time. |
Pitches | Pitches | Number of total pitches thrown | |
Balls | Balls | Number of total balls thrown | |
Exit Velocity | Exit Velocity | How fast, in miles per hour, a batter hit a ball | |
Maximum Exit Velocity | Max EV | Highest recorded speed at which a batted ball leaves the bat | |
Launch Angle | Angle | How high/low, in degrees, a batter hit a ball | |
Sweet Spot Percentage | Spot % | Percentage of batted-ball events with a launch angle between 8° and 32° | |
Barrels | Barrels | Batted ball with the perfect combination of exit velocity and launch angle | |
Barrel per Plate Appearance | Barrel/PA | Average number of barrels a batter produces per plate appearance. | |
Expected Batting Average | XBA | Likelihood that a batted ball will become a hit | |
Expected Slugging Percentage | XSLG | Estimates slugging percentage a player is expected to have based on the quality and outcomes of their batted balls | |
Weighted On-Base Average | WOBA | Combined different hitting outcomes with weighted values to assess a player's overall offensive contribution | |
Expected Weighted On-base Average | XWOBA | Formulated using exit velocity, launch angle and, on certain types of batted balls, Sprint Speed | |
Expected Weighted On-Base Average on Contact | XWOBACON | Estimates weighted on-base average a player is expected to have solely on balls put into play | |
Hard Contact Percentage | HardHit % | Percentage of hard-hit batted balls | |
Strikeout Percentage | K% | Frequency with which the pitcher has struck out a batter, calculated as strikeouts divided by total batters faced | |
Walk Percentage | BB% | Frequency with which the pitcher has issued a walk, calculated as walks divided by total batters faced | |
Earned Run Average | ERA | Average number of earned runs a pitcher allows per 9 innings. ((ER*9)/IP) | |
Expected Earned Run Avg | xERA | Simple 1:1 translation of xwOBA, converted to the ERA scale | |
KBO |
Earned Run Average | ERA | Average number of earned runs a pitcher allows per 9 innings. ((ER*9)/IP) |
Walks Plus Hits per Inning Pitched | WHIP | Average number of base runners allowed via hit or walk per inning | |
Fielding Independent Pitching | FIP | Estimate of a pitcher's ERA based on strikeouts, walks/HBP, and home runs allowed, assuming league average results on balls in play | |
Wins Above Replacement | WAR | Estimates the number of wins a player has been worth to his team compared to a freely available player such as a minor league free agent based on FIP |
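Several of the rate statistics in Table 10 follow directly from counting statistics. The sketch below applies the formulas given in the table (ERA, WHIP, K%, BB%, and K-BB%) and converts the box-score innings notation (.1 and .2) into thirds of an inning. The function names and the example season line are hypothetical.

```python
def innings_to_float(ip):
    """Convert box-score innings (6.1 = 6 1/3, 6.2 = 6 2/3) to true innings."""
    whole = int(ip)
    thirds = round((ip - whole) * 10)
    if thirds not in (0, 1, 2):
        raise ValueError("fractional part of innings must be .0, .1, or .2")
    return whole + thirds / 3


def era(earned_runs, ip):
    """Earned runs allowed per 9 innings: (ER * 9) / IP."""
    return earned_runs * 9 / innings_to_float(ip)


def whip(hits, walks, ip):
    """Base runners allowed via hit or walk per inning."""
    return (hits + walks) / innings_to_float(ip)


def k_pct(strikeouts, batters_faced):
    """Strikeouts divided by total batters faced."""
    return strikeouts / batters_faced


def bb_pct(walks, batters_faced):
    """Walks divided by total batters faced."""
    return walks / batters_faced


# Hypothetical season line, for illustration only.
line = {"IP": 180.0, "ER": 60, "H": 150, "BB": 30, "SO": 200, "BF": 800}
print("ERA  ", era(line["ER"], line["IP"]))
print("WHIP ", whip(line["H"], line["BB"], line["IP"]))
print("K%   ", k_pct(line["SO"], line["BF"]))
print("BB%  ", bb_pct(line["BB"], line["BF"]))
print("K-BB%", k_pct(line["SO"], line["BF"]) - bb_pct(line["BB"], line["BF"]))
```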
Appendix B.
Table 11 lists the MiLB and MLB data used in this study to evaluate the performance of batters for the prediction of their contract renewal [6], [7], [8].
Table 11.
Description of MiLB and MLB batter data used for performance evaluation.
Metric | Abbreviation | Description | |
---|---|---|---|
MiLB |
Games | G | Number of games in which the player has appeared |
Plate Appearances | PA | Number of times the player has come to the plate | |
At Bats | AB | Number of trips to the plate in which the batter does not walk, get hit by a pitch, sacrifice (fly or bunt), or reach on interference | |
Runs Scored | R | Number of runs scored | |
Hits | H | Number of hits | |
Doubles | 2B | Number of doubles | |
Triples | 3B | Number of triples | |
Home Runs | HR | Number of home runs | |
Runs Batted In | RBI | Number of times a run scores as a result of a batter's plate appearance, not counting situations in which an error caused the run to score, or the batter hit into a double play | |
Stolen Bases | SB | Number of stolen bases | |
Caught Stealing | CS | Number of times caught stealing | |
Walks | BB | Total number of walks (includes IBB) | |
Strikeouts | SO | Number of strikeouts | |
Batting Average | AVG | Rate of hits per at bat, calculated as H/AB | |
On Base Percentage | OBP | Rate at which the batter reaches base, calculated as (H + BB + HBP)/(AB + BB + HBP + SF) | |
Slugging Percentage | SLG | Average number of total bases per at bat, calculated as Total Bases/AB | |
On Base Plus Slugging | OPS | Combination of OBP and SLG, calculated as OBP + SLG | |
Grounded into Double Play | GDP | Number of times the batter hit into a double play | |
Hit By Pitches | HBP | Number of times the batter reached after being hit by a pitch | |
Sacrifice Bunts | SH | Any bunt in which there was a runner on base and less than two outs in which the batter was put out and at least one runner advanced | |
Sacrifice Flies | SF | Number of times a batter's fly out allowed a runner to tag up and score | |
Intentional Walks | IBB | Number of times the batter was intentionally walked | |
MLB |
Age | Age | Player's age at a specific point in time. |
Pitches | Pitches | Number of total pitches thrown | |
Batted Balls | Batted Balls | Balls hit by a batter during an at-bat that are put into play, excluding foul balls, bunts, and certain other types of non-playable contact. | |
Barrels | Barrels | Batted ball with the perfect combination of exit velocity and launch angle | |
Barrel Percentage | Barrel % | Percentage of batted-ball events classified as barrels, which are associated with a higher likelihood of successful outcomes such as extra-base hits. | |
Barrel per Plate Appearance | Barrel/PA | Average number of barrels a batter produces per plate appearance. | |
Exit Velocity | Exit Velocity | How fast, in miles per hour, a batter hit a ball | |
Maximum Exit Velocity | Max EV | Highest recorded speed at which a batted ball leaves the bat | |
Launch Angle | Launch Angle | Angle at which the ball leaves the bat after a batter's swing | |
Sweet Spot Percentage | Sweet Spot % | Percentage of batted-ball events with a launch angle between 8° and 32° | |
Expected Batting Average | XBA | Likelihood that a batted ball will become a hit | |
Expected Slugging Percentage | XSLG | Estimates the slugging percentage a player is expected to have based on the quality and outcomes of their batted balls | |
Weighted On Base Average | WOBA | Combines all the different aspects of hitting into one metric, weighting each of them in proportion to their actual run value | |
Expected Weighted On-Base Average on Contact | XWOBACON | Estimates the weighted on-base average a player is expected to have solely on balls put into play | |
Hard Contact Percentage | HardHit % | Percentage of hard-hit batted balls | |
Strikeout Percentage | K% | Frequency with which the batter has struck out, calculated as strikeouts divided by plate appearances | |
Walk Percentage | BB% | Frequency with which the batter has walked, calculated as walks divided by plate appearances | |
KBO |
Age | Age | Player's age at a specific point in time. |
Games Played | G | Number of games in which the player has appeared | |
Plate Appearances | PA | Number of times the player has come to the plate | |
At Bats | AB | Number of trips to the plate in which the batter does not walk, get hit by a pitch, sacrifice (fly or bunt), or reach on interference | |
Runs Scored | R | Number of runs scored | |
Hits | H | Number of hits | |
Doubles | 2B | Number of doubles | |
Triples | 3B | Number of triples | |
Home Runs | HR | Number of home runs | |
Slugging | Slugging | Total number of bases a batter accumulates per at-bat | |
Runs Batted In | RBI | Number of times a run scores as a result of a batter's plate appearance, not counting situations in which an error caused the run to score, or the batter hit into a double play | |
Stolen Bases | SB | Number of stolen bases | |
Caught Stealing | CS | Number of times caught stealing | |
Walks | BB | Total number of walks | |
Hit By Pitches | HBP | Number of times the batter reached after being hit by a pitch | |
Intentional Walks | IBB | Number of times the batter was intentionally walked | |
Strikeouts | SO | Number of strikeouts | |
Grounded into Double Play | GDP | Number of times the batter hit into a double play | |
Sacrifice Flies | SF | Number of times a batter's fly out allowed a runner to tag up and score | |
Sacrifice Bunts | SH | Any bunt in which there was a runner on base and less than two outs in which the batter was put out and at least one runner advanced | |
Batting Average | AVG | Rate of hits per at bat, calculated as H/AB | |
On Base Percentage | OBP | Rate at which the batter reaches base, calculated as (H + BB + HBP)/(AB + BB + HBP + SF) | |
Slugging Percentage | SLG | Average number of total bases per at bat, calculated as Total Bases/AB | |
On Base Plus Slugging | OPS | Combination of OBP and SLG, calculated as OBP + SLG | |
Weighted On Base Average | wOBA | Combines all the different aspects of hitting into one metric, weighting each of them in proportion to their actual run value | |
Weighted Runs Created Plus | wRC+ | Most comprehensive rate statistic used to measure hitting performance because it considers the varying weights of each offensive action (like wOBA) and then adjusts them for the park and league context in which they took place | |
Wins Above Replacement | WAR | Comprehensive statistic that estimates the number of wins a player has been worth to his team compared to a freely available player such as a minor league free agent |
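The rate statistics in Table 11 can likewise be derived from the counting statistics listed alongside them. The sketch below applies the formulas given in the table for AVG, OBP, SLG, and OPS; total bases are computed with the standard weighting of one to four bases per hit type. The function names and the example season line are hypothetical.

```python
def total_bases(hits, doubles, triples, home_runs):
    """Singles count 1 base, doubles 2, triples 3, and home runs 4."""
    singles = hits - doubles - triples - home_runs
    return singles + 2 * doubles + 3 * triples + 4 * home_runs


def batting_average(hits, at_bats):
    """AVG = H / AB."""
    return hits / at_bats


def on_base_percentage(hits, walks, hbp, at_bats, sac_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    return (hits + walks + hbp) / (at_bats + walks + hbp + sac_flies)


def slugging_percentage(hits, doubles, triples, home_runs, at_bats):
    """SLG = Total Bases / AB."""
    return total_bases(hits, doubles, triples, home_runs) / at_bats


# Hypothetical season line, for illustration only.
line = {"AB": 500, "H": 150, "2B": 30, "3B": 5, "HR": 25, "BB": 50, "HBP": 5, "SF": 5}

avg = batting_average(line["H"], line["AB"])
obp = on_base_percentage(line["H"], line["BB"], line["HBP"], line["AB"], line["SF"])
slg = slugging_percentage(line["H"], line["2B"], line["3B"], line["HR"], line["AB"])
ops = obp + slg  # OPS = OBP + SLG
print(f"AVG {avg:.3f}  OBP {obp:.3f}  SLG {slg:.3f}  OPS {ops:.3f}")
```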
References
- 1.Surging interest in Korean baseball in the United States... ESPN and KBO live Broadcast. https://www.voakorea.com/a/korea_korea-life_aisasociety-baseball/6032559.html Available online:
- 2.http://www.statiz.co.kr/stat.php STATIZ. Available online :
- 3.https://www.koreabaseball.com/TeamRank/TeamRank.aspx KBO Homepage. Available online :
- 4.Elitzur R. Data analytics effects in major league baseball. Omega. 2020;90 [Google Scholar]
- 5.Fialho G., Manhães A., Teixeira J.P. Predicting sports results with artificial intelligence–a proposal framework for soccer games. Proc. Comput. Sci. 2019;164:131–136. [Google Scholar]
- 6.MLB Stats, Scores, History, & Records | Baseball-Reference.com. Available online: https://www.baseball-reference.com/.(Accessed 12 July 2023).
- 7.FanGraphs baseball | baseball statistics and analysis. https://www.fangraphs.com/ Available online :
- 8.Savant Baseball, Players Trending MLB. Statcast and Visualizations | baseballsavant.com. https://baseballsavant.mlb.com/ Available online :
- 9.https://www.prosportstransactions.com/baseball/index.htm Professional Baseball Transactions Archive. Available online :
- 10.Huang M.-L., Li Y.-Z. Use of machine learning and deep learning to predict the outcomes of major league baseball matches. Appl. Sci. 2021;11:4499. [Google Scholar]
- 11.Yaseen A.S., Marhoon A.F., Saleem S.A. Multimodal machine learning for major league baseball playoff prediction. Informatica. 2022:46. [Google Scholar]
- 12.Valero C.S. Predicting Win-Loss outcomes in MLB regular season games–A comparative study using data mining methods. Int. J. Comput. Sci. Sport. 2016;15:91–112. [Google Scholar]
- 13.Elfrink T. Vrije Universiteit Amsterdam; 2018. Predicting the Outcomes of MLB Games with a Machine Learning Approach. [Google Scholar]
- 14.Horvat T., Job J. The use of machine learning in sport outcome prediction: a review. Wiley Interdiscip. Rev.: Data Min. Knowl. Disc. 2020;10 [Google Scholar]
- 15.Osken C., Onay C. Predicting the winning team in basketball: a novel approach. Heliyon. 2022:8. doi: 10.1016/j.heliyon.2022.e12189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Wright R.E. In: Reading and Understanding Multivariate Statistics. Grimm L.G., Yarnold P.R., editors. American Psychological Association; 1995. Logistic regression; pp. 217–244. [Google Scholar]
- 17.Zhang C., Ma Y. Springer; 2012. Ensemble Machine Learning: Methods and Applications. [Google Scholar]
- 18.Chen T., Guestrin C. Xgboost. Proceedings of the Proceedings of the 22nd Acm Sigkdd International Conference on Knowledge Discovery and Data Mining. 2016. A scalable tree boosting system; pp. 785–794. [Google Scholar]
- 19.Ke G., Meng Q., Finley T., Wang T., Chen W., Ma W., Ye Q., Liu T.-Y. Lightgbm: a highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017;30 [Google Scholar]
- 20.Prokhorenkova L., Gusev G., Vorobev A., Dorogush A.V., Gulin A. CatBoost: unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 2018;31 [Google Scholar]
- 21.Parkhi O., Vedaldi A., Zisserman A. Proceedings of the British Machine Vision Conference. 2015. Deep face recognition; p. 2015. [Google Scholar]
- 22.Huang G.B., Mattar M., Berg T., Learned-Miller E. Proceedings of the Workshop on Faces in'Real-Life'Images: Detection, Alignment, and Recognition. 2008. Labeled faces in the wild: a database forstudying face recognition in unconstrained environments. [Google Scholar]
- 23.Guo Y., Zhang L., Hu Y., He X., Gao J. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part III. 2016. MS-Celeb-1M: a dataset and benchmark for large-scale face recognition; pp. 87–102.
- 24.Girdhar R., El-Nouby A., Liu Z., Singh M., Alwala K.V., Joulin A., Misra I. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. ImageBind: one embedding space to bind them all; pp. 15180–15190.
- 25.Liu H., Ding H., Xuan J., Gao X., Huang X. The functional movement screen predicts sports injuries in Chinese college students at different levels of physical activity and sports performance. Heliyon. 2023;9. doi: 10.1016/j.heliyon.2023.e16454.
- 26.Trovato B., Petrigna L., Sortino M., Roggio F., Musumeci G. The influence of different sports on cartilage adaptations: a systematic review. Heliyon. 2023;9. doi: 10.1016/j.heliyon.2023.e14136.
- 27.Park T., Kim J. A predictive model for a contract renewal of foreign pitchers in KBO using machine learning. Korean Data Inf. Sci. Soc. 2022;33:963–976.
- 28.Shanley E., Kissenberth M.J., Thigpen C.A., Bailey L.B., Hawkins R.J., Michener L.A., Tokish J.M., Rauh M.J. Preseason shoulder range of motion screening as a predictor of injury among youth and adolescent baseball pitchers. J. Shoulder Elbow Surg. 2015;24:1005–1013. doi: 10.1016/j.jse.2015.03.012.
- 29.Statcast Search | baseballsavant.com. Available online: https://baseballsavant.mlb.com/statcast_search
- 30.Pollack K.M., D'Angelo J., Green G., Conte S., Fealy S., Marinak C., McFarland E., Curriero F.C. Developing and implementing Major League Baseball's health and injury tracking system. Am. J. Epidemiol. 2016;183:490–496. doi: 10.1093/aje/kwv348.
- 31.KBO Homepage. Available online: https://www.koreabaseball.com/News/Notice/View.aspx?bdSe=8542
- 32.Exit velocity (EV) | Glossary | MLB.com. Available online: https://www.mlb.com/glossary/statcast/exit-velocity
- 33.Statcast | Glossary | MLB.com. Available online: https://www.mlb.com/glossary/statcast
Data Availability Statement
Has data associated with your study been deposited into a publicly available repository?
Yes. All data from this study are included in the article, code, or supplementary material, or are referenced in the article. Datasets are available on GitHub at https://github.com/Ptaeshin/Prediction-of-Re-signing-Foreign-Players/tree/master or upon request from the author.