Machine learning-based optimization of contract renewal predictions in Korea Baseball organization

Taeshin Park; Jaeyun Kim

doi:10.1016/j.heliyon.2023.e23231

Heliyon. 2023 Dec 3;9(12):e23231.

PMCID: PMC10750069 PMID: 38149193

Abstract

The Korea Baseball Organization (KBO) introduced the foreign player system in 1998 to enhance league competitiveness. In 2014, the number of foreign players increased to three per team. Since then, the ten KBO teams have routinely included two foreign pitchers and one batter. While the performance of foreign players significantly impacts the postseason qualification of the team, the contract renewal rates for pitchers and batters are only 34 % and 36 %, respectively. Therefore, a method that can aid in the contract renewal decision can help teams recruit high-caliber foreign players, improve their performance, raise the level of the KBO league, and provide an enjoyable experience for baseball fans. In this study, we use machine learning methods to predict the contract renewal decision and compare the prediction performances of various models. We use data on foreign player performances in Minor League Baseball and Major League Baseball immediately prior to joining KBO, their performances in KBO upon joining, and player image data. By comparing the accuracy, area under the receiver-operating characteristic curve, and precision of prediction results based on performance in each league, we find that performance in KBO plays a significant role in improving the prediction. Additionally, a post-hoc analysis of batters reveals a gradual decline in the performance level of foreign batters who succeeded in KBO, which is found to be related to the results of international baseball tournaments. In conclusion, the proposed approach to player performance evaluation and contract renewal decisions can contribute to the long-term success of the teams and league.

Keywords: Contract renewal, Foreign players, International baseball tournaments, Korea baseball organization, Machine learning

1. Introduction

In May 2020, during the COVID-19 pandemic, ESPN started broadcasting games from the Korea Baseball Organization (KBO) live. This move sparked a surge of interest in KBO among fans and the American sports industry. Korean players like Chan-ho Park, Hyun-jin Ryu, and Jung-ho Kang, who had previously demonstrated their skills in Major League Baseball (MLB), drew significant attention from the American sports community and fans. Their performances fueled discussions on American media and social platforms. In June 2020, the Asia Society, an international academic research organization, hosted an online event themed “The Fascinating World of Korean Professional Baseball.” During this event, Matt Schiavenza, the content director of the organization, expressed his admiration for Korean baseball. He stated, “Baseball fans around the world now have the opportunity to witness one of the best professional baseball leagues in the world, the KBO” [1].

KBO has a 10-team structure. Similarly to MLB, KBO implements the designated hitter system, allowing a total of ten player positions in the game. To promote league competitiveness and provide fresh entertainment for baseball fans, KBO introduced a foreign player system in 1998, enabling each team to have up to two foreign players. Initially, most teams recruited two foreign pitchers. However, since 2014, the limit for foreign players has been increased to three per team, allowing the recruitment of foreign batters. Currently, each team commonly adopts a configuration of two pitchers and one batter, which is the prevailing practice in the league.

Each KBO team strives to attract audiences and generate profits by targeting postseason qualification, regular season championships, and the Korean Series championships. An examination of the Wins Above Replacement (WAR) records for the past eight years, maintained by Statiz, for each position reveals that teams with the top five pitchers have a 77.5 % probability of making the postseason, whereas teams with top-ranking shortstops have a 72.5 % probability, indicating the significant impact of pitchers [2]. Because KBO operates with a five-man starting rotation, with two spots typically filled by foreign pitchers, the importance of scouting for talented players becomes even more crucial.

Foreign KBO players do not sign multi-year contracts; instead, their contracts are renewed every season. If a player's performance is deemed unsatisfactory, the team must still pay the full contract amount upon their release, resulting in financial losses. Therefore, acquiring talented players is crucial for teams to avoid losses and increase their probability of postseason qualification. A team's decision to renew a player's contract can be interpreted as a sign of satisfaction with their performance and can be considered a successful scouting effort. In fact, over the past eight years, the probability of a team making the postseason when a player who debuted in KBO renews their contract is approximately 64 % for batters and 50.85 % for pitchers [2,3]. If at least one of the three foreign players secures contract renewal, the probability of postseason qualification exceeds 50 %.

Every year, each KBO team dispatches scouts to recruit foreign players who can perform well in the league. However, the contract renewal rates for foreign pitchers and batters were 34 % and 36 %, respectively, for the 2014–2022 seasons, highlighting the challenges faced by scouts in finding foreign players who could succeed in KBO [3]. This suggests that scouts from KBO teams are uncertain about which evaluation metrics from MLB and Minor League Baseball (MiLB) are particularly important for scouting. The importance of sabermetrics was highlighted in the movie “Moneyball” and Michael Lewis's book, “Moneyball: The Art of Winning an Unfair Game,” emphasizing the significance of data analysis in the field of sports [4]. Furthermore, with the development of the sports gaming industry, the use of analytical tools to predict the outcomes of sports games has become increasingly important [5]. Recent studies have used data from sources such as Baseball-Reference, FanGraphs, Baseball Savant, and the Professional Baseball Transactions Archive [6–9] to conduct research using machine learning and deep learning techniques. Huang and Li [10] proposed the use of the artificial neural network (ANN), support vector machine (SVM), and one-dimensional convolutional neural network for win–loss prediction. Yaseen et al. [11] used logistic regression (LR) and SVM to predict playoff qualifications. Valero [12] used the ANN, SVM, Decision Tree, and K-NN models to predict the results of MLB games. Elfrink [13] constructed models using random forest (RF), XGBoost, linear models, and boosted LR to predict game outcomes.

As is evident from the aforementioned studies, research in the field of baseball has primarily focused on predicting game outcomes. Similar research has been performed in other sports disciplines, again predominantly focusing on predicting game outcomes [14,15]. However, this study focuses on predicting the performance of foreign players, which plays a vital role in determining team rankings, postseason qualifications, and contract renewals in KBO. Moreover, with the increase in the popularity of KBO among foreign players, inferring the performance required for foreign players to secure contract renewals is also necessary. This study focuses on the performance levels of foreign players who have undergone contract renewals and their correlation with the international achievements of the South Korean baseball teams. Experimental results suggest that player performance in KBO plays a significant role in improving the prediction.

The aim of this study is to elucidate the impact of foreign players' performance in KBO on contract renewal decisions, with the goal of deriving valuable insights for scouting and team management. Through this research, we aim to address the following questions: Firstly, which performance metrics are particularly crucial in scouting? Secondly, do KBO performances of foreign players influence scouting and contract renewal decisions? Lastly, can the findings of this study be applied to overseas baseball leagues, including but not limited to the United States, Japan, and Australia, which recruit foreign players? The implementation of these findings can provide a robust framework for player recruitment, reducing the risk of financial losses associated with scouting failures and fostering a consistent performance in international competitions.

2. Materials and methods

2.1. Machine learning models

Five machine learning models were used in this study. LR is a linear model for binary classification: it passes a linear combination of input variables and weights through a sigmoid function, producing probability values between 0 and 1, and classifies data according to a decision boundary on that probability [16]. RF is an ensemble learning method that constructs multiple decision trees to make predictions. Each tree is trained on randomly selected data, and the final prediction is obtained by aggregating the predictions of the individual trees. RF is effective for reducing overfitting and improving prediction performance [17]. XGBoost is an ensemble learning model based on gradient-boosted trees. It trains trees iteratively to minimize the loss function, using the prediction results of the current trees when training the next tree to improve prediction performance. It is known for its excellent prediction performance and scalability [18]. LightGBM is a lightweight gradient-boosting framework that grows trees efficiently by taking the characteristics and distribution of the data into account. It provides fast training and prediction speeds and demonstrates excellent performance even on large-scale datasets [19]. CatBoost is a gradient-boosting algorithm designed to handle categorical variables. It transforms categorical variables automatically and balances the tree structure, which reduces data bias and helps prevent overfitting [20].
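As a minimal sketch of the LR decision rule described above, the following computes a renewal probability from a weighted sum passed through the sigmoid. The weights, bias, input values, and the 0.5 threshold are illustrative assumptions, not fitted values from this study.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_renewal_probability(features, weights, bias):
    """Linear combination of inputs and weights passed through the sigmoid."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical standardized inputs for one pitcher: [ERA, WHIP, WAR]
weights = [-0.8, -0.5, 1.2]   # assumed for illustration, not fitted
bias = -0.3                   # assumed for illustration, not fitted
pitcher = [-1.0, -0.5, 1.5]

prob = predict_renewal_probability(pitcher, weights, bias)
predicted_renewal = prob >= 0.5  # decision boundary at probability 0.5
```

A probability of exactly 0.5 corresponds to a linear score of zero, so the decision boundary in feature space is the hyperplane where the weighted sum plus bias vanishes.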

2.2. VGGFace

The objective of VGGFace, proposed by Parkhi et al. [21], is to achieve human-level performance in the task of face recognition. VGGFace has demonstrated excellent performance on the Labeled Faces in the Wild dataset, which consists of diverse face images collected from around the world [22], and the YouTube Faces Database, which is composed of various face images extracted from YouTube videos [23]. It is a deep learning-based model used for face image classification and face feature vector extraction. This model is based on the VGGNet architecture and learns diverse facial features using deep neural networks and convolutional layers with filters. It outperforms other face-recognition algorithms and has proven its potential for practical applications. The VGGFace model can be configured using one of the following model parameters: VGG16, ResNet50, or Senet50. This choice influences the performance and characteristics of the model. The model extracts features by passing input images of size 224 × 224 through multiple convolutional layers. The output of the last convolutional layer is fed to fully connected layers to generate features based on the chosen model parameters. In this study, optimization was performed based on these parameter values.

2.3. Data sources and collection

The MLB, MiLB, and KBO data used in this study were obtained from Korea Baseball, Statiz, Baseball-Reference, FanGraphs, and Baseball Savant, which are publicly available, validated sources that have been used in related research [3,4,6–8]. We used Korea Baseball and Statiz for data on KBO player performances for the season immediately prior to entry, team rankings upon entry, and injury history. Injuries were encoded and considered only in cases where players were listed as injured for more than four weeks. FanGraphs and Baseball Savant were used for MLB and MiLB performance data, as well as tracking data, for the season immediately prior to entry to KBO. If no MLB or MiLB records were available for the season immediately prior to entry to KBO, records from the past three years were used. Players with no records within the past three years were excluded from the analysis. In total, the data of 123 pitchers and 77 hitters were selected for this study. Among them, 43 pitchers (35 %) and 28 hitters (37 %) secured contract renewals.
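The record-selection rule above can be sketched as follows. The function name, the data layout, and the choice to fall back to the single most recent season inside the three-year window are assumptions made for illustration; the text does not state how multiple earlier seasons were combined.

```python
def select_reference_season(records_by_season, kbo_entry_year, lookback=3):
    """Return the season whose records feed the model, or None to exclude the player.

    Prefers the season immediately before KBO entry; otherwise falls back to the
    most recent season inside the lookback window (an assumed tie-breaking rule).
    """
    for offset in range(1, lookback + 1):
        season = kbo_entry_year - offset
        if season in records_by_season:
            return season
    return None

# Hypothetical players who all entered KBO in 2019
players = {
    "player_a": {2018: {"FIP": 3.4}},                       # played the prior season
    "player_b": {2016: {"FIP": 4.1}, 2017: {"FIP": 3.9}},   # no 2018 records
    "player_c": {2012: {"FIP": 3.0}},                       # nothing within three years
}
selected = {name: select_reference_season(recs, 2019) for name, recs in players.items()}
```

Players mapped to `None` correspond to those excluded from the analysis for having no records within the past three years.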

According to Girdhar et al. [24], merging image and numerical data improves prediction performance. Therefore, in this study, we compared the performances of using only numerical data and merging image and numerical data by collecting player profile picture data from Baseball-Reference and performing optimization.

Two virtual environments were used in this study. The first environment was utilized to embed image data, and the second was designated for preprocessing and modeling. In the first environment, Python 3.7.16, Keras 2.3.1, and TensorFlow 1.14.0 were employed. The second environment incorporated XGBoost 1.5.0, LightGBM 3.2.1, CatBoost 1.0.6, Python 3.9.13, scikit-learn 1.1.0 (LR, RF), and SHAP 0.41.0. The model architectures can also be examined on our GitHub repository at https://github.com/Ptaeshin/Prediction-of-Re-signing-Foreign-Players/tree/master.

2.4. Research framework

The objective of this study was to predict the contract renewals of foreign players in KBO. The framework followed is shown in Fig. 1 and is implemented for both pitchers and hitters. During the data collection phase, three main components were collected: MLB or MiLB data prior to KBO entry, KBO entry season data, and face image data. In the preprocessing phase, standardization was performed using season-specific data from FanGraphs for both MLB and MiLB. For instance, an FIP of 3.0 in the 2015 season and an FIP of 3.0 in the 2016 season are not directly comparable because the mean and standard deviation of the data differ between seasons. Therefore, separate standardization was conducted for both pitchers and hitters, utilizing the complete player data for each season in MLB and MiLB. While MLB and MiLB provide data for all players per season, certain evaluation metrics in KBO are only available for the top 30 players, making standardization difficult. Therefore, standardization was omitted for the KBO data. Image data were embedded using the VGGFace model. The modeling phase involved the use of LR and four tree-based models.
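Season-wise standardization can be sketched as a z-score computed against each season's league-wide mean and standard deviation. The function name, the toy league data, and the use of the population standard deviation are assumptions for illustration; the text does not specify which estimator was used.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize_by_season(rows, metric):
    """Z-score `metric` within each season using that season's mean and SD."""
    by_season = defaultdict(list)
    for row in rows:
        by_season[row["season"]].append(row[metric])
    stats = {s: (mean(v), pstdev(v)) for s, v in by_season.items()}

    out = []
    for row in rows:
        mu, sigma = stats[row["season"]]
        z = (row[metric] - mu) / sigma if sigma > 0 else 0.0
        out.append({**row, metric + "_z": z})
    return out

# Hypothetical league-wide FIP values for two seasons
league = [
    {"player": "a", "season": 2015, "FIP": 3.0},
    {"player": "b", "season": 2015, "FIP": 4.0},
    {"player": "c", "season": 2015, "FIP": 5.0},
    {"player": "d", "season": 2016, "FIP": 2.0},
    {"player": "e", "season": 2016, "FIP": 3.0},
    {"player": "f", "season": 2016, "FIP": 4.0},
]
scaled = {r["player"]: r["FIP_z"] for r in standardize_by_season(league, "FIP")}
```

In this toy data, players "a" and "e" share a raw FIP of 3.0 but receive different standardized values because the league averages differ between the two seasons.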

Based on modeling results, performance improvement techniques were applied by merging numerical and image data. A comparison was made between the original embedding results of the VGGFace model and the dimensionality reduction of image data using principal component analysis (PCA) and kernel PCA.

3. Results

In this study, Python was used to perform fivefold cross-validation with fixed seed numbers, using MLB records, MiLB records, and image data as independent variables and contract renewal status as the dependent variable. Five models, namely, LR, RF, XGBoost, LightGBM, and CatBoost classification, were used for the analyses. It is worth noting that, for all models, default hyperparameter settings consistently yielded the best performance, outperforming model-specific hyperparameter tuning. Accuracy, area under the receiver-operating characteristic curve (AUC), and precision were used as evaluation metrics. Furthermore, for performance improvement, label encoding was performed based on the presence of an injury history lasting for more than four weeks during the player's tenure in KBO. Additionally, team performance was considered an independent variable because teams may seek to recruit higher-level players based on their team rankings. While the injury history prior to joining KBO is an important factor in scouting [25,26], collecting the MiLB injury records at the level of KBO scouting is challenging because the Pro Sports Transactions Archives only provide MLB injury records. Based on the results of the analysis, Shapley Additive Explanations (SHAP) was used to examine the important features. SHAP is an algorithm used to explain the predictions of machine learning models by evaluating the contribution of each feature to the prediction. Inferring trends at KBO level was achieved by performing data analysis on features of high importance. Furthermore, owing to variations in the average and standard deviation of records across seasons, data were standardized based on the season before making predictions.
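The three evaluation metrics can be written out directly for a single fold, as in the sketch below. The labels and scores are hypothetical, the 0.5 threshold is an assumption, and returning a precision of 0 when no renewal is predicted is an assumed convention.

```python
def accuracy(y_true, y_pred):
    """Share of players whose renewal status was predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred):
    """TP / (TP + FP); taken as 0.0 when no renewal is predicted (assumed convention)."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == 1]
    return sum(predicted_pos) / len(predicted_pos) if predicted_pos else 0.0

def roc_auc(y_true, scores):
    """Probability that a random renewed player outranks a random non-renewed one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical fold: 1 = contract renewed, 0 = not renewed
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.6]          # predicted renewal probabilities
y_pred = [int(s >= 0.5) for s in scores]     # assumed 0.5 threshold

fold_metrics = {
    "ACC": accuracy(y_true, y_pred),
    "AUC": roc_auc(y_true, scores),
    "PRE": precision(y_true, y_pred),
}
```

Averaging such per-fold values over the five folds gives summary rows of the kind reported in the tables that follow.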

3.1. Re-contracting prediction based on MiLB records

For the MiLB pitcher data, modeling was performed with 29 independent variables collected from FanGraphs and contract renewal status obtained from Statiz as the dependent variable. Among the 123 pitchers, those who had no MiLB (AAA) records in the three seasons preceding their KBO entry season or who pitched fewer than 30 innings in MiLB or KBO were excluded. Therefore, the data of only 103 players were analyzed. For MiLB hitter data, modeling was performed with 22 independent variables obtained from FanGraphs and contract renewal status obtained from Statiz as the dependent variable. Among the 77 players, those with fewer than 30 plate appearances in KBO were excluded. Consequently, the data of only 76 players were used for modeling.

3.1.1. Nonstandardization

The results of modeling without season-wise standardization are presented in Table 1. The LR model achieved the highest average accuracy and precision for both pitchers and batters and the highest average AUC for pitchers; for batters, the tree-based models achieved higher AUC values. However, the accuracy and AUC values indicate that overall model performance was not good. The LR precision values for pitchers and batters were 0.438 and 0.507, respectively, which were higher than the contract renewal rates of 0.35 for pitchers and 0.37 for batters. We discuss these findings in more detail in Section 4.
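Precision is compared with the renewal rate because a rule that predicts renewal for every player attains a precision equal to that rate. The sketch below makes the comparison explicit using the LR averages from Table 1 and the full-sample counts from Section 2.3. The renewal counts within the filtered MiLB subsets are not reported, so the full-sample counts are an approximation, and the `lift` ratio is a name introduced here for illustration.

```python
def precision_of_always_renew(n_renewed, n_total):
    """Precision of a rule that predicts renewal for every player."""
    return n_renewed / n_total

# Counts from Section 2.3: 43 of 123 pitchers and 28 of 77 batters renewed.
baseline_pitcher = precision_of_always_renew(43, 123)
baseline_batter = precision_of_always_renew(28, 77)

# LR average precision from Table 1 (nonstandardized MiLB records).
lr_precision = {"pitcher": 0.438, "batter": 0.507}

# Ratio above 1 means the model's positive predictions beat the trivial rule.
lift = {
    "pitcher": lr_precision["pitcher"] / baseline_pitcher,
    "batter": lr_precision["batter"] / baseline_batter,
}
```

A ratio above 1 indicates that a player flagged by the model is more likely to be renewed than a player picked at random, even when accuracy and AUC remain modest.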

Table 1.

Prediction Results based on Nonstandardized MiLB Records (Accuracy: ACC; Precision: PRE).

Models
		LR			RF			XGBoost			LightGBM			CatBoost
	CV	ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE
Pitcher	1	0.667	0.529	0.571	0.429	0.25	0	0.381	0.135	0	0.524	0.240	0.25	0.429	0.240	0
	2	0.476	0.529	0.2	0.524	0.254	0.333	0.476	0.413	0.333	0.476	0.404	0.256	0.524	0.529	0.333
	3	0.619	0.558	0.5	0.714	0.514	0.4	0.571	0.510	0.4	0.571	0.413	0.429	0.571	0.490	0.4
	4	0.7	0.615	0.667	0.6	0.505	0	0.65	0.538	0.5	0.55	0.440	0.333	0.7	0.549	0.667
	5	0.5	0.438	0.25	0.55	0.323	0.5	0.5	0.592	0.333	0.4	0.271	0.167	0.55	0.344	0.333
	Average	0.592	0.533	0.438	0.563	0.423	0.247	0.516	0.378	0.313	0.504	0.354	0.293	0.554	0.431	0.347
Batter	1	0.625	0.6	0.5	0.563	0.45	0.25	0.5	0.417	0.375	0.625	0.583	0.5	0.438	0.467	0.2
	2	0.667	0.611	0.667	0.533	0.491	0.333	0.467	0.463	0.25	0.467	0.472	0	0.533	0.5	0.333
	3	0.6	0.611	0.5	0.6	0.556	0.5	0.533	0.593	0.333	0.6	0.630	0.5	0.467	0.630	0
	4	0.733	0.48	0.667	0.667	0.98	1	0.733	0.82	0.667	0.733	0.8	1	0.8	0.96	1
	5	0.467	0.47	0.2	0.467	0.61	0.2	0.533	0.63	0.333	0.533	0.61	0.333	0.533	0.65	0.333
	Average	0.618	0.554	0.507	0.566	0.617	0.457	0.553	0.584	0.392	0.592	0.619	0.467	0.554	0.641	0.373

3.1.2. Preprocessing—standardization

The results of modeling with MiLB data standardized by season are presented in Table 2. For pitchers, the performance of the LR model decreased on all three metrics, whereas the XGBoost and LightGBM models improved on all three. For batters, performance declined in most models. However, the LR precision for batters was 0.477, which was still higher than the contract renewal rate of 0.37 for batters.

Table 2.

Prediction Results based on Standardized MiLB Records (Accuracy: ACC; Precision: PRE).

Models
		LR			RF			XGBoost			LightGBM			CatBoost
	CV	ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE
Pitcher	1	0.333	0.337	0.2	0.429	0.361	0.333	0.476	0.337	0.333	0.381	0.25	0.273	0.429	0.337	0.3
	2	0.524	0.5	0.25	0.619	0.471	0.4	0.667	0.548	0.571	0.762	0.587	0.714	0.667	0.510	0.6
	3	0.333	0.356	0.2	0.571	0.312	0.333	0.524	0.404	0.25	0.476	0.462	0	0.571	0.385	0.333
	4	0.5	0.297	0.286	0.55	0.269	0.333	0.55	0.395	0.333	0.65	0.505	0.5	0.6	0.495	0
	5	0.5	0.5	0.333	0.55	0.417	0	0.6	0.458	0.5	0.45	0.406	0.286	0.55	0.512	0.4
	Average	0.438	0.398	0.253	0.544	0.367	0.280	0.563	0.428	0.400	0.544	0.442	0.355	0.563	0.453	0.327
Batter	1	0.688	0.667	0.6	0.438	0.408	0	0.375	0.383	0.167	0.5	0.517	0.333	0.5	0.333	0
	2	0.6	0.611	0.5	0.533	0.472	0.333	0.533	0.519	0.333	0.4	0.546	0	0.467	0.519	0
	3	0.533	0.593	0.333	0.533	0.481	0.333	0.467	0.556	0.333	0.6	0.5	0.5	0.533	0.556	0.333
	4	0.8	0.72	0.75	0.8	0.97	1	0.733	0.8	0.667	0.8	0.7	1	0.8	0.94	1
	5	0.467	0.54	0.2	0.467	0.59	0.25	0.6	0.7	0.429	0.533	0.6	0.25	0.467	0.56	0.2
	Average	0.618	0.627	0.477	0.554	0.584	0.383	0.542	0.591	0.386	0.567	0.573	0.417	0.553	0.581	0.307

3.2. Re-contracting prediction based on MLB records

For the MLB pitcher data, we used 20 independent variables obtained from Baseball Savant. Of the 123 players, MLB records for only 66 players were available and were included in the analysis. For the batter data, we used 18 independent variables obtained from Baseball Savant. Of the 77 players, 45 had MLB records and were included in the analysis.

3.2.1. Nonstandardization

The results of modeling using the MLB data without standardization preprocessing are listed in Table 3. For pitchers, no significant difference was found when compared with that using the MiLB data. However, for batters, the performance was lower than that when using MiLB data as an independent variable.

Table 3.

Prediction Results based on Nonstandardized MLB Records (Accuracy: ACC; Precision: PRE).

Models
		LR			RF			XGBoost			LightGBM			CatBoost
	CV	ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE
Pitcher	1	0.286	0.311	0	0.714	0.311	0	0.714	0.622	0.667	0.643	0.578	0	0.5	0.556	0
	2	0.462	0.425	0.25	0.538	0.5	0.25	0.462	0.425	0.25	0.692	0.575	0.667	0.462	0.425	0.25
	3	0.231	0.15	0	0.538	0.275	0	0.538	0.275	0.333	0.462	0.525	0.25	0.615	0.4	0.5
	4	0.462	0.375	0	0.615	0.575	0	0.692	0.5	0.667	0.462	0.425	0	0.692	0.625	1
	5	0.615	0.556	0.333	0.462	0.389	0	0.615	0.417	0.4	0.538	0.556	0.25	0.385	0.333	0
	Average	0.411	0.363	0.117	0.574	0.410	0.05	0.604	0.448	0.463	0.559	0.532	0.233	0.531	0.468	0.35
Batter	1	0.667	0.25	0	0.556	0.139	0	0.444	0.25	0	0.667	0.5	0	0.556	0.25	0
	2	0.444	0.556	0.25	0.333	0.25	0.25	0.333	0.333	0	0.667	0.5	0	0.444	0.167	0.25
	3	0.556	0.722	0.333	0.556	0.556	0	0.667	0.5	0.5	0.667	0.5	0	0.556	0.5	0
	4	0.444	0.278	0	0.556	0.333	0	0.667	0.444	0.5	0.667	0.5	0	0.556	0.444	0
	5	0.667	0.611	0.5	0.667	0.611	0.5	0.444	0.556	0.25	0.667	0.5	0	0.667	0.5	0.5
	Average	0.556	0.483	0.217	0.533	0.378	0.15	0.511	0.417	0.25	0.667	0.5	0	0.556	0.372	0.15

3.2.2. Preprocessing—standardization

The results of modeling using MLB data standardized by season are presented in Table 4. Even after standardization, no significant difference was found when compared with the result obtained using MiLB data. However, for batters, when using the LR model, precision was higher at 0.6 when compared with those in the other cases.

Table 4.

Prediction Results based on Standardized MLB Records (Accuracy: ACC; Precision: PRE).

Models
		LR			RF			XGBoost			LightGBM			CatBoost
	CV	ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE
Pitcher	1	0.5	0.467	0.333	0.643	0.689	0	0.786	0.711	1	0.643	0.422	0	0.786	0.711	1
	2	0.538	0.375	0.4	0.615	0.475	0.333	0.615	0.525	0.5	0.462	0.4	0.25	0.615	0.525	0.5
	3	0.538	0.7	0	0.692	0.725	1	0.615	0.55	0.5	0.615	0.7	0.5	0.615	0.65	0.5
	4	0.462	0.2	0	0.462	0.138	0.25	0.308	0.15	0	0.308	0.025	0	0.385	0.2	0
	5	0.385	0.528	0.167	0.538	0.514	0	0.538	0.417	0	0.615	0.528	0.333	0.538	0.417	0
	Average	0.485	0.454	0.18	0.590	0.508	0.317	0.573	0.471	0.4	0.529	0.415	0.217	0.588	0.501	0.4
Batter	1	0.444	0.361	0	0.556	0.361	0	0.556	0.194	0	0.667	0.5	0	0.556	0.25	0
	2	0.889	0.722	1	0.667	0.5	0.5	0.556	0.667	0.4	0.667	0.5	0	0.667	0.556	0.5
	3	0.778	0.778	1	0.667	0.556	0	0.667	0.389	0	0.667	0.5	0	0.556	0.389	0
	4	0.778	0.333	1	0.556	0.556	0.5	0.444	0.5	0	0.667	0.5	0	0.667	0.444	0
	5	0.333	0.278	0	0.556	0.417	0	0.444	0.444	0	0.667	0.5	0	0.556	0.444	0
	Average	0.644	0.494	0.6	0.6	0.478	0.2	0.533	0.439	0.08	0.667	0.5	0	0.6	0.417	0.1

3.3. Re-contracting prediction based on face image optimization

The VGGFace model supports various embedding structures based on model parameter values. Table 5 lists the modeling results for pitchers and batters based on various values for model parameters. When using Resnet50 and Senet50 as model parameters, 2048 features were generated, whereas using VGG16 resulted in 512 features. Analyses were performed for all 123 pitchers and 77 batters. For both pitchers and batters, no significant differences in accuracy or AUC were found based on the model parameters. However, for pitchers, the Resnet50 parameter achieved higher performance on all three evaluation metrics when compared with that using only MiLB or MLB data. Similarly, for batters, the Resnet50 parameter outperformed the other parameters in terms of performance.

Table 5.

Prediction Results based on Face Image (Accuracy: ACC; Precision: PRE).

Models
		LR			RF			XGBoost			LightGBM			CatBoost
	Model Parameter	ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE
Pitcher	Resnet50 (2048 features)	0.659	0.556	0.2	0.642	0.582	0.667	0.586	0.585	0.35	0.634	0.603	0.476	0.667	0.621	0.58
	Senet50 (2048 features)	0.659	0.443	0.2	0.610	0.561	0.433	0.552	0.465	0.347	0.576	0.478	0.317	0.618	0.556	0.467
	Vgg16 (512 features)	0.651	0.538	0	0.634	0.579	0	0.634	0.526	0.538	0.578	0.507	0.358	0.659	0.516	0.2
Batter	Resnet50 (2048 features)	0.650	0.569	0.1	0.650	0.489	0.1	0.651	0.567	0.32	0.586	0.619	0.409	0.571	0.536	0.233
	Senet50 (2048 features)	0.624	0.565	0	0.624	0.585	0	0.546	0.443	0.301	0.634	0.517	0.481	0.624	0.590	0.2
	Vgg16 (512 features)	0.650	0.5	0	0.650	0.484	0.3	0.610	0.5	0.38	0.557	0.463	0.280	0.611	0.467	0

3.4. Performance improvement techniques—image data fusion

Following the research by Girdhar et al. [24], we merged image and numerical data to improve prediction performance. Based on the results listed in Table 5, the Resnet50 parameter exhibited better performance than the other parameters; therefore, Resnet50 was selected. The analysis included the 66 pitchers and 45 batters with MLB data. The variable subsets are defined as follows:

• Face + Major: The 2048 features obtained from the embedding results were combined with the MLB (major league) data variables (pitchers: 20; batters: 18).
• Face PCA + Major: The 2048 features obtained from the embedding results were reduced in dimension using PCA. For pitchers, the first and second principal components together accounted for 84 % of the variance; for batters, they accounted for 72 %. Therefore, the two principal components derived from the facial data and the MLB data variables were merged as independent variables.
• Face kernel PCA + Major: As with PCA, both pitcher and batter data underwent dimension reduction using kernel PCA with two principal components, which were then combined with the MLB data variables as independent variables.
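The PCA-based fusion can be illustrated on toy data as below: the embedding is reduced to two principal components, the share of variance they explain is computed, and the components are concatenated with the numerical variables. This is a sketch only. A three-dimensional toy embedding stands in for the 2048 VGGFace features, the eigenvectors are found by plain power iteration rather than a library routine, all names and values are hypothetical, and kernel PCA is not covered.

```python
import math

def center(rows):
    """Subtract the column means from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_components(cov, k, iters=500):
    """Leading k eigenpairs via power iteration with deflation."""
    d = len(cov)
    work = [row[:] for row in cov]
    pairs = []
    for _component in range(k):
        v = [float(i + 1) for i in range(d)]          # fixed, deterministic start
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _step in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(work[i][j] * v[j] for j in range(d)) for i in range(d))
        pairs.append((lam, v))
        for i in range(d):                            # remove the found component
            for j in range(d):
                work[i][j] -= lam * v[i] * v[j]
    return pairs

def fuse(face_embeddings, major_stats, k=2):
    """Project embeddings onto k components and append the numerical variables."""
    centered = center(face_embeddings)
    cov = covariance(centered)
    pairs = top_components(cov, k)
    total_var = sum(cov[i][i] for i in range(len(cov)))
    explained = sum(lam for lam, _ in pairs) / total_var
    fused = [[sum(c * x for c, x in zip(vec, row)) for _, vec in pairs] + list(stats)
             for row, stats in zip(centered, major_stats)]
    return fused, explained

# Hypothetical toy inputs: 4 players, 3-dim "embedding", 2 numerical variables
faces = [[2.0, 0.1, 0.0], [-2.0, -0.1, 0.1], [1.0, 0.9, -0.1], [-1.0, -0.9, 0.0]]
major = [[3.1, 1.2], [4.0, 1.3], [2.8, 1.1], [3.6, 1.4]]
fused, explained = fuse(faces, major)
```

With real embeddings, `explained` plays the role of the cumulative contribution reported above (84 % for pitchers and 72 % for batters with two components).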

The modeling results are listed in Table 6. For both pitchers and batters, no significant difference between the PCA and kernel PCA was found. For pitchers, no major changes in accuracy or AUC were found when image data were integrated; however, precision improved in the RF and XGBoost models. For batters, when the image and numerical data were directly merged without dimension reduction, significant performance improvements were observed in the accuracy, AUC, and precision metrics in the XGBoost model.

Table 6.

Prediction results with combined image and numerical data (accuracy: ACC; precision: PRE).

Subset of Variables
		Face + Major			Face PCA + Major			Face kernel PCA + Major
		ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE
Pitcher	LR	0.560	0.466	0.239	0.484	0.409	0.187	0.485	0.409	0.187
	RF	0.635	0.487	0.207	0.559	0.415	0.407	0.544	0.563	0.567
	XGBoost	0.634	0.478	0.5	0.499	0.478	0.267	0.499	0.478	0.267
	LightGBM	0.530	0.311	0	0.499	0.388	0.133	0.499	0.388	0.133
	CatBoost	0.604	0.514	0.25	0.573	0.501	0.35	0.573	0.501	0.35
Batter	LR	0.633	0.553	0.467	0.633	0.58	0.467	0.633	0.58	0.467
	RF	0.633	0.559	0.267	0.703	0.568	0.52	0.658	0.548	0.5
	XGBoost	0.722	0.742	0.533	0.544	0.538	0.367	0.544	0.538	0.367
	LightGBM	0.658	0.5	0	0.658	0.5	0	0.658	0.5	0
	CatBoost	0.678	0.624	0.5	0.658	0.500	0.3	0.658	0.500	0.3

3.5. Re-contracting prediction based on KBO records

Apart from teams scouting for players, foreign players who aspire to play in KBO may also need to predict their expected performance to secure contract renewal. To address this, we merged the records provided by the official KBO website and Statiz to predict contract renewal. For pitchers, we analyzed the 66 players with MLB records. Considering that pitchers are heavily influenced by both the offensive and defensive capabilities of the team, we used four independent variables to assess their individual abilities: ERA, WHIP, FIP, and WAR (see Tables 10 and 11 for descriptions of these and the other metrics used in this study) [27]. For batters, we analyzed the 45 players with MLB records and used 27 independent variables obtained from Statiz. Because KBO, unlike MLB and MiLB, does not provide season-by-season records for individual players, we were unable to perform season-based standardization. Therefore, we performed predictions without standardization. Additionally, we experimented with incorporating image data into KBO data to evaluate whether it would enhance the performance. The modeling results are presented in Table 7. The analysis revealed that the performance of pitchers on all three metrics declined when image data were added to KBO entry-season data. For batters, accuracy and precision decreased in the LR and RF models, although AUC exhibited a slight increase in some models.

Table 7.

Prediction results with combined image and KBO numerical data (accuracy: ACC; precision: PRE).

Subset of Variables
		KBO			KBO + Face			KBO + Face PCA
		ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE
Pitcher	LR	0.773	0.869	0.783	0.700	0.763	0.6	0.711	0.789	0.613
	RF	0.758	0.865	0.704	0.652	0.674	0.267	0.756	0.769	0.613
	XGBoost	0.744	0.754	0.683	0.727	0.700	0.67	0.710	0.751	0.590
	LightGBM	0.713	0.869	0.679	0.621	0.684	0.467	0.711	0.757	0.567
	CatBoost	0.744	0.874	0.690	0.696	0.746	0.533	0.726	0.777	0.583
Batter	LR	0.911	0.933	0.933	0.888	0.978	0.72	0.889	0.967	0.72
	RF	0.844	0.933	0.77	0.822	0.928	0.683	0.867	0.956	0.733
	XGBoost	0.800	0.894	0.783	0.822	0.944	0.75	0.822	0.939	0.75
	LightGBM	0.667	0.5	0	0.667	0.5	0	0.667	0.5	0
	CatBoost	0.800	0.922	0.75	0.844	0.933	0.783	0.844	0.956	0.783

3.6. Performance improvement techniques—combining KBO data, injury data, and KBO team rankings for comprehensive analysis

For pitchers, considering their higher risk of injuries [28], we included injury history and team ranking as additional variables for predictions (Table 8). Because batters have a lower risk of injury than pitchers, we added only the team ranking as an independent variable for their analysis. The modeling results are presented in Tables 8 and 9. The analysis revealed that the team ranking did not have a significant impact on predictions for either pitchers or batters. However, for pitchers, when injury history data were added, all three metrics (accuracy, AUC, and precision) exhibited improved performance. In particular, the XGBoost model exhibited a substantial increase in precision from 0.683 to 0.96, the highest precision among the models.
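The three pitcher variable subsets can be sketched as simple feature-vector constructions. The field names and values below are hypothetical, and treating "more than four weeks" as any single injured-list stint longer than 28 days is an assumption, since the text does not say whether stints were counted individually or cumulatively.

```python
FOUR_WEEKS_IN_DAYS = 28

def injury_flag(injury_stints_days):
    """1 if any single stint exceeded four weeks, else 0 (assumed encoding)."""
    return int(any(days > FOUR_WEEKS_IN_DAYS for days in injury_stints_days))

def build_feature_sets(player):
    """Return the three pitcher variable subsets compared in Table 8."""
    base = [player["ERA"], player["WHIP"], player["FIP"], player["WAR"]]
    return {
        "KBO": base,
        "KBO + Team ranking": base + [player["team_rank"]],
        "KBO + injury": base + [injury_flag(player["injury_stints_days"])],
    }

# Hypothetical pitcher with one short and one long injured-list stint
pitcher = {
    "ERA": 3.2, "WHIP": 1.15, "FIP": 3.5, "WAR": 4.1,
    "team_rank": 3,
    "injury_stints_days": [10, 35],
}
feature_sets = build_feature_sets(pitcher)
```

Each subset would then be passed to the same five classifiers, so that any change in the metrics can be attributed to the added variable.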

Table 8.

Prediction results with combined injury history of pitchers and KBO team ranking (accuracy: ACC; precision: PRE).

Subset of Variables
		KBO			KBO + Team ranking			KBO + injury
		ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE
Pitcher	LR	0.773	0.869	0.783	0.788	0.849	0.783	0.850	0.924	0.827
	RF	0.758	0.865	0.704	0.759	0.855	0.763	0.879	0.942	0.870
	XGBoost	0.744	0.754	0.683	0.698	0.758	0.640	0.924	0.869	0.96
	LightGBM	0.713	0.869	0.679	0.713	0.828	0.667	0.713	0.885	0.680
	CatBoost	0.744	0.874	0.690	0.729	0.837	0.677	0.924	0.959	0.920


Table 9.

Prediction results with combined batters’ KBO team ranking and image (accuracy: ACC; precision: PRE).

Subset of Variables
		KBO			KBO + Team ranking			KBO + Face
		ACC	AUC	PRE	ACC	AUC	PRE	ACC	AUC	PRE
Batter	LR	0.911	0.933	0.933	0.911	0.933	0.933	0.888	0.978	0.72
	RF	0.844	0.933	0.77	0.822	0.922	0.77	0.822	0.928	0.683
	XGBoost	0.800	0.894	0.783	0.822	0.901	0.833	0.822	0.944	0.75
	LightGBM	0.667	0.5	0	0.667	0.5	0	0.667	0.5	0
	CatBoost	0.800	0.922	0.75	0.800	0.922	0.75	0.844	0.933	0.783


3.7. Feature importance

Performance analysts and scouts at KBO rely on MLB and MiLB performances and their own insights to scout for foreign players. When predicting contract renewals, the MLB data yielded better prediction results than the MiLB data. The major difference between the MiLB and MLB data lies in the availability of tracking data. Tracking data refer to the records of player and ball movements tracked during games, which are primarily collected using technologies such as video, radar, and optical image sensors. These data are used to analyze player movements and strategies and to evaluate game performance [29]. Online platforms such as Baseball Savant have been providing advanced statistics, analyses, and visualization tools for MLB games since 2015, including access to tracking data. Such tracking data provide the latest performance metrics for players and serve as a valuable source for evaluating injury risk [30]. When predicting contract renewals for pitchers using MLB data, the XGBoost model exhibited the highest precision, whereas the CatBoost model exhibited a higher AUC than the other models. SHAP was used to explore the important features of these two models, as shown in Fig. 2.

For the XGBoost model, features such as exit velocity, ERA, angle, and XWOBA had high importance (Fig. 2(a)). For the CatBoost model, features such as exit velocity, WOBA, ERA, and balance had high importance (Fig. 2(b)). In both models, exit velocity emerged as the most important feature. Exit velocity is the speed, in miles per hour, at which the ball leaves the bat after the batter makes contact with the pitch. It is primarily used to evaluate the hitting power and skill of a batter: the harder the hit, the higher the exit velocity, and a high exit velocity typically means that the ball travels faster and farther. Therefore, for pitchers, a lower exit velocity on the balls hit against them is desirable.

Histograms (a)–(g) in Fig. 3 show the MLB exit velocity metric, by season, for foreign pitchers who were initially contracted in KBO. The white line in the middle of each histogram represents the season-wise average, whereas the red lines represent the values of the re-contracted foreign players. Numbers within parentheses indicate rankings. Histogram (h) depicts the MLB records for foreign players playing in KBO during the 2023 season, based on their performance in the 2022 MLB season.

Although the performances of some foreign pitchers (red lines in histograms) were higher than the MLB average (white lines in histograms) in certain seasons, no significant difference in season-wise performances was found. Overall, it is evident from the histograms that even when the exit velocity was lower than average, foreign pitchers were still able to secure contract renewals in KBO.

4. Discussion

In this study, we used accuracy, AUC, and precision as performance evaluation metrics. Precision is an evaluation metric commonly used in machine learning and statistics. It represents the proportion of samples predicted as positive by a classification model that are actually positive. Precision is calculated as follows:

Precision = TP / (TP + FP)

where TP refers to the number of true-positive cases (correctly predicted as positive), and FP refers to the number of false-positive cases (incorrectly predicted as positive). In this study, precision represents the proportion of players predicted to have their contracts renewed whose contracts were actually renewed. The actual contract renewal rates were 35 % for pitchers and 37 % for batters.
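The precision formula above can be written as a short function. This is a minimal sketch; the counts in the example are hypothetical, and returning 0.0 when no sample is predicted positive is one common convention rather than something specified in this study.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        # No sample was predicted positive; 0.0 is an assumed convention.
        return 0.0
    return tp / predicted_positive


# Hypothetical example: a model predicts renewal for 10 players,
# and 6 of them actually had their contracts renewed.
example = precision(tp=6, fp=4)
print(example)  # 0.6
```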

When using MiLB data, a prediction precision of 0.463 was achieved for pitchers using the XGBoost model; for batters, a precision of 0.507 was achieved using the LR model. These figures are approximately 11 and 13 percentage points higher, respectively, than the actual contract renewal rates. When the MLB data were used to predict contract renewal, the XGBoost model yielded a prediction precision of 0.463 for pitchers, whereas the LR model yielded a precision of 0.6 for batters. For batters, this is 23 percentage points higher than the actual contract renewal rate. When only image data were used, the prediction precision for pitchers was 0.580 with the CatBoost model, whereas it was 0.430 for batters with the XGBoost model. When the image and numerical data were merged, the XGBoost model achieved prediction precisions of 0.5 and 0.533 for pitchers and batters, respectively. For pitchers, this is approximately 4 percentage points higher than the precision obtained using only numerical data and 15 percentage points higher than the actual contract renewal rate. Overall, the CatBoost, XGBoost, and LR models performed well. With KBO data in particular, the LR model exhibited the highest performance. CatBoost trained considerably more slowly than LR, whereas LightGBM trained quickly but performed worse. Taken together, the XGBoost model offered a satisfactory balance of performance and training speed.
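The comparisons in this paragraph are differences between a model's precision and the actual contract renewal rate, that is, gains in percentage points over the base rate. The short sketch below reproduces that arithmetic from the values reported in the text; the dictionary layout and variable names are illustrative assumptions.

```python
# Actual contract renewal rates reported in the Discussion.
renewal_rate = {"pitcher": 0.35, "batter": 0.37}

# Best precision reported per data source and player type.
reported_precision = {
    ("MiLB", "pitcher"): 0.463,  # XGBoost
    ("MiLB", "batter"): 0.507,   # LR
    ("MLB", "pitcher"): 0.463,   # XGBoost
    ("MLB", "batter"): 0.6,      # LR
}

# Gain over the base rate, expressed as a fraction (0.113 = 11.3 points).
gains = {
    key: round(value - renewal_rate[key[1]], 3)
    for key, value in reported_precision.items()
}

for (source, role), gain in gains.items():
    print(f"{source} {role}: +{gain * 100:.1f} percentage points")
```

Expressed this way, a precision of 0.6 for batters means that 60 % of the batters flagged for renewal were actually renewed, compared with 37 % if players were flagged at random.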

In KBO, the salary of the newly recruited players is limited to a maximum of approximately $780,000 [31]. When a team releases a foreign player during a season, it incurs a roughly $2 million loss. Therefore, increasing the precision of prediction has significant economic benefits for the team.

Based on Fig. 3, the distribution of MLB performance among the re-contracted players was examined using SHAP. In the 2015 season, except for one player, all three re-contracted players had lower-than-average (better) records (Fig. 3(a)). This indicates that the foreign players who secured contract renewals in KBO were of a high caliber, which suggests that the level of play in KBO was high. Indeed, South Korea won the Premier 12 tournament in 2015. Pat Dean, who recorded a performance higher than the average in the 2016 MLB season (Fig. 3(b)), performed well in KBO in 2017 and secured contract renewal. In 2017, South Korea, which had players without outstanding records, was eliminated in the qualifying round of the World Baseball Classic, finishing 11th. In 2018, scouting was performed based on player performance in the 2017 MLB season. The 2017 season records of the foreign players who were re-contracted in KBO in 2018 were not significantly lower than or different from the average (Fig. 3(c)). Except for Seth Frankoff (1), whose evaluation as a player was uncertain because he recorded only 37 pitches in the 2017 season, this suggests a high level of play in KBO. Indeed, South Korea won the Asian Games in 2018. From 2018 onward, a significant number of foreign pitchers with a lower-than-average level were among those who were re-contracted (Fig. 3(d)–(g)). In 2020, most players re-contracted in KBO had scores higher than the 2019 MLB average (Fig. 3(e)). This suggests that the level of KBO was relatively low in this season. In fact, during the 2020 Tokyo Olympics, South Korea finished fourth among the six participating countries. The record of Felix Pena (Fig. 3(g)) cannot be considered favorable even though his value was lower than average, because he threw only 65 pitches in the 2021 season. Fig. 3(h) depicts the 2022 MLB performance of foreign pitchers recruited in KBO for the 2023 season. As of July 2023, at the end of the first half of the KBO season, Eric Fedde (4) held the top position in wins and ERA and the second position in WHIP [3]. Thus, from lower-ranked MLB players holding top positions in KBO, one can infer that the level of KBO is declining. This is consistent with South Korea finishing 12th in the World Baseball Classic held in March 2023 and failing to advance for the third consecutive time. Therefore, this post-hoc analysis suggests that the important features used in predicting contract renewal decisions are related to performance in international competitions (for example, whether the team advances to the main tournament).

As seen in Figs. 2 and 3, the exit velocity metric has a significant impact on contract renewal decisions and can also be used to infer the baseball level of the respective country. Exit velocity is a metric that evaluates the pure abilities of pitchers, independent of the team's influence. Therefore, it can be applied in foreign baseball leagues when recruiting foreign players. In MLB, exit velocity is considered a factor in limiting hard contact and is regarded as an excellent pitching evaluation metric [32]. This study illustrates the practical value of such tracking data, which mark the beginning of a new era in baseball fandom [33].

5. Conclusions

In this study, we used MLB, MiLB, KBO, and image data to predict the contract renewal of foreign players. We used accuracy, AUC, and precision as evaluation metrics, focusing on precision during the analysis. For the MLB and MiLB data, we standardized the data by season during preprocessing because the means and variances differ across seasons. We used the LR, RF, XGBoost, LightGBM, and CatBoost classification models with a fixed random seed and performed fivefold cross-validation. The results showed that using only MiLB data resulted in lower accuracy and AUC. However, precision for both pitchers and batters was more than 11 percentage points higher than the actual contract renewal rates. When using MLB data, although accuracy and AUC were still low, a precision of 0.6 was achieved for batters after season-based standardization, which was 0.23 higher than the actual contract renewal rate. For batters, when the numerical and image data were merged, the XGBoost model achieved an accuracy, AUC, and precision of 0.722, 0.742, and 0.533, respectively, an improvement of 0.15, 0.27, and 0.13, respectively, over XGBoost using only numerical data. Through exploratory data analysis using SHAP, exit velocity was found to be an important feature for predicting contract renewal. This metric can also be used to predict the performance of players in international baseball competitions. In addition, including pitchers' injury history substantially improved prediction performance. Although the small amount of available data is a limitation, the insights obtained from this study can help teams scouting for foreign players reduce scouting failures and financial losses.

Ethics declarations

Informed consent was not required for this study because no human or animal subjects or samples were involved in the study.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1A2C1092808). This study was also supported by the Soonchunhyang University Research Fund.

Data availability statement

Has data associated with your study been deposited into a publicly available repository?

Yes. All data from this study are in the article, code, supplementary material, or referenced. Datasets are available on GitHub at https://github.com/Ptaeshin/Prediction-of-Re-signing-Foreign-Players/tree/master or upon request from the author.

CRediT authorship contribution statement

Taeshin Park: Writing – review & editing, Writing – original draft, Visualization, Software, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Jaeyun Kim: Writing – review & editing, Validation, Supervision, Resources, Project administration, Funding acquisition, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A.

Table 10 lists the MiLB and MLB data used in this study to evaluate the performance of pitchers for the prediction of their contract renewal [6], [7], [8].

Table 10.

Description of MiLB and MLB pitcher data used for performance evaluation.


	Metric	Abbreviation	Description
MiLB	Wins	W	Number of wins
	Losses	L	Number of losses
	Saves	SV	Number of saves
	Games	G	Number of games in which the pitcher appeared
	Games Started	GS	Number of games the pitcher started
	Innings Pitched	IP	Number of total innings pitched (.1 represents 1/3 of an inning, .2 represents 2/3 of an inning)
	Strikeouts per 9 innings	K/9	Average number of strikeouts per 9 innings
	Walks per 9 innings	BB/9	Average number of walks per 9 innings
	Home Runs per 9 innings	HR/9	Average number of home runs allowed per 9 innings
	Batting Average on Balls in Play	BABIP	The rate at which the pitcher allows a hit when the ball is put in play, calculated as (H-HR)/(AB-K-HR + SF)
	Left On-Base Percentage	LOB %	Percentage of pitcher's own base runners that they strand over the course of a season. Not equal to the LOB column in the box score
	Ground Ball Percentage	GB %	Percentage of a pitcher's balls in play that are ground balls, calculated as GB/BIP
	Home Run to Fly Ball Rate	HR/FB	Percentage of a pitcher's fly balls that go for home runs, calculated as HR/FB (even though some HR are line drives)
	Earned Run Average	ERA	Average number of earned runs a pitcher allows per 9 innings: ((ER*9)/IP)
	Fielding Independent Pitching	FIP	Estimate of a pitcher's ERA based on strikeouts, walks/HBP, and home runs allowed, assuming league average results on balls in play
	Expected Fielding Independent Pitching	xFIP	Estimate of a pitcher's ERA based on strikeouts, walks/HBP, and fly balls allowed, assuming league average results on balls in play and home run to fly ball ratio
	Strikeout to Walk Ratio	K/BB	Strikeouts divided by walks
	Strikeout Percentage	K%	Frequency with which the pitcher has struck out a batter, calculated as strikeouts divided by total batters faced
	Walk Percentage	BB%	Frequency with which the pitcher has issued a walk, calculated as walks divided by total batters faced
	Strikeout Percentage minus Walk Percentage	K-BB%	Percentage differential between K% and BB%, often a better indicator of performance than K/BB, which can be skewed by very low walk rates
	Batting Average Against	AVG	Rate of hits allowed per at bat, calculated as H/AB
	Walks Plus Hits per Inning Pitched	WHIP	Average number of base runners allowed via hit or walk per inning
	Ground Ball to Fly Ball Ratio	GB/FB	Ratio of ground balls a pitcher allows to fly balls, calculated as GB/FB
	Line Drive Percentage	LD%	Percentage of a pitcher's balls in play that are line drives, calculated as LD/BIP
	Fly Ball Percentage	FB%	Percentage of a pitcher's balls in play that are fly balls, calculated as FB/BIP
	Infield Fly Ball Percentage	IFFB%	Percentage of a pitcher's fly balls that were infield fly balls, calculated as IFFB/FB
	Pull Percentage	Pull%	Percentage of batted balls hit to the pull field
	Center Percentage	Cent%	Percentage of batted balls hit to the middle of the field
	Opposite Field Percentage	Oppo%	Percentage of batted balls hit to the opposite field
MLB	Age	Age	Player's age at a specific point in time.
	Pitches	Pitches	Number of total pitches thrown
	Balls	Balls	Number of total balls thrown
	Exit Velocity	Exit Velocity	How fast, in miles per hour, a batter hit a ball
	Maximum Exit Velocity	Max EV	Highest recorded speed at which a batted ball leaves the bat
	Launch Angle	Angle	How high/low, in degrees, a batter hit a ball
	Sweet Spot	Spot %	Percentage of batted-ball events with a launch angle between 8° and 32°
	Barrels	Barrels	Batted ball with the perfect combination of exit velocity and launch angle
	Barrel per Plate Appearance	Barrel/PA	Average number of barrels a batter produces per plate appearance.
	Expected Batting Average	XBA	Likelihood that a batted ball will become a hit
	Expected Slugging Percentage	XSLG	Estimates slugging percentage a player is expected to have based on the quality and outcomes of their batted balls
	Weighted On-Base Average	WOBA	Combined different hitting outcomes with weighted values to assess a player's overall offensive contribution
	Expected Weighted On-base Average	XWOBA	Formulated using exit velocity, launch angle and, on certain types of batted balls, Sprint Speed
	Expected Weighted On-Base Average on Contact	XWOBACON	Estimates weighted on-base average a player is expected to have solely on balls put into play
	Hard Contact Percentage	HardHit %	Percentage of hard-hit batted balls
	Strikeout Percentage	K%	Frequency with which the pitcher has struck out a batter, calculated as strikeouts divided by total batters faced
	Walk Percentage	BB%	Frequency with which the pitcher has issued a walk, calculated as walks divided by total batters faced
	Earned Run Average	ERA	Average number of earned runs a pitcher allows per 9 innings. ((ER*9)/IP)
	Expected Earned Run Avg	xERA	Simple 1:1 translation of xwOBA, converted to the ERA scale
KBO	Earned Run Average	ERA	Average number of earned runs a pitcher allows per 9 innings. ((ER*9)/IP)
	Walks Plus Hits per Inning Pitched	WHIP	Average number of base runners allowed via hit or walk per inning
	Fielding Independent Pitching	FIP	Estimate of a pitcher's ERA based on strikeouts, walks/HBP, and home runs allowed, assuming league average results on balls in play
	Wins Above Replacement	WAR	Estimates the number of wins a player has been worth to his team compared to a freely available player such as a minor league free agent based on FIP
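Two of the formulas in Table 10 are simple enough to sketch directly: ERA, given as (ER*9)/IP, and WHIP, walks plus hits per inning pitched. The sketch below also converts the innings-pitched notation described in the table (.1 for 1/3 of an inning, .2 for 2/3) into a plain number before dividing. The function names and the sample season line are hypothetical.

```python
def innings_to_float(ip_notation: str) -> float:
    """Convert innings-pitched notation (e.g. '6.1' = 6 1/3) to innings."""
    whole, _, frac = ip_notation.partition(".")
    thirds = int(frac) if frac else 0
    if thirds not in (0, 1, 2):
        raise ValueError(f"invalid innings notation: {ip_notation!r}")
    return int(whole) + thirds / 3


def era(earned_runs: int, innings: float) -> float:
    """Earned runs allowed per 9 innings: (ER * 9) / IP."""
    return earned_runs * 9 / innings


def whip(walks: int, hits: int, innings: float) -> float:
    """Walks plus hits allowed per inning pitched."""
    return (walks + hits) / innings


# Hypothetical season line: 180.0 IP, 60 ER, 50 BB, 130 H.
ip = innings_to_float("180.0")
print(era(60, ip))        # 3.0
print(whip(50, 130, ip))  # 1.0
```

Passing the innings as a string avoids treating "6.1" as the decimal 6.1, which would slightly understate the innings pitched and inflate both rates.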


Appendix B.

Table 11 lists the MiLB and MLB data used in this study to evaluate the performance of batters for the prediction of their contract renewal [6], [7], [8].

Table 11.

Description of MiLB and MLB batter data used for performance evaluation.


	Metric	Abbreviation	Description
MiLB	Games	G	Number of games in which the player has appeared
	Plate Appearances	PA	Number of times the player has come to the plate
	At Bats	AB	Number of trips to the plate in which the batter does not walk, get hit by a pitch, sacrifice (fly or bunt), or reach on interference
	Runs Scored	R	Number of runs scored
	Hits	H	Number of hits
	Doubles	2B	Number of doubles
	Triples	3B	Number of triples
	Home Runs	HR	Number of home runs
	Runs Batted In	RBI	Number of times a run scores as a result of a batter's plate appearance, not counting situations in which an error caused the run to score, or the batter hit into a double play
	Stolen Bases	SB	Number of stolen bases
	Caught Stealing	CS	Number of times caught stealing
	Walks	BB	Total number of walks (includes IBB)
	Strikeouts	SO	Number of strikeouts
	Batting Average	AVG	Rate of hits per at bat, calculated as H/AB
	On Base Percentage	OBP	Rate at which the batter reaches base, calculated as (H + BB + HBP)/(AB + BB + HBP + SF)
	Slugging Percentage	SLG	Average number of total bases per at bat, calculated as Total Bases/AB
	On Base Plus Slugging	OPS	Combination of OBP and SLG, calculated as OBP + SLG
	Grounded into Double Play	GDP	Number of times the batter hit into a double play
	Hit By Pitches	HBP	Number of times the batter reached after being hit by a pitch
	Sacrifice Bunts	SH	Any bunt in which there was a runner on base and less than two outs in which the batter was put out and at least one runner advanced
	Sacrifice Flies	SF	Number of times a batter's fly out allowed a runner to tag up and score
	Intentional Walks	IBB	Number of times the batter was intentionally walked
MLB	Age	Age	Player's age at a specific point in time.
	Pitches	Pitches	Number of total pitches thrown
	Batted Balls	Batted Balls	Balls hit by a batter during an at-bat that are put into play, excluding foul balls, bunts, and certain other types of non-playable contact.
	Barrels	Barrels	Batted ball with the perfect combination of exit velocity and launch angle
	Barrel percentage	Barrel %	Percentage of a batter's batted-ball events that are classified as barrels
	Barrel per Plate Appearance	Barrel/PA	Average number of barrels a batter produces per plate appearance.
	Exit Velocity	Exit Velocity	How fast, in miles per hour, a batter hit a ball
	Maximum Exit Velocity	Max EV	Highest recorded speed at which a batted ball leaves the bat
	Launch Angle	Launch Angle	Angle at which the ball leaves the bat after a batter's swing
	Sweet Spot percentage	Sweet Spot %	Percentage of batted-ball events with a launch angle between 8° and 32°
	Expected Batting Average	XBA	Likelihood that a batted ball will become a hit
	Expected Slugging Percentage	XSLG	Estimates the slugging percentage a player is expected to have based on the quality and outcomes of their batted balls
	Weighted On Base Average	WOBA	Combines all the different aspects of hitting into one metric, weighting each of them in proportion to their actual run value
	Expected Weighted On-Base Average on Contact	XWOBACON	Estimates the weighted on-base average a player is expected to have solely on balls put into play
	Hard Contact Percentage	HardHit %	Percentage of hard-hit batted balls
	Strikeout Percentage	K%	Frequency with which the batter has struck out, calculated as strikeouts divided by plate appearances
	Walk Percentage	BB%	Frequency with which the batter has walked, calculated as walks divided by plate appearances
KBO	Age	Age	Player's age at a specific point in time.
	Games Played	G	Number of games in which the player has appeared
	Plate Appearances	PA	Number of times the player has come to the plate
	At Bats	AB	Number of trips to the plate in which the batter does not walk, get hit by a pitch, sacrifice (fly or bunt), or reach on interference
	Runs Scored	R	Number of runs scored
	Hits	H	Number of hits
	Doubles	2B	Number of doubles
	Triples	3B	Number of triples
	Home Runs	HR	Number of home runs
	Slugging	Slugging	Total number of bases a batter accumulates per at-bat
	Runs Batted In	RBI	Number of times a run scores as a result of a batter's plate appearance, not counting situations in which an error caused the run to score, or the batter hit into a double play
	Stolen Bases	SB	Number of stolen bases
	Caught Stealing	CS	Number of times caught stealing
	Walks	BB	Total number of walks
	Hit By Pitches	HBP	Number of times the batter reached after being hit by a pitch
	Intentional Walks	IBB	Number of times the batter was intentionally walked
	Strikeouts	SO	Number of strikeouts
	Grounded into Double Play	GDP	Number of times the batter hit into a double play
	Sacrifice Flies	SF	Number of times a batter's fly out allowed a runner to tag up and score
	Sacrifice Bunts	SH	Any bunt in which there was a runner on base and less than two outs in which the batter was put out and at least one runner advanced
	Batting Average	AVG	Rate of hits per at bat, calculated as H/AB
	On Base Percentage	OBP	Rate at which the batter reaches base, calculated as (H + BB + HBP)/(AB + BB + HBP + SF)
	Slugging Percentage	SLG	Average number of total bases per at bat, calculated as Total Bases/AB
	On Base Plus Slugging	OPS	Combination of OBP and SLG, calculated as OBP + SLG
	Weighted On Base Average	wOBA	Combines all the different aspects of hitting into one metric, weighting each of them in proportion to their actual run value
	Weighted Runs Created Plus	wRC+	Most comprehensive rate statistic used to measure hitting performance because it considers the varying weights of each offensive action (like wOBA) and then adjusts them for the park and league context in which they took place
	Wins Above Replacement	WAR	Comprehensive statistic that estimates the number of wins a player has been worth to his team compared to a freely available player such as a minor league free agent based on FIP
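The rate statistics in Table 11 follow directly from the counting statistics listed alongside them. The sketch below implements AVG, OBP, SLG, and OPS exactly as defined in the table; the total-bases helper uses the standard weighting (1 for a single, 2 for a double, 3 for a triple, 4 for a home run), and the season line is hypothetical.

```python
def avg(h: int, ab: int) -> float:
    """Batting average: H / AB."""
    return h / ab


def obp(h: int, bb: int, hbp: int, ab: int, sf: int) -> float:
    """On-base percentage: (H + BB + HBP) / (AB + BB + HBP + SF)."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)


def total_bases(h: int, doubles: int, triples: int, hr: int) -> int:
    """Total bases, with singles derived from hits minus extra-base hits."""
    singles = h - doubles - triples - hr
    return singles + 2 * doubles + 3 * triples + 4 * hr


def slg(h: int, doubles: int, triples: int, hr: int, ab: int) -> float:
    """Slugging percentage: Total Bases / AB."""
    return total_bases(h, doubles, triples, hr) / ab


# Hypothetical season line (illustrative only).
line = dict(ab=500, h=150, doubles=30, triples=5, hr=25, bb=50, hbp=5, sf=5)

batting_avg = avg(line["h"], line["ab"])
on_base = obp(line["h"], line["bb"], line["hbp"], line["ab"], line["sf"])
slugging = slg(line["h"], line["doubles"], line["triples"], line["hr"], line["ab"])
ops = on_base + slugging  # OPS = OBP + SLG

print(f"AVG {batting_avg:.3f}  OBP {on_base:.3f}  SLG {slugging:.3f}  OPS {ops:.3f}")
```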


References

1. Surging interest in Korean baseball in the United States... ESPN and KBO live Broadcast. Available online: https://www.voakorea.com/a/korea_korea-life_aisasociety-baseball/6032559.html
2. STATIZ. Available online: http://www.statiz.co.kr/stat.php
3. KBO Homepage. Available online: https://www.koreabaseball.com/TeamRank/TeamRank.aspx
4. Elitzur R. Data analytics effects in major league baseball. Omega. 2020;90.
5. Fialho G., Manhães A., Teixeira J.P. Predicting sports results with artificial intelligence–a proposal framework for soccer games. Proc. Comput. Sci. 2019;164:131–136.
6. MLB Stats, Scores, History, & Records | Baseball-Reference.com. Available online: https://www.baseball-reference.com/ (Accessed 12 July 2023).
7. FanGraphs baseball | baseball statistics and analysis. Available online: https://www.fangraphs.com/
8. Savant Baseball, Players Trending MLB. Statcast and Visualizations | baseballsavant.com. Available online: https://baseballsavant.mlb.com/
9. Professional Baseball Transactions Archive. Available online: https://www.prosportstransactions.com/baseball/index.htm
10. Huang M.-L., Li Y.-Z. Use of machine learning and deep learning to predict the outcomes of major league baseball matches. Appl. Sci. 2021;11:4499.
11. Yaseen A.S., Marhoon A.F., Saleem S.A. Multimodal machine learning for major league baseball playoff prediction. Informatica. 2022;46.
12. Valero C.S. Predicting Win-Loss outcomes in MLB regular season games–A comparative study using data mining methods. Int. J. Comput. Sci. Sport. 2016;15:91–112.
13. Elfrink T. Predicting the Outcomes of MLB Games with a Machine Learning Approach. Vrije Universiteit Amsterdam; 2018.
14. Horvat T., Job J. The use of machine learning in sport outcome prediction: a review. Wiley Interdiscip. Rev.: Data Min. Knowl. Disc. 2020;10.
15. Osken C., Onay C. Predicting the winning team in basketball: a novel approach. Heliyon. 2022;8. doi: 10.1016/j.heliyon.2022.e12189.
16. Wright R.E. Logistic regression. In: Grimm L.G., Yarnold P.R., editors. Reading and Understanding Multivariate Statistics. American Psychological Association; 1995. pp. 217–244.
17. Zhang C., Ma Y. Ensemble Machine Learning: Methods and Applications. Springer; 2012.
18. Chen T., Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. pp. 785–794.
19. Ke G., Meng Q., Finley T., Wang T., Chen W., Ma W., Ye Q., Liu T.-Y. LightGBM: a highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017;30.
20. Prokhorenkova L., Gusev G., Vorobev A., Dorogush A.V., Gulin A. CatBoost: unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 2018;31.
21. Parkhi O., Vedaldi A., Zisserman A. Deep face recognition. Proceedings of the British Machine Vision Conference. 2015.
22. Huang G.B., Mattar M., Berg T., Learned-Miller E. Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Proceedings of the Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition. 2008.
23. Guo Y., Zhang L., Hu Y., He X., Gao J. MS-Celeb-1M: a dataset and benchmark for large-scale face recognition. Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part III. 2016. pp. 87–102.
24. Girdhar R., El-Nouby A., Liu Z., Singh M., Alwala K.V., Joulin A., Misra I. ImageBind: one embedding space to bind them all. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. pp. 15180–15190.
25. Liu H., Ding H., Xuan J., Gao X., Huang X. The functional movement screen predicts sports injuries in Chinese college students at different levels of physical activity and sports performance. Heliyon. 2023;9. doi: 10.1016/j.heliyon.2023.e16454.
26. Trovato B., Petrigna L., Sortino M., Roggio F., Musumeci G. The influence of different sports on cartilage adaptations: a systematic review. Heliyon. 2023;9. doi: 10.1016/j.heliyon.2023.e14136.
27. Park T., Kim J. A predictive model for a contract renewal of foreign pitchers in KBO using machine learning. Korean Data Inf. Sci. Soc. 2022;33:963–976.
28. Shanley E., Kissenberth M.J., Thigpen C.A., Bailey L.B., Hawkins R.J., Michener L.A., Tokish J.M., Rauh M.J. Preseason shoulder range of motion screening as a predictor of injury among youth and adolescent baseball pitchers. J. Shoulder Elbow Surg. 2015;24:1005–1013. doi: 10.1016/j.jse.2015.03.012.
29. Statcast Search | baseballsavant.com. Available online: https://baseballsavant.mlb.com/statcast_search
30. Pollack K.M., D'Angelo J., Green G., Conte S., Fealy S., Marinak C., McFarland E., Curriero F.C. Developing and implementing Major League Baseball's health and injury tracking system. Am. J. Epidemiol. 2016;183:490–496. doi: 10.1093/aje/kwv348.
31. KBO Homepage. Available online: https://www.koreabaseball.com/News/Notice/View.aspx?bdSe=8542
32. Exit velocity (EV) | Glossary | MLB.com. Available online: https://www.mlb.com/glossary/statcast/exit-velocity
33. Statcast | Glossary | MLB.com. Available online: https://www.mlb.com/glossary/statcast

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Has data associated with your study been deposited into a publicly available repository?
