. 2022 Oct 18;34(28):e7393. doi: 10.1002/cpe.7393

Analyzing the effect of data preprocessing techniques using machine learning algorithms on the diagnosis of COVID‐19

Gizemnur Erol 1, Betül Uzbaş 2, Cüneyt Yücelbaş 3, Şule Yücelbaş 4
PMCID: PMC9874401  PMID: 36714180

Summary

Real‐time polymerase chain reaction (RT‐PCR), known as the swab test, is a diagnostic test that can diagnose COVID‐19 disease from respiratory samples in the laboratory. Due to the rapid spread of the coronavirus around the world, the RT‐PCR test has become insufficient for obtaining fast results. For this reason, the need for diagnostic methods to fill this gap has arisen, and machine learning studies have started in this area. On the other hand, studying medical data is challenging because the data it contains is inconsistent, incomplete, difficult to scale, and very large. Additionally, some poor clinical decisions, irrelevant parameters, and limited medical data adversely affect the accuracy of the studies performed. Therefore, considering that datasets containing COVID‐19 blood parameters are fewer in number than other medical datasets today, this study aims to improve these existing datasets. In this direction, to obtain more consistent results in COVID‐19 machine learning studies, the effect of data preprocessing techniques on the classification of COVID‐19 data was investigated. First, categorical feature encoding and feature scaling were applied to a dataset with 15 features containing the blood data of 279 patients, including gender and age information. Then, the missing values of the dataset were imputed using both the K‐nearest neighbor (KNN) algorithm and multivariate imputation by chained equations (MICE). Data balancing was performed with the synthetic minority oversampling technique (SMOTE). The effect of these data preprocessing techniques on the ensemble learning algorithms bagging, AdaBoost, and random forest, and on the popular classifiers KNN classifier, support vector machine, logistic regression, artificial neural network, and decision tree was analyzed.
The highest accuracies obtained with the bagging classifier by applying SMOTE were 83.42% and 83.74% with KNN and MICE imputation, respectively. On the other hand, the highest accuracy ratio reached with the same classifier without SMOTE was 83.91%, for the KNN imputation. In conclusion, certain data preprocessing techniques are examined comparatively, their effect on success is presented, and the importance of the right combination of data preprocessing steps in achieving success is demonstrated through experimental studies.

Keywords: COVID‐19, KNN imputation, machine learning, multivariate imputation by chained equation, synthetic minority oversampling technique

1. INTRODUCTION

COVID‐19 is a pandemic disease that has taken hold of the world and can spread very quickly. The disease first appeared in Wuhan, China in December 2019 and was detected in more than 200 countries within 4 months. It can cause permanent illness and even death through serious effects in people 1 with chronic conditions such as respiratory tract disease, diabetes, cancer, and heart disease. According to the World Health Organization's July 2022 report, 2 approximately 549 million confirmed cases and more than 6 million deaths have occurred since the outbreak of the pandemic. Despite about 2 years of worldwide familiarity with COVID‐19, it is still not known exactly how to deal with this pandemic disease. For this reason, hundreds of millions of people have been quarantined to prevent the spread of infection, and some dynamic service sectors that bring many people together have been stopped. However, the spread cannot be stopped, because COVID‐19 symptoms, which often manifest asymptomatically, 3 do not clearly distinguish positive from negative individuals. Another main reason the spread cannot be stopped is that the virus spreads more easily by undergoing many genomic variant changes. 4 Looking at the point reached in the fight against COVID‐19 today, the biggest development is the discovery of vaccines for the disease. Although the developed vaccines appear to alleviate the effects of the disease in infected individuals, the virus's many genomic variants mean they cannot prevent its spread. Therefore, since there is no definitive cure for COVID‐19 in the short term, the main goal has been to reduce the infectivity of the disease. The primary step in reducing contagiousness is to identify infected individuals and to obtain rapid results from the detection studies carried out for this purpose.
COVID‐19 detection methods currently used for identifying infected individuals are real‐time polymerase chain reaction (RT‐PCR), computed tomography, and certain blood parameters. 5, 6, 7 RT‐PCR, commonly known as the swab test, plays an active role in identifying COVID‐19 carriers who experience no symptoms related to the virus (asymptomatic cases) and is the most widely used method for diagnosing COVID‐19 disease from respiratory samples in the laboratory. 5 Due to the rapid spread of the virus worldwide, there has been high demand for RT‐PCR tests. This increase in demand has exposed many limitations of the RT‐PCR method, such as an average diagnosis time of 2–3 h and the need for certified laboratories, trained personnel, and expensive equipment. 6 For this reason, with the demand for COVID‐19 detection tests increasing day by day, an urgent need for faster alternative diagnosis methods has arisen, and studies of artificial intelligence‐based disease detection systems in the field of machine learning have gained momentum. These systems are expected to produce a preliminary assessment of the disease when given patients' clinical and examination findings as input. Using this evaluation, experts in health institutions can make faster and more accurate diagnoses and initiate the most appropriate treatment protocols for patients. Additionally, blood parameters have important clinical value for infectious diseases. 8, 9 Therefore, blood parameters are important data used by medical professionals in diagnosing COVID‐19. Among studies carried out for this purpose, Sun et al. statistically examined how blood values such as leukocytes, monocytes, and platelets differ in the diagnosis of COVID‐19. 10 Guan et al. analyzed the clinical blood data of 1000 COVID‐19 patients and identified anomalies in the blood values of infected individuals. 11 Göreke et al. developed a COVID‐19 detection model considering genetic differences in blood values. 12 In addition, many blood parameter values such as white blood cells, platelets, and C‐reactive protein of infected and uninfected individuals have been analyzed comparatively. 13 Studies have shown that blood values can reflect ethnic and genetic differences; therefore, the use of blood parameters in machine learning models developed for COVID‐19 detection can be a determining factor in the diagnosis of the disease.

Today, it is possible to access both unidimensional and multidimensional data on many diseases, including COVID‐19, thanks to the ability to store large amounts of information on the network. With data mining studies, these data can be processed effectively and rendered meaningful. 14 Data mining is a method of reaching effective information by obtaining high‐level information from low‐level data. To achieve successful results in data mining studies, where artificial intelligence is also an active tool, the desired goal and the success criterion for achieving it must be determined clearly. In addition, the preparation of the data plays an active role in increasing the success of these studies. 15 Therefore, data mining is an indispensable step for artificial intelligence studies. As is known, medical datasets are difficult to work with because the data they contain is imprecise, inconsistent, incomplete, and very large. 16 In addition, some poor clinical decisions, the instability of irrelevant parameters, and the limited availability of these datasets adversely affect the accuracy of studies with medical data. 17 Therefore, data mining is employed to develop intelligent medical tools. In this way, researchers in the field of medicine have begun to obtain effective success rates and predictions from the medical data they study with the help of data mining methods. 18

Some data‐related factors reduce classification performance in artificial intelligence‐based studies in the field of machine learning. These include missingness, class imbalance, repetitive data, and outliers, all often found in datasets. Therefore, many data preprocessing techniques have been developed to increase classification performance in machine learning studies. 19 Data preprocessing is one of the major steps in data mining, and it is especially important for medical datasets because of their particular characteristics. In medical data, much irrelevant data affects accuracy by distorting the information. 20 This is why preprocessing is a critical step before training on medical data. However, each dataset is different, and no single preprocessing technique succeeds on all datasets. Therefore, the most successful preprocessing steps for a dataset are found by trial and comparison. 21 In this article, as a solution to the need for a rapid diagnosis method emerging from the increasing demand for diagnostic tests, other COVID‐19 diagnostic studies conducted on the blood tests of infected and noninfected individuals were also examined, and the impact of certain data preprocessing techniques on a machine learning model that detects COVID‐19 infection was analyzed comparatively. In addition, the study aimed to improve these datasets to increase the usability of datasets containing COVID‐19 blood parameters, which are less common than other medical datasets. For this purpose, more consistent results were obtained in COVID‐19 machine learning studies with these improved limited datasets, and the effect of certain data preprocessing techniques on the classification of COVID‐19 data is presented. The process of the realized working model is presented in Figure 1.

FIGURE 1.

CPE-7393-FIG-0001-b

Flow diagram of the study

As can be seen in Figure 1, encoding of categorical features and feature scaling were first applied to the dataset consisting of the blood parameters of the patients. Later, the missing values of the dataset were imputed using both the KNN and MICE methods. Data balancing was performed with the synthetic minority oversampling technique (SMOTE), and classification studies were carried out with various algorithms. The effect of the data preprocessing techniques on these classification algorithms was then analyzed.

Although certain data preprocessing steps were applied to the datasets used in other machine learning‐based COVID‐19 studies 5, 13, 22, 23 examined in the literature, the effect of data preprocessing techniques on success was not examined; only the features contained in the dataset and the classifier models were evaluated for model success. Different from the literature, this study has two main aims:

  1. To design a machine learning model that makes positive–negative classification of COVID‐19 data, but also to reveal the most successful classifier model by trying certain data preprocessing combinations on COVID‐19 data.

  2. To propose an integrated dataset‐data preprocessing combination for COVID‐19 studies to be carried out with datasets containing blood parameters.

This article is organized as follows: the COVID‐19 dataset and preprocessing methods used in this study are mentioned in Section 2. Experimental studies and hypothesis testing are given in Section 3. The outcomes of the experiments are discussed in Section 4, additionally; this section presents the comparison of this work with similar studies. The threats to the validity of our research were discussed in Section 5. Finally, in Section 6, the results are summarized, and the article is concluded by suggesting some additions to the research.

2. METHODS

The dataset used in this study 5 consists of the information of 279 patients who were admitted to San Raffaele Hospital in Milan, Italy between February and March 2020 and whose blood samples were taken. In addition to the age and gender of each patient, the dataset includes certain values from the patients' routine blood tests and their RT‐PCR (swab test) results, of which 177 are positive and 102 are negative. The features in the dataset are presented in Table 1 together with the parameter information obtained from the patients' routine blood tests.

TABLE 1.

Feature and data types in the dataset 5

Feature Data type
Gender Categorical
Age Numerical (Discrete)
Leukocytes (WBC) Numerical (Continuous)
Platelets Numerical (Continuous)
C‐reactive protein (CRP) Numerical (Continuous)
Transaminases (AST) Numerical (Continuous)
Transaminases (ALT) Numerical (Continuous)
Alkaline phosphatase (ALP) Numerical (Continuous)
Gamma‐glutamyl transferase (GGT) Numerical (Continuous)
Lactate dehydrogenase (LDH) Numerical (Continuous)
Neutrophils Numerical (Continuous)
Lymphocytes Numerical (Continuous)
Monocytes Numerical (Continuous)
Eosinophils Numerical (Continuous)
Basophils Numerical (Continuous)
Swab Categorical

2.1. Data preprocessing

Some factors reduce classification performance in artificial intelligence‐based diagnostic systems built on biomedical datasets. For this reason, various data preprocessing techniques suited to the datasets are used to increase classification performance. A data analysis depends directly on both the data preprocessing step and the techniques chosen for it. 24 Since the importance of data preprocessing is so evident, it is very important to find the most suitable data preprocessing techniques for the study to be carried out. In this study, carried out in this direction, certain data preprocessing techniques were applied to the blood tests of infected and noninfected individuals, and the effect of these techniques on the diagnosis of COVID‐19 was examined.

In this study, the following preprocessing steps were applied to the dataset: encoding of categorical values with one‐hot encoding, min‐max feature scaling to the range [0, 1], imputation of missing data using the KNN and MICE methods, and data balancing with the SMOTE method.

2.1.1. Encoding categorical values

Machine learning algorithms do not work directly on categorical data; models can only work correctly once categorical data is transformed into numerical equivalents. In this study, the categorical gender feature in the dataset was transformed into a binary number representation using the one‐hot encoding 5, 25, 26 method. Since only two gender categories, male and female, need to be distinguished, this information can be represented with this coding as shown in Figure 2.

FIGURE 2.

CPE-7393-FIG-0002-b

Graphical representation one‐hot encoding for this study

A useful property of one‐hot encoding is that categories are represented as independent concepts: the inner product between any two category vectors is zero, and the vectors are equidistant from each other in Euclidean space. 27
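The paper does not publish its code; as a minimal sketch (category names assumed for illustration), one-hot encoding of the gender feature and the orthogonality property can be expressed as:

```python
# Minimal sketch of one-hot encoding for the binary "gender" feature.
# The category labels here are illustrative, not taken from the dataset.

def one_hot(value, categories):
    """Return a one-hot vector for `value` over the ordered `categories`."""
    return [1 if value == c else 0 for c in categories]

genders = ["female", "male"]
female_vec = one_hot("female", genders)  # [1, 0]
male_vec = one_hot("male", genders)      # [0, 1]

# The two vectors are orthogonal (inner product 0), so neither
# category is implicitly ranked above the other.
dot = sum(a * b for a, b in zip(female_vec, male_vec))
```

The zero inner product is what makes the representation safe for distance-based methods such as the KNN imputer used later.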

2.1.2. Feature scaling

Each feature in a dataset is an input variable and can have a different scale (unit). These scale differences between input variables can cause various problems in the machine learning model to be created. The purpose of feature scaling is to minimize such problems by ensuring that all features are at approximately the same scale. As a consequence of feature scaling, each feature is made equally important and easy for machine learning algorithms to process. The features and their units in the dataset used in the study are shown in Table 2.

TABLE 2.

Features and units in the dataset 13

Feature Unit
WBC ×10⁹ cells/L
Neutrophils ×10⁹ cells/L
Lymphocytes ×10⁹ cells/L
Monocytes ×10⁹ cells/L
Eosinophils ×10⁹ cells/L
Basophils ×10⁹ cells/L
Platelets ×10⁹ cells/L
CRP mg/L
AST U/L
ALT U/L
ALP U/L
GGT U/L
LDH U/L

As can be understood from Table 2, not all features have the same scale. Therefore, in this study, feature scaling was performed with the min‐max scaler method in the range [0, 1]. This process is given by Formula (1), where X_new is the rescaled value, X_i is the original value, and min(X) and max(X) are the minimum and maximum values of the feature, respectively.

X_new = (X_i − min(X)) / (max(X) − min(X)). (1)

An important point to mention here is that feature scaling is performed before the missing‐data imputation preprocessing, so the imputation algorithms are prevented from generating erroneous results by processing data of different scales.
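Formula (1), applied only to the observed values of a column so that imputation can later run on comparably scaled data, can be sketched as follows (using None as the missing-value marker is an assumption of this sketch; the illustrative values are not from the dataset):

```python
# Sketch of min-max scaling (Formula 1) for one feature column.
# None marks a missing value; missing entries are left untouched so
# that imputation can run afterwards on already-scaled data.

def min_max_scale(column):
    observed = [x for x in column if x is not None]
    lo, hi = min(observed), max(observed)
    span = hi - lo
    return [None if x is None else (x - lo) / span for x in column]

crp = [5.0, None, 105.0, 55.0]   # illustrative CRP values in mg/L
scaled = min_max_scale(crp)      # [0.0, None, 1.0, 0.5]
```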

2.1.3. Filling missing data

Missing data in datasets is one of the main problems frequently encountered in machine learning. Using datasets with missing data negatively affects data analysis and decision processes. Therefore, many methods have been developed to fill missing data. The success of a missing‐data filling method is affected by many factors, such as the characteristics of the data and the type of missingness. 28 For this reason, choosing the appropriate method to fill the missing data in a dataset is an important and necessary preliminary step in machine learning studies.

There are many data imputation methods in the literature, but most of them are likely to waste valuable data or reduce variability in the dataset. However, the KNN and MICE imputers preserve much of the value and variability of a dataset. For this reason, these two imputers were chosen for data filling in this study. As a result, two separate sub‐datasets to be analyzed were obtained by filling the missing data of the dataset with both KNN and MICE.

Missing data filling with KNN

The missing data in the study were first filled with the KNN 23, 29 method. KNN is fundamentally a classification algorithm, and the same logic is used to fill missing data. The algorithm works with many data types, such as continuous, discrete, ordinal, and categorical, which makes KNN notable in that it can be applied to all kinds of missing data.

As can be seen in Figure 3, by measuring the distance between the record with missing data and the surrounding records with no missing data, the k observations at the closest distance d are found, and the missing value is filled with the average of these observations. The parameter k here indicates the number of nearest neighbors, that is, the number of closest observations. In the KNN imputation performed in this study, the value of k was taken as 5.

FIGURE 3.

CPE-7393-FIG-0003-c

Missing value estimation using KNN
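The imputation step described above can be sketched in Python as follows. This is an assumed minimal implementation, not the study's code: distances are computed over the features both rows share, and k = 2 is used only because the toy data is tiny (the study used k = 5).

```python
import math

# Toy sketch of KNN imputation: a missing cell is filled with the mean
# of that feature over the k donor rows nearest in the shared features.

def knn_impute(rows, k):
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            donors = []
            for other in rows:
                if other is row or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                d = math.sqrt(sum((a - b) ** 2 for a, b in shared))
                donors.append((d, other[j]))
            donors.sort(key=lambda t: t[0])
            nearest = donors[:k]
            filled[i][j] = sum(val for _, val in nearest) / len(nearest)
    return filled

data = [
    [1.0, 10.0],
    [1.1, 11.0],
    [5.0, 50.0],
    [1.05, None],   # missing value, nearest to the first two rows
]
imputed = knn_impute(data, k=2)  # last cell becomes (10 + 11) / 2 = 10.5
```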

Missing data filling with MICE

Another method used for filling missing data in this study is MICE. 5 MICE, a multiple imputation method, is the process of filling all missing data multiple times. 30 The algorithm fills missing data in a dataset through an iterative set of predictive models: in each iteration, the missing data to be filled is imputed using the other data in the dataset. Iterations continue until the convergence between the missing data and the other data reaches the optimum level.

A schematic representation of how the MICE method works on a data packet with missing values is shown in Figure 4. Note the sequential use of the mice(), with(), and pool() functions here. As seen in Figure 4, the MICE method performs imputation by first evaluating all variables and creating a prediction model for each target variable. Accordingly, the variable being imputed is the response element, while the others are independent elements. By default, the MICE algorithm uses predictive mean matching for continuous variables and logistic regression for binary variables. 31

FIGURE 4.

CPE-7393-FIG-0004-c

A schematic representation of how the MICE method works
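As a rough illustration of the chained-equations idea only: real MICE (e.g. the R mice package's mice(), with(), pool() sequence) fits a full predictive model per variable and pools multiple imputations, whereas the sketch below, by assumption, regresses each missing column on the mean of the other columns with simple least squares and iterates until the filled values stabilize.

```python
# Heavily simplified sketch of chained-equations imputation.
# Each missing cell starts at its column mean, then is repeatedly
# re-predicted from the other (currently filled) columns.

def mice_sketch(rows, n_iter=10):
    n_cols = len(rows[0])
    missing = [(i, j) for i, r in enumerate(rows)
               for j, v in enumerate(r) if v is None]
    filled = [list(r) for r in rows]
    for j in range(n_cols):  # initialize with column means
        obs = [r[j] for r in rows if r[j] is not None]
        mean_j = sum(obs) / len(obs)
        for i in range(len(rows)):
            if rows[i][j] is None:
                filled[i][j] = mean_j
    for _ in range(n_iter):
        for i, j in missing:
            # predictor: mean of the other columns for every row
            xs = [sum(r[c] for c in range(n_cols) if c != j) / (n_cols - 1)
                  for r in filled]
            ys = [r[j] for r in filled]
            train = [(x, y) for r, x, y in zip(rows, xs, ys)
                     if r[j] is not None]
            mx = sum(x for x, _ in train) / len(train)
            my = sum(y for _, y in train) / len(train)
            var = sum((x - mx) ** 2 for x, _ in train)
            slope = (sum((x - mx) * (y - my) for x, y in train) / var
                     if var else 0.0)
            filled[i][j] = my + slope * (xs[i] - mx)
    return filled

data = [[1.0, 2.0], [2.0, 4.0], [3.0, None], [4.0, 8.0]]
imputed = mice_sketch(data)  # missing cell converges near 6.0
```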

2.2. Data balancing

Datasets in which the classes are not evenly distributed and the differences in data amounts between classes are large are called unbalanced datasets. Unbalanced data affects the classification model, particularly through the training data: it causes the model to lean toward the majority class, reducing its performance. 22, 32, 33 For this reason, most machine learning algorithms do not give reliable results in studies with unbalanced datasets, ignoring the uneven distribution among classes. Therefore, data balancing preprocessing was performed using the SMOTE method in this research.

2.2.1. Sampling with SMOTE

There are two main approaches to data balancing: undersampling and oversampling. As seen in Figure 5, deleting enough of the data of the majority class with various algorithms to match the minority class is called undersampling. Its opposite, oversampling, balances the classes by generating new data from the minority class using various algorithms. By generating artificial samples rather than merely replicating existing observations, this method does not destroy the natural structure of the dataset. 34

FIGURE 5.

CPE-7393-FIG-0005-c

Schematic representation of oversampling and undersampling processes

The SMOTE 12, 22, 26 method is an oversampling balancing technique that produces artificial samples 35 based on the k closest neighbors of the samples examined in the minority class. The SMOTE method strengthens the classifier's ability to learn minority classes, as it provides more relevant minority‐class examples for the training process. The risk of overfitting faced by random oversampling can be avoided with the SMOTE algorithm. 36 For this reason, current oversampling methods based on SMOTE can yield relatively better results than the original. 37 A schematic representation of synthetic samples created by the SMOTE technique is shown in Figure 6.

FIGURE 6.

CPE-7393-FIG-0006-c

Schematic representation of synthetic samples created by SMOTE technique

Within the scope of this study, five different training and testing sets were obtained by randomly splitting the data whose missing values were filled with the KNN and MICE methods into 80% for training and 20% for testing. Subsequently, the 142 positive‐class samples and 82 negative‐class samples in each training set were balanced to 142 positive and 142 negative samples using the SMOTE technique.
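The SMOTE interpolation described above can be sketched as follows; this is an assumed minimal version (toy two-dimensional samples, not the study's data), in which each synthetic sample lies on the segment between a minority sample and one of its k nearest minority neighbors:

```python
import math
import random

# Sketch of SMOTE oversampling: synthetic minority samples are linear
# interpolations between a minority sample and one of its k nearest
# minority neighbors. Values are illustrative only.

def smote(minority, n_new, k=5, seed=42):
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (nb - b)
                          for b, nb in zip(base, neighbor)])
    return synthetic

negatives = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.3, 0.2]]
new_samples = smote(negatives, n_new=2, k=2)
balanced_negatives = negatives + new_samples  # 82 -> 142 in the study
```

Because each synthetic point is an interpolation, it stays inside the convex hull of the minority class, which is why SMOTE avoids the exact-duplicate overfitting of random oversampling.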

2.3. Classification and evaluation processes

Ensemble learning is a classification technique created by combining multiple modeling algorithms to get better results than can be obtained from a single classifier model. Since ensemble learning methods reduce variance with bagging, reduce bias with boosting, and obtain more successful predictions with stacking, 38 they achieve more successful results than many classifier techniques. In this direction, classification success was examined by testing seed parameter values between 1 and 25 for the ensemble learning methods bagging, AdaBoost, and random forest over datasets containing class label "1" for infected individuals and "0" for noninfected individuals. The average classifier success value was obtained by averaging these results. Additionally, the classical classifiers KNN classifier (KNNC), support vector machine (SVM), logistic regression, artificial neural network (ANN), and decision tree 39, 40 were also included in the classification, likewise averaging their results, and a comparative analysis was made with the ensemble learning methods.
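The seed-averaging scheme can be sketched as follows. Everything here is an assumption for illustration: the paper does not publish code, the toy data is invented, and a 1-nearest-neighbor base learner stands in for whatever base classifier the study's bagging implementation used; only the structure (bootstrap sampling, majority vote, seeds 1–25, averaged accuracy) mirrors the text.

```python
import math
import random

# Sketch of the evaluation scheme: a bagging ensemble (bootstrap
# samples, 1-NN base learners, majority vote) trained for each seed
# value 1..25, with the resulting accuracies averaged.

def nn1_predict(train, x):
    return min(train, key=lambda p: math.dist(p[0], x))[1]

def bagging_predict(train, x, n_estimators, seed):
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_estimators):
        boot = [rng.choice(train) for _ in train]  # bootstrap sample
        votes += nn1_predict(boot, x)
    return 1 if votes * 2 >= n_estimators else 0

# class 1 = infected, class 0 = noninfected (toy, well separated)
train = [([0.1, 0.1], 0), ([0.2, 0.1], 0), ([0.8, 0.9], 1), ([0.9, 0.8], 1)]
test = [([0.15, 0.12], 0), ([0.85, 0.85], 1)]

accuracies = []
for seed in range(1, 26):
    correct = sum(bagging_predict(train, x, 11, seed) == y for x, y in test)
    accuracies.append(correct / len(test))
mean_accuracy = sum(accuracies) / len(accuracies)
```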

The achieved accuracy was evaluated with certain performance criteria: 22, 41, 42 the total number of instances (TNI), correctly classified instances (CCI), Kappa coefficient, precision, recall, Matthews correlation coefficient (MCC), area under the ROC curve (AUC), and average classification accuracy (CA).
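The two less common criteria, Kappa and MCC, follow directly from the binary confusion matrix. A sketch (the confusion counts below are illustrative, not the study's results):

```python
import math

# Kappa and Matthews correlation coefficient from a binary confusion
# matrix (tp/tn/fp/fn = true/false positives/negatives).

def kappa_and_mcc(tp, tn, fp, fn):
    n = tp + tn + fp + fn
    po = (tp + tn) / n                        # observed agreement
    pe = ((tp + fp) * (tp + fn)
          + (fn + tn) * (fp + tn)) / (n * n)  # chance agreement
    kappa = (po - pe) / (1 - pe)
    mcc = ((tp * tn - fp * fn)
           / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return kappa, mcc

kappa, mcc = kappa_and_mcc(tp=4, tn=4, fp=1, fn=1)  # both equal 0.6 here
```

Both measures approach 1 for perfect agreement and 0 for chance-level prediction, which is why the tables below report them alongside plain accuracy.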

3. RESULTS

This study was performed using measurements of the blood samples of 279 patients. Two data subgroups were obtained by imputing the missing data points using the KNN and MICE methods. As seen in Figure 7, for each subgroup five different training and testing sets were generated randomly from 80% and 20% of the data, respectively. As a result, five different training sets of 224 samples and corresponding testing sets of 55 samples were obtained.

FIGURE 7.

CPE-7393-FIG-0007-b

Generating for each sub‐dataset, randomly 80% training and 20% testing sets

Classification was carried out using the 10‐fold cross validation (CV) technique, both with and without SMOTE applied to the training sets. The CV technique is a statistical resampling method developed to obtain an accurate evaluation from randomly generated training and testing sets. As presented in Figure 8, in 10‐fold CV the training set is divided into 10 subgroups. The process is repeated 10 times iteratively, using one group of the divided training set as the validation set and the remaining nine groups as the training set. In each iteration, the model is built on the training portion and evaluated on the validation set. The average of the evaluation results obtained over the iterations gives the most accurate performance estimate of the training set to be used in classification. The five training sets obtained for each sub‐dataset were evaluated with the CV technique: the training model was built using 202 of the 224 samples in each training set, and the model was evaluated on the validation set of the remaining 22 samples. The performance obtained after 10 iterations gave the success of the training set. After the best training success was achieved, tests were carried out on this training model.
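The fold bookkeeping for a 224-sample training set can be sketched as follows (a generic index-splitting helper, not the study's code; with 224 samples and 10 folds, four folds hold 23 samples and six hold 22, matching the roughly 202/22 train/validation split described above):

```python
# Sketch of 10-fold CV index splitting: every fold serves once as the
# validation set while the remaining samples train the model.

def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n))
    fold_sizes = [n // k + (1 if f < n % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(224, 10))
sizes = [len(val) for _, val in folds]  # four folds of 23, six of 22
```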

FIGURE 8.

CPE-7393-FIG-0008-b

10‐fold cross validation (CV) technique

The tests were performed using the classifier parameters that provided the highest accuracy, and the averages of the statistically calculated performance results were taken. The average test results obtained from the datasets imputed by the KNN method are shown in Table 3. As can be seen from Table 3, the highest average CA ratio, 83.91%, was reached with the bagging classifier without SMOTE.

TABLE 3.

Average classification results of the dataset filled missing data with KNN (TNI = 55)

Classifier ˜CCI Kappa Precision Recall MCC AUC CA
Without SMOTE method KNNC 36 0.2455 65.10 65.50 0.3500 0.6210 65.45
SVM 43 0.4839 78.34 77.84 0.5070 0.7250 77.61
ANN 43 0.5075 77.90 78.20 0.5140 0.8200 78.18
Decision tree 41 0.3984 74.60 74.50 0.4240 0.6820 74.54
Logistic regression 42 0.4651 75.72 76.0 0.4700 0.8026 75.99
Bagging 46 0.6376 83.99 83.88 0.6455 0.8873 83.91
AdaBoost 46 0.6316 83.65 82.95 0.6377 0.8831 83.01
Random forest 46 0.6176 83.18 83.18 0.6291 0.8790 83.16
With SMOTE method KNNC 33 0.1538 60.90 60.0 0.3176 0.5790 60.0
SVM 42 0.5034 77.08 77.08 0.5040 0.7514 77.08
ANN 43 0.5479 79.70 78.20 0.5540 0.8500 78.18
Decision tree 34 0.2007 63.10 61.80 0.2020 0.6040 61.81
Logistic regression 42 0.4947 75.66 76.36 0.4950 0.7942 76.36
Bagging 46 0.6367 83.31 83.44 0.6382 0.8807 83.42
AdaBoost 45 0.6225 82.17 82.91 0.6276 0.8658 82.90
Random forest 45 0.6314 81.93 81.38 0.6225 0.8696 82.68

When Table 3 is examined, the highest CA ratio obtained by applying SMOTE, 83.42%, was again reached with the same classifier. Since the MCC criterion is the correlation coefficient between the observed and predicted binary classifications, proximity to 1 indicates high accuracy of the prediction. The best MCC value, 0.6455, was reached for the dataset obtained without SMOTE (Table 3); the MCC value for the dataset obtained using SMOTE was 0.6382. Despite slight differences in classification success and the other statistical parameters, the two datasets had nearly the same Kappa coefficient (0.6376 without SMOTE and 0.6367 with SMOTE). In addition, with the bagging classifier, 46 out of 55 samples were correctly predicted for both datasets. One of the disadvantages of SMOTE is that it can generate noisy and outlier data. Also, due to its use of the K‐nearest neighbor (KNN) algorithm, it can generate closely spaced data that do not adequately represent the entire distribution in the feature space. 43 Overall, it was concluded that applying SMOTE did not have a positive effect on the results for the dataset imputed with KNN.

Table 4 shows the average test results obtained from the datasets imputed with MICE. With MICE imputation, the highest average CA ratio reached with the bagging classifier was 83.74%, obtained by applying SMOTE. In addition, a comparison across all classifiers indicates that the highest values are reached for the dataset obtained by applying SMOTE. This shows that the MICE method in combination with the SMOTE technique gives a very successful result on COVID‐19 data.

TABLE 4.

Average classification results of the dataset filled missing data with MICE (TNI: 55)

Classifier ˜CCI Kappa Precision Recall MCC AUC CA
Without SMOTE method KNNC 36 0.3192 70.02 65.50 0.3370 0.6790 65.45
SVM 43 0.4882 78.84 77.82 0.5142 0.7292 77.81
ANN 40 0.4291 73.80 72.70 0.4320 0.7970 72.72
Decision tree 36 0.2915 67.70 65.50 0.2970 0.6540 65.45
Logistic regression 44 0.5494 79.80 79.52 0.5560 0.8350 79.53
Bagging 45 0.6070 81.84 81.65 0.6050 0.8565 81.66
AdaBoost 44 0.5866 81.60 81.43 0.5883 0.8602 81.43
Random forest 45 0.6016 82.21 82.02 0.6085 0.8730 82.02
With SMOTE method KNNC 36 0.3192 70.02 65.50 0.3370 0.6790 65.45
SVM 43 0.5498 79.62 79.07 0.5547 0.7740 79.08
ANN 37 0.3734 74.70 67.30 0.4110 0.7610 67.27
Decision tree 30 0.0283 55.0 54.50 0.0280 0.5140 54.54
Logistic regression 43 0.5286 78.20 78.20 0.5290 0.8570 78.18
Bagging 46 0.6376 84.04 83.81 0.6416 0.8751 83.74
AdaBoost 45 0.6260 82.99 82.69 0.6290 0.8708 82.69
Random forest 45 0.6202 82.73 82.36 0.6294 0.8688 82.41

When Tables 3 and 4 are examined together, it can be seen that the classifier model producing the highest statistical results for both data subsets is usually the bagging classifier. In terms of the Kappa coefficient, the bagging algorithm also achieves the best results in the analysis of all datasets; with its Kappa coefficient approaching 1, the bagging classifier comes closer to perfect agreement than the other classifiers. Whereas bagging achieved the highest success without SMOTE for the dataset imputed with KNN, it achieved the highest success with SMOTE for the dataset imputed with MICE. This shows that matching the data with a suitable preprocessing technique, that is, the integrated dataset–data preprocessing combination aimed at in this study, is clearly an effective factor in success. Among all data groups, the highest average CA success was obtained with the bagging classifier, and the next highest achievements, for AdaBoost and random forest, were 83.01% and 83.16%, respectively. This output reveals that applying the data preprocessing techniques presented in this study to the blood parameters used for COVID‐19 detection yielded highly accurate results for ensemble learning methods and constituted a successful data preprocessing combination.

3.1. Comparative analysis of models based on hypothesis testing

The comparisons carried out within the scope of the study among the models, classifiers, and methods applied to the data were presented above together with their results. However, for the reliability of the study it is essential to estimate how accurate and reliable the similarities or differences between the performances of these compared models are. Statistical hypothesis testing is needed to answer the question of whether the obtained results came about by chance and to express them with greater validity.

In this study, the paired t‐test was used to make the comparisons between the classifiers and between the methods applied to the data more valid. This procedure is one of the most common statistical hypothesis tests for performance comparison. Within the scope of the test, H0, the null hypothesis, states that there is no difference between the two classifiers or between the methods applied to the datasets; in other words, that the compared items have the same performance. The significance level was chosen as 5% (0.05, two‐tailed), and the analyses and interpretations were made accordingly. Where the H0 hypothesis was rejected, the results were expressed as significantly better (sb) or significantly worse (sw) at the 5% level of statistical significance; where it was accepted, they were shown as significantly same (ss).
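The decision rule described above can be sketched as follows; the per‐group accuracy values for the two classifiers are hypothetical placeholders, not the study's numbers, and the mapping of the two‐tailed test outcome to the ss/sb/sw labels follows the convention stated in the text:

```python
# Sketch of the paired t-test decision rule at alpha = 0.05 (two-tailed),
# on hypothetical paired accuracies for two classifiers A and B.
from scipy import stats

acc_a = [0.84, 0.83, 0.85, 0.84, 0.82]  # e.g., classifier A over 5 data groups
acc_b = [0.80, 0.80, 0.81, 0.79, 0.78]  # e.g., classifier B on the same groups

t_stat, p_value = stats.ttest_rel(acc_a, acc_b)  # two-tailed by default

if p_value >= 0.05:
    verdict = "ss"  # H0 accepted: significantly same performance
else:
    verdict = "sb" if t_stat > 0 else "sw"  # A significantly better / worse
print(verdict)
```

The pairing matters: each data group contributes one difference, so the test accounts for the fact that both classifiers were evaluated on the same splits rather than on independent samples.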

The statistical comparisons of the preprocessing techniques applied to the datasets and of the classifier performances used in the study are given and interpreted in Tables 5, 6, 7, 8, and 9. Among these, Table 5 below gives the comparative hypothesis analyses of the classifiers on each dataset, based on the bagging classifier, whose performance was determined to be the best.

TABLE 5.

Classifier‐based comparative hypothesis test analysis results on datasets

Dataset Base classifier AdaBoost Random forest KNNC SVM LR ANN DT
KNN‐without SMOTE method Bagging ss ss sw sw sw sw sw
KNN‐with SMOTE method sw sw sw sw sw sw sw
MICE‐without SMOTE method sw ss sw sw sw sw sw
MICE‐with SMOTE method sw sw sw sw sw sw sw

TABLE 6.

Comparative hypothesis test analysis results of SMOTE‐balanced and unbalanced models of datasets with missing data completed with KNN

Classifiers KNN‐with SMOTE method (relative to KNN‐without SMOTE method)
Bagging sw
AdaBoost sw
Random forest sw
KNNC sw
SVM sw
Logistic regression ss
ANN sw
Decision tree sw

TABLE 7.

Comparative hypothesis test analysis results of SMOTE‐balanced and unbalanced models of datasets with missing data completed with MICE

Classifiers MICE‐with SMOTE method (relative to MICE‐without SMOTE method)
Bagging sb
AdaBoost sb
Random forest ss
KNNC ss
SVM ss
Logistic regression ss
ANN sw
Decision tree sw

TABLE 8.

Comparative hypothesis test analysis results of KNN and MICE methods of unbalanced datasets

Classifiers MICE‐without SMOTE method (relative to KNN‐without SMOTE method)
Bagging sw
AdaBoost sw
Random forest sw
KNNC sw
SVM ss
Logistic regression sb
ANN ss
Decision tree ss

TABLE 9.

Comparative hypothesis test analysis results of KNN and MICE methods of balanced datasets with SMOTE

Classifiers MICE‐with SMOTE method (relative to KNN‐with SMOTE method)
Bagging ss
AdaBoost ss
Random forest ss
KNNC sb
SVM sb
Logistic regression sb
ANN sw
Decision tree sb

When the hypothesis analysis results in Table 5 are examined, only for the KNN‐without SMOTE method dataset was the H0 hypothesis accepted at the 5% significance level in the analyses of the bagging classifier against AdaBoost and random forest. For the MICE‐without SMOTE method, the null hypothesis was not rejected only between bagging and random forest. Despite these cases where the H0 hypothesis, which indicates that the classifiers have the same performance, was accepted, the Kappa, precision, recall, MCC, and AUC evaluation criteria show that the bagging classifier generally performs slightly better. For the KNN‐with SMOTE method and MICE‐with SMOTE method datasets, the null hypothesis was rejected at the 5% significance level in the analyses between the bagging classifier and the others; in other words, the bagging classifier performed statistically better than the others. In Tables 6, 7, 8, and 9 below, a classifier‐based statistical comparative analysis of the relevant data groups and applied methods is presented.

The comparative hypothesis analyses in Tables 6, 7, 8, and 9 were carried out in four main sections based on the dataset‐method‐classifier combination. The first was made between the KNN‐without SMOTE method and the KNN‐with SMOTE method, and the results are presented in Table 6. Only for the logistic regression classifier was the H0 hypothesis accepted at the 5% significance level, indicating no difference between the two groups. For all the other classifiers, the KNN‐without SMOTE method group was found to be significantly better. The second statistical comparison was made between the MICE‐without SMOTE method and the MICE‐with SMOTE method, and the results are presented in Table 7. As seen in Table 7, the null hypothesis was accepted for the random forest, KNNC, SVM, and logistic regression classifiers. For ANN and decision tree, the H0 hypothesis was rejected at the 5% significance level and the MICE‐with SMOTE method was found to be significantly worse, whereas it was significantly better for the bagging and AdaBoost classifiers.

As a further comparison, as seen in Tables 8 and 9, the KNN and MICE imputation methods were analyzed against each other on both the without‐SMOTE and the with‐SMOTE datasets. Table 8 gives the results of the comparative analysis between the KNN‐without SMOTE method and the MICE‐without SMOTE method. When these results are examined, there is no difference between the two groups for the SVM, ANN, and decision tree classifiers, as the H0 hypothesis was accepted at the 5% significance level. Among the five classifiers for which the null hypothesis was rejected, the MICE‐without SMOTE method was significantly better only for logistic regression, and significantly worse for bagging, AdaBoost, random forest, and KNNC. The last of these comparisons was made between the KNN‐with SMOTE method and the MICE‐with SMOTE method. According to this analysis, for bagging, AdaBoost, and random forest there was no difference between the two groups at the 5% significance level. While the KNN‐with SMOTE method was significantly better only for the ANN classifier, the MICE‐with SMOTE method was found to be significantly better for KNNC, SVM, logistic regression, and decision tree.

When all the analyses carried out within the scope of the study and presented in the five tables are evaluated in general:

  • Although bagging, AdaBoost, and random forest produce similar results among the classifiers, bagging was seen to perform slightly better in most cases in terms of the other evaluation criteria.

  • For the classifiers with the best results (bagging, AdaBoost, and random forest), there is no statistically significant difference at the 5% level between KNN and MICE when the SMOTE method is applied. However, when the SMOTE method was not applied, the KNN technique was found to be significantly better for the same classifiers. For the other comparative hypothesis analyses, the results vary.

  • When the general case is analyzed in terms of the without‐SMOTE and with‐SMOTE methods, there is a statistically significant difference at the 5% level between these two methods for the bagging classifier, which generally shows the highest performance. Accordingly, while the without‐SMOTE method was significantly better for KNN, the with‐SMOTE method was found to be significantly better for MICE.

4. DISCUSSION

It is known that the transmission rate of COVID‐19 is directly related to the degree of contact between people and objects. The active use of technology in a wide range of areas is therefore critically important for minimizing the spread of the virus. One such important application is the use of artificial intelligence‐based expert systems in the health sector. If patient follow‐up could be done in real time through artificial intelligence systems, it would be easier to control the spread of this disease. The RT‐PCR test applied first in the diagnosis phase of COVID‐19 has disadvantages, such as the time it takes until results are available and the limited reliability of those results. As a more precise alternative, diagnosis can be carried out using computed tomography; however, the requirement that the images be examined by a radiologist again creates a bottleneck until the results are available. Artificial intelligence‐based studies are carried out especially to minimize the drawbacks caused by the time it takes to diagnose a patient. 14 , 15 If the diagnosis of this disease, with which the whole world has been struggling, were faster and more reliable, the rate of spread of the epidemic could also be controlled better. Correspondingly, the risk could be minimized for the elderly and people with chronic diseases, who are defined as the "risk groups" in society. For all these reasons, it is imperative for machine learning and artificial intelligence‐based expert systems to actively take their place at the forefront of this struggle.

In this study, the importance of using artificial intelligence‐based systems together with certain preprocessing steps in the diagnosis of COVID‐19 was investigated using data obtained from 279 individuals who applied to a healthcare institution with suspected COVID‐19. To that end, categorical feature encoding was carried out first with the one‐hot encoding method. Then, min‐max feature scaling was performed to the range [0, 1], and the missingness of the dataset was eliminated with the KNN and MICE methods. Two separate data subsets were generated from these imputed data, and five random data groups were created for each. The data were balanced by applying the SMOTE method to the training sets of each data group. Finally, both the SMOTE‐applied and non‐applied data groups were classified with 25 different seed parameter values of the bagging, AdaBoost, and random forest algorithms. The average of the results generated with these 25 seeds was taken, and the average classifier success was calculated over the five data groups. According to these results, high success rates can be achieved in the detection of infected individuals by applying appropriate data preprocessing techniques to the blood test results used for diagnosing COVID‐19. The data used in this study were also used in the COVID‐19 study conducted by Brinati et al., in which two machine learning models were developed to discriminate between patients who are positive or negative for COVID‐19. In addition, Aljame et al. presented an ensemble learning model for the diagnosis of COVID‐19 using the COVID‐19 dataset of the Albert Einstein Hospital in Brazil, which contains 5644 patients.
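The preprocessing chain summarized above can be sketched roughly as follows with scikit‐learn on synthetic data. The column names (gender, age, wbc), the generated values, and the parameter choices are illustrative assumptions, not the study's dataset or settings, and the SMOTE step (which the study applied to the training sets, for example via the imbalanced‐learn library) is omitted here to keep the sketch dependency‐light:

```python
# Rough sketch: one-hot encode the categorical column, min-max scale the
# numeric columns to [0, 1], impute missing values with KNN, then fit a
# bagging classifier. All data below is synthetic and illustrative.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import BaggingClassifier
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "gender": rng.choice(["F", "M"], size=n),
    "age": rng.uniform(20, 80, size=n),
    "wbc": rng.normal(7.0, 2.0, size=n),   # a stand-in blood parameter
})
X.loc[rng.choice(n, size=15, replace=False), "wbc"] = np.nan  # missingness
y = (X["wbc"].fillna(7.0) > 7.0).astype(int).to_numpy()       # toy labels

pre = ColumnTransformer([
    ("cat", OneHotEncoder(), ["gender"]),
    ("num", Pipeline([
        ("scale", MinMaxScaler()),              # NaNs are ignored in fit
        ("impute", KNNImputer(n_neighbors=5)),  # KNN-based imputation
    ]), ["age", "wbc"]),
])
model = Pipeline([("pre", pre), ("clf", BaggingClassifier(random_state=42))])
model.fit(X, y)
print(round(model.score(X, y), 2))
```

In a setup closer to the study's, an oversampling step such as imbalanced‐learn's SMOTE would be applied between the preprocessing and the classifier, on the training folds only, so that synthetic minority samples never leak into the test data.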
Finally, in the study carried out by Cabitza et al., five different machine learning models were developed for the diagnosis of COVID‐19 using the OSR COVID‐19 dataset with 1624 patients, a COVID‐specific dataset, and a CBC dataset. Discussing the results of the present study in general, applying the data preprocessing techniques presented here to the blood parameters used for COVID‐19 detection forms a successful data preprocessing combination for ensemble learning methods.

In the other studies analyzed, the emphasis has generally been on developing the most successful classifier model for the detection of COVID‐19. What distinguishes our study from previous research is that the data preprocessing techniques that perform effectively on datasets containing blood data are determined, so that future COVID‐19 studies on such datasets can build successful machine learning models. In this direction, we present an integrated dataset‐data preprocessing combination using blood parameters to develop a successful classifier model for the detection of COVID‐19. Consequently, this study provides a new direction for future studies by demonstrating the importance of matching datasets with suitable preprocessing techniques.

5. THREATS TO VALIDITY

In this section, the threats to the validity of our research are discussed in three categories: construct, internal, and external validity.

Construct validity: Threats to this validity concern the suitability of the baseline assessment measures used in the evaluation and comparison of the classifiers in our research. We used the CA criterion to evaluate and compare the performance outputs of the classifier algorithms, as this criterion is used and preferred in most studies in the literature in which classifiers are applied and their results analyzed. The CA criterion was also preferred because other evaluation criteria can lead to differing interpretations and to results that diverge too much from one another. In addition, whether the performance differences of the algorithms used within the scope of the study are significant was checked and interpreted with statistical hypothesis tests.

Apart from all this, our research presents an integrated dataset‐data preprocessing combination using blood parameters to develop a successful classifier model for COVID‐19 detection. In short, the importance of matching datasets with appropriate preprocessing techniques has been revealed, giving a new direction to future studies. In addition, in order to minimize the threats mentioned above, future studies should focus on a single classifier algorithm and examine in detail how changing or optimizing its many parameters in very small steps affects the results.

Internal validity: Threats at this stage are directly related to incorrect, erroneous, and unclear datasets, because it is a well‐known fact that wrong data can make even perfectly working classifiers problematic. If scientists working in this field cannot fully analyze the errors in their data, they will look for the problem in the classifiers they use, and unsuccessful or erroneous results will be obtained. Although researchers try to check for errors and inaccuracies when datasets are first recorded, we can never be sure that they are completely error‐free. As stated in this study, the unavailability of complete and accurate data was seen as a factor that negatively affected classifier performance. In addition, the imbalance in the distribution of positive and negative patients in the dataset was another situation that negatively affected the results. To minimize the threat to internal validity, it would therefore be beneficial for researchers to further investigate the inaccuracies, errors, and incompleteness of their data in future work.

External validity: Threats to this part are directly related to the generalizability of the performance outputs. It is possible that the classifier performances obtained from the dataset we used are not the same for different COVID‐19 datasets. In this research, the effects of data preprocessing techniques were examined in order to overcome many threats and limitations and to minimize the performance degradation they cause. As a result of these applications, the positive effects of the techniques used have been demonstrated, albeit to a limited extent. However, we think that one factor affecting the accuracy and generalizability of the classifier outputs is the limited amount of data available. In future studies, including computed tomography data in addition to the blood counts used for the detection of COVID‐19 will make the results more valid and general.

6. CONCLUSION

In this study, the aim was to determine the combination of a COVID‐19 blood parameter dataset and suitable data preprocessing techniques for developing a successful classifier model. In the experimental studies, the results of applying data preprocessing techniques were analyzed for the detection of COVID‐19 from blood parameter values, and the importance of choosing a suitable combination of data preprocessing techniques was shown.

Studies to develop artificial intelligence‐based COVID‐19 diagnostic systems have accelerated in the last year. However, the time taken to test the reliability of these systems has been insufficient; this is the biggest limitation of the studies carried out with them. In addition, the inaccessibility of complete patient data is another disadvantage that lowers the accuracy of such studies. A further limitation is the imbalance of the positive and negative patient distribution in the obtained data, which directly affects the calculated performance criteria. In this study, the effect of data preprocessing techniques was examined to overcome these limitations, and the study can serve as a guide for the detection of COVID‐19 from blood parameters. Finally, increasing the amount of data will enable more accurate determinations. In future studies, computed tomography results and blood parameters can be evaluated together with artificial intelligence methods, enabling more successful COVID‐19 detection.

CONFLICT OF INTEREST

The authors have declared no conflict of interest.

INFORMED CONSENT STATEMENT

As stated in Reference 5; individuals signed an informed consent authorizing the use of their anonymously collected data for retrospective observational studies (article 9.2.j; EU general data protection regulation 2016/679 [GDPR]), according to the IRCCS San Raffaele Hospital policy (IOG075/2016), and the appropriate institutional forms have been archived.

Erol G, Uzbaş B, Yücelbaş C, Yücelbaş Ş. Analyzing the effect of data preprocessing techniques using machine learning algorithms on the diagnosis of COVID‐19. Concurrency Computat Pract Exper. 2022;34(28):e7393. doi: 10.1002/cpe.7393

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available at reference.

REFERENCES

  • 1. World Health Organization (WHO) . Health topics, coronavirus; 2020. https://www.who.int/healthtopics/coronavirus#tab=tab_3
  • 2. World Health Organization (WHO) . WHO coronavirus (COVID‐19) dashboard; 2022. Retrieved from https://covid19.who.int
  • 3. Day M. Covid‐19: identifying and isolating asymptomatic people helped eliminate virus in Italian village. BMJ. 2020;368(1165):1. doi: 10.1136/bmj.m1165 [DOI] [PubMed] [Google Scholar]
  • 4. Alkhodari M, Khandoker AH. Detection of COVID‐19 in smartphone‐based breathing recordings: a pre‐screening deep learning tool. PLoS ONE. 2020;17(1):e0262448. doi: 10.1371/journal.pone.0262448 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Brinati D, Campagner A, Ferrari D, Locatelli M, Banfi G, Cabitza F. Detection of COVID‐19 infection from routine blood exams with machine learning: a feasibility study. J Med Syst. 2020;44:135. doi: 10.1007/s10916-020-01597-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Li Z, Yi Y, Luo X, et al. Development and clinical application of a rapid IgM‐IgG combined antibody test for SARS‐CoV ‐2 infection diagnosis. J Med Virol. 2020;92:1518‐1524. doi: 10.1002/jmv.25727 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Mei X, Lee HC, Diao K, et al. Artificial intelligence–enabled rapid diagnosis of patients with COVID‐19. Nat Med. 2020;26(8):1224‐1228. doi: 10.1038/s41591-020-0931 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Demirdal T, Sen P. The significance of neutrophil‐lymphocyte ratio, platelet‐lymphocyte ratio and lymphocyte‐monocyte ratio in predicting peripheral arterial disease, peripheral neuropathy, osteomyelitis and amputation in diabetic foot infection. Diabetes Res Clin Pract. 2018;7(144):118‐125. doi: 10.1016/j.diabres.2018.08.009 [DOI] [PubMed] [Google Scholar]
  • 9. Jan HC, Yang WH, Ou CH. Combination of the preoperative systemic immune‐inflammation index and monocyte‐lymphocyte ratio as a novel prognostic factor in patients with upper‐tract urothelial carcinoma. Ann Surg Oncol. 2019;26(2):669‐684. doi: 10.1245/s10434-018-6942-3 [DOI] [PubMed] [Google Scholar]
  • 10. Sun S, Cai X, Wang H, et al. Abnormalities of peripheral blood system in patients with COVID‐19 in Wenzhou, China. Clin Chim Acta. 2020;507:174‐180. doi: 10.1016/j.cca.2020.04.024 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Guan W, Ni Z, Hu Y, et al. Clinical characteristics of coronavirus disease 2019 in China. New Engl J Med. 2019;382(18):1708‐1720. doi: 10.1056/NEJMoa2002032 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Göreke V, Sarı V, Kockanat S. A novel classifier architecture based on deep neural network for COVID‐19 detection using laboratory findings. Appl Soft Comput. 2021;106(1):107329. doi: 10.1016/j.asoc.2021.107329 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Ferrari D, Motta A, Strolla M, Banfi G, Locatelli M. Routine blood tests as a potential diagnostic tool for COVID‐19. Clin Chem Lab Med. 2020;58(7):1095‐1099. doi: 10.1515/cclm-2020-0398 [DOI] [PubMed] [Google Scholar]
  • 14. Clifton C. Data mining (computer science); 2018. Retrieved from https://www.britannica.com/technology/data‐mining
  • 15. Gill JK. Data preparation, preprocessing, wrangling in deep learning‐Xenon stack; 2018. Retrieved from www.xenonstack.com.
  • 16. Durairaj M, Sathyavathi T. Applying rough set theory for medical informatics data analysis. Int J Sci Res Comput Sci Eng. 2013;1(5):1‐8. [Google Scholar]
  • 17. Luo Y. Evaluating the state of the art in missing data imputation for clinical data. Brief Bioinform. 2022;23(1). doi: 10.1093/bib/bbab489 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Durairaj M, Ramasamy N. A comparison of the perceptive approaches for preprocessing the data set for predicting fertility success rate. Int J Control Theory Appl. 2016;9(27):255‐260. [Google Scholar]
  • 19. Famili F, Shen WM, Weber R, Simoudis E. Data pre‐processing and intelligent data analysis. Intell Data Anal. 1997;1(1):3‐23. doi: 10.1016/S1088-467X(98)00007-9 [DOI] [Google Scholar]
  • 20. Ruperez MJ, Martin‐Guerrero JD, Monserrat C, Alcaniz M. Artificial neural networks for predicting dorsal pressure on the foot surface while walking. Expert Syst Appl. 2011;39:5349‐5357. doi: 10.1016/j.eswa.2011.11.050 [DOI] [Google Scholar]
  • 21. Almuhaideb S, Menai MEB. Impact of preprocessing on medical data classification. Front Comp Sci. 2016;10(6):1082‐1102. doi: 10.1007/s11704-016-5203-5 [DOI] [Google Scholar]
  • 22. Aljame M, Ahmad I, Imtiaz A, Mohammed A. Ensemble learning model for diagnosing COVID‐19 from routine blood tests. Inform Med Unlocked. 2020;21:100449. doi: 10.1016/j.imu.2020.100449 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Cabitza F, Campagner A, Ferrari D, Resta C, et al. Development, evaluation, and validation of machine learning models for COVID‐19 detection based on routine blood tests. Clin Chem Lab Med. 2021;59(2):421‐431. doi: 10.1515/cclm-2020-1294 [DOI] [PubMed] [Google Scholar]
  • 24. Mathiak B, Eckstein S. Five steps to text mining in biomedical literature five steps to text mining in biomedical literature. Proceedings of the 15th European Conference on Machine Learning and the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases; 2004:47‐50. https://www.academia.edu/27467487/Five_Steps_to_Text_Mining_in_Biomedical_Literature
  • 25. Aljameel SS, Khan IU, Aslam N, Aljabri M, Alsulmi ES. Managing big data, visualization and its analytics in healthcare based on scientific programming 2021. Sci Program. 2021;2021:5587188. doi: 10.1155/2021/5587188 [DOI] [Google Scholar]
  • 26. Chadaga K, Prabhu S, Umakanth S, et al. COVID‐19 mortality prediction among patients using epidemiological parameters: an ensemble machine learning approach. Eng Sci. 2021;16:221‐233. doi: 10.30919/es8d579 [DOI] [Google Scholar]
  • 27. Seger C. An investigation of categorical variable encoding techniques in machine learning: binary versus one‐hot and feature hashing. School of Electrical Engineering and Computer Science (EECS); 2018. https://www.diva‐portal.org/smash/get/diva2:1259073/FULLTEXT01.pdf
  • 28. Albayrak M, Turhan K, Kurt B. A missing data imputation using clustering and maximum likelihood estimation. Proceedings of the 2017 Medical Technologies National Congress (TIPTEKNO); 2017:242‐245: IEEE. doi: 10.1109/TIPTEKNO.2017.8238064 [DOI]
  • 29. Jadhav A, Pramod D, Ramanathan K. Comparison of performance of data imputation methods for numeric dataset. Appl Artif Intell. 2019;33:913‐933. doi: 10.1080/08839514.2019.1637138 [DOI] [Google Scholar]
  • 30. Azur MJ, Stuart EA, Frangakis C, Leaf PJ. Multiple imputation by chained equations: what is it and how does it work ? Int J Methods Psychiatr Res. 2011;20(1):40‐49. doi: 10.1002/mpr.329 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Zhang Z. Multiple imputation with multivariate imputation by chained equation (MICE) package. Ann Transl Med. 2016;4(2):30. doi: 10.3978/j.issn.2305-5839.2015.12.63 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Shamsolmoali P, Zareapoor M, Shen L, Sadka AH, Yang J. Imbalanced data learning by minority class augmentation using capsule adversarial networks. Neurocomputing. 2020;459:481‐493. doi: 10.1016/j.neucom.2020.01.119 [DOI] [Google Scholar]
  • 33. Zhu R, Guo Y, Xue JH. Adjusting the imbalance ratio by the dimensionality of imbalanced data. Pattern Recognit Lett. 2020;133:217‐223. doi: 10.1016/j.patrec.2020.03.004 [DOI] [Google Scholar]
  • 34. Douzas G, Bação F, Last F. Improving imbalanced learning through a heuristic oversampling method based on k‐means and SMOTE. Inform Sci. 2018;465:1‐20. doi: 10.1016/j.ins.2018.06.056 [DOI] [Google Scholar]
  • 35. Yavaş M, Güran A, Uysal M. Classification of Covid‐19 dataset by applying Smote‐based sampling technique. Eur J Sci Technol. 2020; Special Issue:258‐264. doi: 10.31590/ejosat.779952 [DOI] [Google Scholar]
  • 36. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over‐sampling technique. J Artif Intell Res. 2002;16:321‐357. doi: 10.48550/arXiv.1106.1813 [DOI] [Google Scholar]
  • 37. Xia W, Ma C, Liu J, et al. High‐resolution remote sensing imagery classification of imbalanced data using multistage sampling method and deep neural networks. Remote Sens. 2019;11(21):2523. doi: 10.3390/rs11212523 [DOI] [Google Scholar]
  • 38. Wiysobunri B, Erden H, Toreyin B. An ensemble deep learning system for the automatic detection of COVID‐19 in x‐ray images. 2020. https://spacing.itu.edu.tr/pdf/beltus‐ytb.pdf
  • 39. Alballa N, Al‐Turaiki I. Machine learning approaches in COVID‐19 diagnosis, mortality, and severity risk prediction: a review. Inform Med Unlocked. 2021;24:100564. doi: 10.1016/j.imu.2021.100564 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Muhammad LJ, Algehyne EA, Usman SS, Ahmad A, Chakraborty C, Mohammed IA. Supervised machine learning models for prediction of COVID‐19 infection using epidemiology dataset. SN Comput Sci. 2021;2(1):11. doi: 10.1007/s42979-020-00394-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Seliya N, Khoshgoftaar TM, Hulse JM. A study on the relationships of classifier performance metrics. Proceedings of the 2009 21st IEEE International Conference on Tools with Artificial Intelligence; 2009; IEEE. doi: 10.1109/ICTAI.2009.25 [DOI]
  • 42. Osareh A, Shadgar B. Machine learning techniques to diagnose breast cancer. Proceedings of the 2010 5th International Symposium on Health Informatics and Bioinformatics; 2010. doi: 10.1109/HIBIT.2010.5478895 [DOI]
  • 43. Donyavi Z, Asadi S. Diverse training dataset generation based on a multi‐objective optimization for semi‐supervised classification. Pattern Recognit. 2020;108:107543. doi: 10.1016/j.patcog.2020.107543 [DOI] [Google Scholar]



Articles from Concurrency and Computation are provided here courtesy of Wiley
