Abstract
The existence of widespread COVID-19 infections has prompted worldwide efforts to control and manage the virus, and hopefully curb it completely. One important line of research is the use of machine learning (ML) to understand and fight COVID-19. This is currently an active research field. Although there are already many surveys in the literature, there is a need to keep up with the rapidly growing number of publications on COVID-19-related applications of ML. This paper presents a review of recent reports on ML algorithms used in relation to COVID-19. We focus on the potential of ML for two main applications: diagnosis of COVID-19 and prediction of mortality risk and severity, using readily available clinical and laboratory data. Aspects related to algorithm types, training data sets, and feature selection are discussed. As we cover work published between January 2020 and January 2021, a few key points have come to light. The bulk of the machine learning algorithms used in these two applications are supervised learning algorithms. The established models are yet to be used in real-world implementations, and much of the associated research is experimental. The diagnostic and prognostic features discovered by ML models are consistent with results presented in the medical literature. A limitation of the existing applications is the use of imbalanced data sets that are prone to selection bias.
Keywords: Machine learning, COVID-19, Feature selection, Artificial intelligence, Diagnosis, Prognosis
1. Introduction
Coronaviruses are large RNA viruses that have been known since the mid-1960s. They typically cause mild to moderate upper respiratory tract illnesses, similar to the common cold [1,2]. Two well-known coronaviruses are severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV). SARS-CoV was identified in 2003, when it first appeared in the Guangdong province of southern China [3]. MERS-CoV originated in Saudi Arabia in 2012 [4]. In December 2019, new coronavirus infections appeared in the Chinese city of Wuhan, Hubei Province. On January 7, 2020, the novel virus was identified as the causative agent of COVID-19. Its symptoms may include fever, dry cough, myalgia, gastrointestinal symptoms, and anosmia [5]. From December 2019 to March 2020, the world witnessed a huge spread of COVID-19 infections, and the World Health Organization (WHO) declared a pandemic. According to the WHO [6], as of January 22, 2021, over 96 million COVID-19 cases and two million COVID-19 deaths had been reported globally.
Countries worldwide have been affected by the virus, resulting in various measures being enforced, including country lockdowns, curfews, and travel restrictions. Although common symptoms of COVID-19 infection are usually mild, for some patients the infection can cause serious, and occasionally deadly, complications. Managing the soaring numbers of COVID-19 cases is a huge challenge that has overwhelmed health care facilities worldwide; however, there is still insufficient information about the virus. Since the emergence of the COVID-19 infection, researchers from various disciplines have explored this novel virus. Machine learning (ML) is a branch of artificial intelligence (AI) that focuses on producing systems that are able to learn from examples and improve without being explicitly programmed [7]. ML has been applied successfully in many fields, including health care [8] and medical informatics [9]. One important research direction leverages ML to understand and fight COVID-19. Numerous lines of research have been initiated for the application and development of COVID-19-related ML algorithms. As of January 2021, a simple search of PubMed yielded 94,609 publications related to COVID-19.
A number of review papers have been published on the use of ML in COVID-19 research. Agbehadji et al. [10] summarized how big data platforms, AI models, and nature-inspired algorithms can be used for case detection and contact tracing of COVID-19. Bullock et al. [11] discussed how AI is used to address the challenges of COVID-19 at different scales, including molecular, medical, and epidemiological applications. Naudé [12] highlighted the actual and potential applications of AI in fighting COVID-19, including tracking and prediction, diagnosis and prognosis, treatments and vaccines, and social control. Albahri et al. [13] performed a systematic review of state-of-the-art techniques for developing COVID-19 prediction algorithms based on data mining and ML algorithms. Swapnarekha et al. [14] conducted a systematic analysis of the use of ML, deep learning, mathematical, and statistical approaches for COVID-19 prediction and screening. Lalmuanawma et al. [15] investigated existing ML techniques for COVID-19 screening, predicting, forecasting, contact tracing, and drug development. A recent survey by Tayarani-N [16] covered how AI approaches have been employed to tackle the pandemic, including clinical applications, processing of COVID-19-related images, and pharmaceutical and epidemiological studies. Various AI techniques have been investigated in the literature, including deep learning, ML, artificial neural networks (ANN), and evolutionary algorithms. An overview of publicly available COVID-19 data sets was also given in Ref. [16]. Wu et al. [17] surveyed the application of big data technology for preventing and managing COVID-19 in China.
The above surveys summarized ML methods applied in the context of COVID-19. However, this research field is very active and the number of related publications is growing rapidly. Currently, numerous ML algorithms are readily available in data analysis software, such as Weka [18], which makes them easy to apply without the need for technical expertise.
The aim of this paper is to highlight ongoing efforts to use state-of-the-art ML algorithms during this pandemic. The novelty of this study, as compared to the previously published reviews, is that we mainly focus on ML diagnostic and prognostic models using simple clinical and laboratory data that can be readily available (within an hour). These approaches represent faster and cheaper diagnostic alternatives to the reverse transcription polymerase chain reaction (RT-PCR) test, with comparable, although inferior, performance [19]. Moreover, predicting unfavorable outcomes of intensive care unit (ICU) admission or death as early as the time of admission is essential to optimize decision-making and prioritize the allocation of limited resources during peaks.
The main contributions of this work can be summarized as follows.
(i) We review the recent ML algorithms in this field and focus on their potential in two main applications: diagnosis of COVID-19 and prediction of mortality risk and severity, using simple clinical and laboratory data. (ii) We analyze the main features that were found to be the most relevant to these applications. (iii) Open issues and future lines of research are highlighted based on the findings of this survey.
The remainder of this paper is structured as follows. In Section 2, we briefly describe some of the basic state-of-the-art ML algorithms. Section 3 describes the methodology used to select the reviewed studies. Then, in Section 4, we review recent literature on using ML for diagnosis and prediction of severity and mortality risk in COVID-19. In Section 5, we discuss the factors that have been found to be relevant to classification tasks related to COVID-19 applications, and other issues of the existing models. Finally, in Section 6, we conclude the review and highlight future work. All the abbreviations used in this manuscript are presented in Table 1.
Table 1.
Nomenclature of abbreviations.
Acronyms | Definition |
---|---|
ALC | absolute lymphocyte count |
ALT | alanine aminotransferase |
AST | aspartate aminotransferase |
BUN | blood urea nitrogen |
CKD | chronic kidney disease |
CRP | C-reactive protein |
cTnT | cardiac troponin T |
cTnI | cardiac troponin I |
hsCRP | high-sensitivity C-reactive protein level |
IL-6 | interleukin-6 |
INR | international normalized ratio |
LDH | lactate dehydrogenase |
MCHC | mean corpuscular hemoglobin concentration |
MCV | mean corpuscular volume |
RDW | red blood cell distribution width |
WBC | white blood cells |
TWRF | trees weighting random forest |
NB | naive Bayes |
LR | logistic regression |
SVM | support vector machine |
KNN | k-nearest neighbors |
GBDT | gradient boosted decision tree |
XGBDT | extreme gradient boosted decision tree |
NN | neural network |
XGBoost | extreme gradient boosting |
DNN | deep neural networks |
DT | decision tree |
ET | extremely randomized trees |
PLS | partial least squares |
EN | elastic net |
FDA | bagged flexible discriminant analysis |
LASSO | least absolute shrinkage and selection operator |
BN | Bayesian network |
SGD | stochastic gradient descent |
MLP | multilayer perceptron |
2. Machine learning
ML is a branch of AI that focuses on producing systems that are able to learn from examples and improve without being explicitly programmed [7]. Over the years, the ML field has gained much popularity for solving numerous real-world problems. ML techniques can be divided into three main categories: supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, the algorithm is allowed to learn from a data set with pre-defined labels. Classification and regression are the two main types of supervised learning. By contrast, unsupervised algorithms attempt to learn from unlabeled data sets. The algorithms work by processing the unlabeled data set to extract features and identify patterns. Examples of unsupervised ML algorithms include clustering and dimensionality reduction of large and high-dimensional data sets. In reinforcement learning, the algorithm learns through trial and error. Thus, a reward and punishment mechanism is employed in the training phase.
Data in electronic health records (EHRs) can be complex, nonlinear, multidimensional, and heterogeneous. ML can assist in fully utilizing clinical data in EHRs to facilitate fact interrogation and complex decision-making [20]. In addition, ML algorithms can be trained using millions of patient EHRs, can learn extremely complex relationships among features, and can beat human capabilities in performing complex tasks such as classification of images and discerning patterns in historical data [21]. “Combining machine-learning software with the best human clinician ‘hardware’ will permit delivery of care that outperforms what either can do alone.” [22].
The present study focuses on two main applications: diagnosis of COVID-19 and prediction of mortality risk and severity. Both are usually formulated as classification (or regression) problems.
Well-known algorithms used for classification tasks for COVID-19 data sets include the following.
• Naive Bayes (NB) is a simple probabilistic classifier based on Bayes' theorem. Given a record X and m classes $C_1, \dots, C_m$, NB classification maximizes the posterior probability $P(C_i \mid X)$ using Bayes' theorem, as follows:
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)} \tag{1}$$
where $P(X \mid C_i)$, $P(C_i)$, and $P(X)$ may be estimated from the given data. The word naive refers to the main assumption of conditional independence: NB assumes that the attributes are independent of one another given the class. This is not necessarily true in real-world applications.
• Support vector machine (SVM) is a classification algorithm that transforms a training data set into a higher dimension [23]. It optimizes a hyperplane that separates the two classes with minimum classification errors. The hyperplane is represented as follows:
$$W \cdot X + b = 0 \tag{2}$$
where W is a weight vector, and b is a scalar denoting bias.
• Decision tree (DT) is an algorithm that produces a tree-structured model to describe the relationships between attributes and a class label [24]. It works by recursively dividing observations based on the most informative attribute, that is, the one with the highest gain ratio value, calculated as follows:
$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)} \tag{3}$$
where $\mathrm{SplitInfo}_A(D)$ denotes the potential information provided by partitioning the dataset, D, into v partitions and $\mathrm{Gain}(A)$ denotes the amount of information obtained by partitioning the dataset based on attribute A [25].
• Random forest (RF) is a DT ensemble method that creates multiple trees through a re-sampling process called bagging (bootstrap aggregation) [26]. Numerous DTs are constructed, each on a bootstrap sample drawn with replacement from the training data. Each node of a tree is split using a subset of the attributes selected at random. Class membership for a new example is identified as the most commonly predicted class from the (aggregated) DTs by a simple unweighted majority vote.
• AdaBoost, which stands for adaptive boosting, is an ensemble algorithm used to combine the results from multiple learning models using boosting [27]. It builds the models sequentially. Each successive model is boosted by re-weighting the instances in accordance with the outputs of the previous model.
• K-nearest-neighbor (KNN) is a classifier that learns by comparing a given unlabeled data point with the training data set [28]. It searches for the K most similar data points, referred to as the K nearest neighbors. A distance metric, such as Euclidean distance, is usually used to measure closeness. The algorithm then finds the most common class among these neighbors and assigns it to the given data point.
• Gradient-boosted DT (GBDT) is an ensemble method that sequentially builds a set of trees [29]. In each iteration, a new tree is fitted to correct the errors made by the trees built in the previous iterations. A GBDT comprises three elements: a loss function, a weak learner (e.g., a DT), and an additive model.
• Logistic regression (LR) models the probability of data points belonging to a certain class based on the values of independent features. It then uses the model to predict the probability that a given data point belongs to a certain class. Usually, the sigmoid function is used in building the regression model. It is assumed that the log-odds of class membership are a linear function of the features. LR is described as follows:
$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \tag{4}$$
where p is the probability that X belongs to class C and $\beta_0$ and $\beta_1$ are model parameters.
• Artificial neural networks (ANNs) are classification algorithms inspired by the structure and workings of the brain. A typical ANN consists of a set of connected units, where each connection is associated with a weight. Information is propagated across the network layers, where the output at layer n is calculated as follows:
$$y_n = f(W_n\, y_{n-1} + b_n) \tag{5}$$
where $f$ is the activation function used, $y_{n-1}$ is the output of the previous layer, and $W_n$ and $b_n$ are the weights and biases of layer n. During the learning process, connection weights are adjusted in order to improve the prediction accuracy of the network. Weights are adjusted using the error calculated during back propagation as follows:
$$w \leftarrow w - \eta\, \frac{\partial E}{\partial w} \tag{6}$$
where $E$ is the prediction error and $\eta$ is the learning rate.
• Extremely randomized trees (ET) [30] is a tree induction ensemble algorithm. It differs from other tree-based ensemble methods in two main respects: it splits nodes by selecting cutoff points fully at random, and it uses the entire learning sample (rather than a bootstrap replica) to grow the trees.
Machine learning algorithms and their main features are summarized in Table 2.
Table 2.
Machine learning algorithms and their main features.
ML Algorithm | Basic Idea | Features |
---|---|---|
NB | Probabilistic classifier | Cannot handle missing data, stable performance [25]. |
SVM | Hyperplane optimization | Highly accurate models, less likely to suffer from overfitting, used for prediction and classification tasks. |
DT | Tree-structured model | Robust, suited to categorical data, easy to interpret. |
RF | DT ensemble method | Effective for highly complex problems, best for high-dimensional data sets, can handle missing data and imbalanced data sets. |
AdaBoost | Ensemble algorithm | Improves the performance of individual weak classifiers, sensitive to noise. |
KNN | Based on a distance metric to measure the distance between data points. | Choice of a distance metric affects performance; known as lazy learner, as it does not perform any analysis until it is presented with a testing data point. |
GBDT | Ensemble tree induction, seeks to produce a model that minimizes the loss function | Highly flexible [31]. |
LR | Predicts the probability that a given data point belongs to a certain class | Easy calculation, can handle continuous numerical values, cannot handle non-linear data. |
ANN | Inspired by networks of biological neurons | Highly accurate models, difficult to interpret the model (black-box models), requires a large number of parameters. |
ET | Ensemble tree induction | Good performance, easy to implement, less computational time, fewer optimization parameters [32]. |
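To make the DT splitting criterion described above more concrete, the following minimal sketch computes the gain ratio of a single categorical attribute, i.e., the information gain divided by the split information. The function names and the eight-patient toy data are hypothetical and are not taken from any of the reviewed studies; the sketch only illustrates the arithmetic.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of one categorical attribute with respect to the class labels."""
    total = len(labels)
    # Group the class labels by attribute value (the v partitions of D).
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    # Information gain: entropy before the split minus weighted entropy after it.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: fever status and test outcome for eight patients.
fever = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
outcome = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
print(round(gain_ratio(fever, outcome), 3))  # prints 0.189
```

In this toy example the attribute separates the classes only partially, so the gain ratio is well below 1; an attribute that split the two classes perfectly into two equal groups would score 1.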
3. Methodology
In this review, we performed search queries on online databases including PubMed, Scopus, IEEE Xplore, and Google Scholar. We also screened the reference lists of the included articles to find other relevant studies. The search terms included COVID-19, SARS-CoV-2, machine learning, artificial intelligence, diagnosis, prognosis, mortality, severity, and laboratory. We focused on ML-based approaches used for predicting COVID-19 diagnosis and the prognosis of mortality and severity, using only simple clinical and laboratory data that were readily available from a public health agency. We excluded studies that used computed tomography (CT) scans and X-rays in their prediction models. We considered studies published in English between January 2020 and January 2021, and retrieved 645 results by searching the databases. We then removed duplicates and screened the remaining studies. We identified 52 studies (with 76 models) that met the eligibility criteria and thus were included in this review. Fig. 1 describes the study selection process.
Fig. 1.
Flowchart of the study selection process.
4. ML applications for COVID-19
ML has been applied successfully in numerous fields, including finance [33], manufacturing [34], transportation [35], and education [36]. Of particular interest are the applications of ML in health care [8]. AI and ML can be used to improve diagnosis, prognosis, monitoring, and administration of treatments to enhance patients' health outcomes [37]. Since the beginning of the COVID-19 outbreak, there has been a growing interest in using ML to tackle the pandemic. In this section, we review some of the work done using ML for the diagnosis of COVID-19 and for mortality risk prediction. A schematic showing the relationship between ML approaches and the applications reviewed in this article is presented in Fig. 2.
Fig. 2.
Relationships between ML approaches and COVID-19 applications reviewed in this article.
4.1. Diagnosing COVID-19
With the continuing increase in numbers of COVID-19 infections, it has become extremely important to identify patients as early as possible in order to control the spread of the disease. The current technique for detecting COVID-19 is RT-PCR [38]. In this test, specimens are first collected from the upper or lower respiratory system of a patient. Then, the RNA is extracted by following a pre-defined protocol. Once the RNA strand has been extracted, PCR amplification is performed.
Although RT-PCR is considered the gold standard for COVID-19 detection, it has numerous limitations. The execution of the RT-PCR test requires laboratory settings with specialist equipment and trained staff [19]. A single RT-PCR run is expensive and takes approximately 4–5 h. Generally, the PCR machine is run with batches of samples in order to reduce costs. False negative tests have been well documented, with estimated rates between 2% and 33% in repeat sample testing [39]. A false negative result has undesired consequences, as it leads to further spread owing to the patient not being isolated.
CT scans have been widely explored as a complement or an alternative to RT-PCR tests. Some CT findings are suggestive of COVID-19 [40,41,42]; however, they cannot rule out or confirm the diagnosis. There are also the drawbacks of exposing patients to unnecessary irradiation [43] and overwhelming the health system's limited resources. For these reasons, the American College of Radiology and the Centers for Disease Control recommend not using chest radiographs (CXR) or CT scans for screening or first-line diagnosis of COVID-19 [44].
Although many patients may develop mild or no symptoms, there is a risk of transmission of the virus from these asymptomatic or mildly symptomatic individuals. Thus, it is critical for health care providers to have tools for prediction and early diagnosis of COVID-19. In ML terms, the task is usually formulated as a classification problem. ML models are trained to be able to classify a patient as COVID-19 positive or negative. In order to mitigate the limitations of RT-PCR tests and CT scans, numerous attempts have been made to utilize ML algorithms for the detection of COVID-19.
On the other hand, clinical data and routine blood tests can represent a faster and cheaper diagnostic alternative with comparable, although inferior, performance [19]. Although viral testing is still the only specific method of diagnosis [44], accurate and fast models can be incredibly valuable during a pandemic peak to mitigate shortages of reference tests and to slow down the outbreak by early isolation of potential COVID-19 patients [19,45,46]. These models can also be used to cross-check RT-PCR tests where false negative results are well documented [47,48].
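The diagnostic models summarized in the rest of this section are mostly reported in terms of sensitivity, specificity, positive predictive value (PPV), and accuracy. As a reminder of how these quantities relate to the counts of correct and incorrect predictions, the following minimal sketch derives them from binary labels. The function name and the ten-patient toy data are hypothetical, and the sketch assumes that both classes and at least one positive prediction are present.

```python
def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that are detected
        "specificity": tn / (tn + fp),  # share of non-cases correctly cleared
        "ppv": tp / (tp + fp),          # share of positive calls that are right
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical toy data: four infected and six non-infected patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
print(diagnostic_metrics(y_true, y_pred))
```

Here three of the four infected patients are detected (sensitivity 0.75), four of the six non-infected patients are cleared (specificity of about 0.67), and three of the five positive calls are correct (PPV 0.6).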
The models reported in this review employed various ML approaches to predict COVID-19 diagnosis and prognosis. Some studies used a single model, whereas others used several models, selected the best one, or employed a combination of approaches to build their prediction model. The following is a summary of the diagnostic models classified by the selected ML approach.
4.1.1. XGBoost model
Li et al. [46] developed a classification model based on XGBoost to discriminate between influenza and COVID-19 patients. The model predicts the presence of COVID-19 based on patient symptoms and routine test results. They re-analyzed COVID-19 data from 151 published studies, encompassing clinical data from 413 patients. They found that age, CT scan results, temperature, lymphocyte levels, fever, and cough were the most important features of their prediction model. The prediction results achieved a sensitivity of 92.5% and a specificity of 97.9%.
In Slovenia, Kukar et al. [49] utilized RF, deep neural networks (DNN), and XGBoost. The algorithms were used to develop models to predict COVID-19 diagnosis using routine blood test results, age, and sex. The authors used a data set from 5333 patients, of whom 160 were positive, admitted to the Department of Infectious Diseases, University Medical Centre Ljubljana. XGBoost performed the best, achieving an Area under the ROC Curve (AUC) of 97%, sensitivity of 81.9%, and specificity of 97.9%. The highest-ranking features found by the model to predict COVID-19 diagnosis were mean corpuscular hemoglobin concentration (MCHC), eosinophil count, albumin, international normalized ratio (INR), and prothrombin activity percentage. According to the study, the features that contributed the most to differentiation between COVID-19 and bacterial infections were: urea, hemoglobin, erythrocyte count, hematocrit, and leukocyte count. The top-ranking features for differentiating between COVID-19 disease and other viral infections were MCHC, eosinophil ratio, prothrombin, INR, prothrombin activity percentage, and creatinine.
Bayat et al. [48] developed a model to predict COVID-19 based on standard laboratory tests. A large data set consisting of 75,991 patients (7335 positive) was obtained from the US Department of Veterans Affairs. The study employed XGBoost to build the model, which achieved a specificity of 86.8%, a sensitivity of 82.4%, and an overall accuracy of 86.4%. The study concluded that the top 10 features in descending order of importance were: serum ferritin, white blood cell (WBC) count, eosinophil count, patient temperature, C-reactive protein (CRP), serum lactate dehydrogenase (LDH), D-dimer, basophil count, monocyte percentage, and serum aspartate aminotransferase (AST).
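The figures quoted above also illustrate how class imbalance shapes the headline metrics. The following worked example combines the reported sensitivity and specificity with the share of positives in the data set. It assumes, as an illustration only, that the evaluation data had the same proportion of positives as the full data set (7335 of 75,991), which is not stated here; the resulting PPV is therefore an illustrative estimate, not a result reported by the study. The function names are hypothetical.

```python
def accuracy_from_rates(sensitivity, specificity, prevalence):
    """Overall accuracy implied by sensitivity, specificity, and prevalence."""
    return sensitivity * prevalence + specificity * (1 - prevalence)


def ppv_from_rates(sensitivity, specificity, prevalence):
    """Positive predictive value implied by the same three quantities."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Assumption: the evaluation set mirrors the full data set (7335 of 75,991 positive).
prevalence = 7335 / 75991
accuracy = accuracy_from_rates(0.824, 0.868, prevalence)
ppv = ppv_from_rates(0.824, 0.868, prevalence)
print(f"prevalence={prevalence:.3f} accuracy={accuracy:.3f} ppv={ppv:.3f}")
```

Under this assumption the implied accuracy is about 86.4%, in line with the reported value, whereas the implied PPV is only about 40%: with roughly one positive in ten patients, even a specificity of 86.8% yields more false positives than true positives.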
4.1.2. RF model
Wu et al. [50] built an assistant discrimination tool using RF to quickly and accurately identify COVID-19 patients based on key blood indices from clinical blood test data. The data set, containing a total of 253 samples from 169 suspected COVID-19 patients, was collected from multiple sources in China. In addition, 105 consecutive positive samples were collected from 27 patients with confirmed COVID-19 cases. Initially, 49 features were employed to build the model in order to evaluate the importance of each feature. Based on the top-ranking features, 11 were selected as final input indicators. Changes in several of these laboratory parameters have been widely reported as important clinical references, including total bilirubin, glucose, creatinine, LDH, creatine kinase isoenzyme (CK-MB), and potassium. However, there were also features that had not received extensive attention, including total protein, calcium, magnesium, platelet distribution width (PDW), and basophils. The study found that these features played an irreplaceable part in the RF algorithm, indicating that they have great potential as diagnostic markers in future clinical practice. The method's performance on an independent test set was consistent with that achieved on the training set, with an AUC of 99.26%, a sensitivity of 100%, and a specificity of 94.44%.
Brinati et al. [51] developed a model to predict COVID-19 diagnosis. A data set from 279 patients (177 positive) from IRCCS Ospedale San Raffaele, Italy, was used. The study employed several ML approaches: DT, ET, KNN, RF, LR, NB, SVM, and trees weighting RF (TWRF). The RF classifier was found to achieve the best performance, with an AUC of 84%, accuracy of 82%, sensitivity of 92%, and specificity of 65%. The most important features in predicting COVID-19 diagnosis were: AST, lymphocytes, LDH, WBC, eosinophils, alanine transaminase (ALT), and age.
Tschoellitsch et al. [52] developed a model using an RF ML algorithm to predict the diagnosis of COVID-19 based on routine blood tests. A data set of 1528 patients (65 positives) was employed to build the model, which achieved an accuracy of 81%, an area under the receiver operating characteristic curve (AUC) of 0.74, a sensitivity of 60%, and a specificity of 82%. The study found that the most important features in predicting diagnosis were: leukocyte count, red blood cell distribution width (RDW), hemoglobin, and serum calcium.
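Because the AUC is reported for most of the models in this review, sometimes as a fraction (e.g., 0.74) and sometimes as a percentage (e.g., 74%), it may help to recall what it measures: the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. The following minimal sketch computes it by direct pairwise comparison; the function name, scores, and labels are hypothetical toy values.

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical model scores for three positive and three negative patients.
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 0, 0, 0]
print(round(auc(scores, labels), 3))  # prints 0.778
```

Seven of the nine positive/negative pairs are ordered correctly, giving an AUC of 7/9. A value of 0.5 corresponds to random ranking, and the measure does not depend on any particular decision threshold.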
4.1.3. LR model
Joshi et al. [53] developed an LR model to predict COVID-19 PCR positivity based on complete blood count (CBC) components and patient sex. The model was trained using 33 records of positive cases and 357 of negative cases from Stanford Health Care. The study selected three CBC components (absolute neutrophil count, absolute lymphocyte count, and hematocrit) and male sex as features in the model. Validation was conducted on a data set of 236 positive cases and 2052 negative cases. The model achieved a C-statistic of 78%, sensitivity ranging between 86% and 93%, and specificity ranging between 35% and 55%. The authors explained that the model could be restricted to predicting positive patients, thereby enabling a 33% increase in appropriately allocated resources.
Shoer et al. [54] developed a prediction model based on nine simple survey questions. The study used a data set derived from a national symptom survey answered over two million times in Israel. A total of 43,752 adults were included, of whom 498 self-reported being COVID-19 positive. The survey questions were related to age, gender, prior medical conditions, smoking habits; and self-reported symptoms including fever, sore throat, cough, shortness of breath, and loss of taste or smell. The model was trained using an LR algorithm and achieved an AUC value of 0.737.
Tordjman et al. [55] used a data set of 400 patients (258 positive) from three different hospitals in France: Cochin Hospital, Paris; Ambroise Paré Hospital, Boulogne; and Raymond Poincaré Hospital, Garches. The study employed a binary LR algorithm to build a scoring model to predict the probability of a positive COVID-19 diagnosis. The model achieved an AUC of 88.9%, a sensitivity of 80.3%, and positive predictive value (PPV) of 92.3%. The study suggested that four biological variables were highly associated with COVID-19 diagnosis: lymphocytes, eosinophils, basophils, and neutrophils.
4.1.4. Other ML models
Yang et al. [45] used patient demographic features (age, sex, race) and 27 routine laboratory tests for COVID-19 detection. The study used a data set of 5893 patients (with only 1402 RT-PCR positive cases) from the New York Presbyterian Hospital/Weill Cornell Medicine, USA. The study employed LR, DT, and RF, as well as GBDT, which was selected as it achieved the best performance. The model was validated using an independent data set, where it achieved an AUC value of 0.838, a sensitivity value of 0.758, and a specificity value that reached 0.740. The results indicated that higher inflammatory markers—such as LDH, ferritin, and CRP—led to positive prediction. Moreover, lower levels of lymphocyte count were found to drive negative predictions.
Soltan et al. [56] developed two models to detect COVID-19 patients at an early stage using routinely collected data that can be available within 1 h (laboratory tests, blood gas, and vital signs). The study used a data set of 114,957 patients, of whom 437 were positive, from emergency and acute medical services at Oxford University Hospitals. The ML algorithms used were LR, RF, and extreme gradient boosted tree (XGBDT), the latter of which yielded the best performance. The emergency department model achieved an AUC of 93.9%, a sensitivity of 77.4%, and a specificity of 95.7% for all patients attending hospital; the admissions model achieved an AUC of 94%, a sensitivity of 77.4%, and a specificity of 94.8% for patients admitted to hospital. According to the model, the highest-ranking features were: laboratory blood markers (CRP, eosinophils, and basophils), vital signs (oxygen requirement and respiratory rate), and blood gas measurements (calcium and methemoglobin). The admission model, however, demonstrated markedly higher weights for CRP and WBC counts and lower weights for blood gas measurements.
In Brazil, de Moraes et al. [57] developed a model using a data set of 235 patients, of whom 102 were positive cases. The study employed neural networks (NN), RF, SVM, GBDT, and LR. The SVM algorithm was selected as it achieved the best performance: an AUC of 85%, a sensitivity of 68%, and a specificity of 85%. The model suggested that the three most important variables for predicting diagnosis were the numbers of lymphocytes, leukocytes, and eosinophils.
Alakus and Turkoglu [58] developed a model for the detection of COVID-19. The authors used a data set of 600 patients with 18 features obtained from Hospital Israelita Albert Einstein at Sao Paulo, Brazil. The features represent laboratory test results including hematocrit, hemoglobin, platelets, and red blood cell count. The performance of six deep learning architectures was evaluated: ANN, convolutional neural network (CNN), long short-term memory (LSTM), recurrent neural network (RNN), CNNLSTM, and CNNRNN. The study concluded that the LSTM had the best performance when 10-fold cross-validation was performed, achieving an accuracy of 86.66%, recall of 99.42%, and AUC of 62.50%.
4.1.5. Combinations of several ML approaches
Cabitza et al. [19] trained ML algorithms for the detection of COVID-19 positive patients. The following five algorithms were used: RF, NB, LR, SVM, and KNN. This study used three different training data sets for 1624 patients admitted to San Raffaele Hospital, Italy. The complete data set consisted of 72 features describing numerous aspects of patient records, including CBC; biochemical, coagulation, blood gas analysis, and CO-oximetry values; age; sex; and specific symptoms at triage. The other two sub-data-sets consisted of 32 and 21 features, respectively. The performance, evaluated in terms of the AUC of the models, ranged from 0.83 to 0.90. The internal–external validation obtained good results, with AUCs ranging from 0.75 to 0.78 and specificity values from 0.92 to 0.96. The study reported LDH, AST, CRP, and calcium as the most important features. Age was also found to be a significant predictor. In addition, fibrinogen, cross-linked fibrin degradation products (XDPs), and WBC were among the essential features.
An ensemble of seven traditional ML algorithms was developed by Goodman-Meza et al. [59] for the diagnosis of COVID-19. The authors used RF, LR, SVM, multilayer perceptron, stochastic gradient descent, XGBoost, and AdaBoost. The study was conducted on a data set extracted from electronic medical records of the UCLA Health System (Los Angeles, CA, USA). The data set consisted of 1455 records, with 1273 negative and 182 positive cases. Each case was described using demographic and laboratory features. Experimental analysis revealed that inclusion of inflammatory markers could improve the prediction of the model. In particular, the most important features for the diagnosis of COVID-19 were CRP and LDH. Although the proposed model achieved a high sensitivity value, it had a high number of false positives (low PPV value). The model's specificity, that is, its ability to correctly identify negative cases, was 64%.
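The link between high sensitivity, moderate specificity, and a low PPV on an imbalanced data set can be checked with a short calculation. The sketch below applies Bayes' rule to the sensitivity (93%) and specificity (64%) listed in Table 3 and to the class ratio of the full data set (182 of 1455). It assumes, for illustration only, that this prevalence also holds in the evaluation set; the PPV actually reported by the study may differ. The function name `predictive_values` is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Assumption: prevalence equals the positive fraction of the whole data set.
prevalence = 182 / 1455
ppv, npv = predictive_values(sensitivity=0.93, specificity=0.64, prevalence=prevalence)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

Under these assumptions most positive predictions are false positives, because negatives outnumber positives by about seven to one and roughly a third of them are misclassified.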
AlJame et al. [60] also used routine blood tests to predict COVID-19. A publicly available data set of 5644 patients in Brazil, with only 559 COVID-19 positives, was used. Only 18 features were included, on the basis of other clinical studies that had demonstrated the importance of these features. The features, ordered from the highest- to the lowest-weighted according to the model results, were: monocytes, platelets, leukocytes, urea, potassium, eosinophils, hemoglobin, lymphocytes, CRP, creatinine, AST, sodium, neutrophils, INR, age, basophils, ALT, and, finally, albumin. The model used extra trees, RF, and LR as a first level, and then used the predictions of this level as input features for a second level that employed XGBoost in order to boost the performance. The resultant performance was higher than that of other models on the same data set. The model achieved an overall accuracy of 99.88%, AUC of 99.38%, sensitivity of 98.72%, and specificity of 99.99%.
Feng et al. [61] developed a model for early diagnosis of COVID-19 patients. A data set of 132 patients (26 positive) was obtained from the Chinese People's Liberation Army General Hospital in Beijing. The study compared the performance of several ML approaches and found that LR with the least absolute shrinkage and selection operator (LASSO) achieved the best performance in the external validation set and testing set, with AUCs of 0.938 and 0.841, recall of 1.000 and 1.000, and specificity of 0.778 and 0.727, respectively. The model suggested that the most important features in predicting diagnosis were: age, interleukin-6 (IL-6), systolic blood pressure, monocyte ratio, and fever classification.
Soares et al. [62] developed a model to predict diagnosis of COVID-19. The study used a data set of 599 patients (81 positives) from Brazil. The model was trained using a combination of three techniques: SVM, SMOTEBoost, and ensembling. The model achieved a specificity of 92.16%, NPV of 95.29%, and sensitivity of 63.98%.
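SMOTEBoost builds on the synthetic minority oversampling technique (SMOTE), which addresses class imbalance such as the 81 positives among 599 patients above. The following is a minimal sketch of the SMOTE interpolation step only, not of SMOTEBoost and not of the authors' implementation. The function name `smote`, the parameter values, and the two-feature sample points are made up for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority samples by interpolating toward near neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point (excluding itself).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Made-up minority-class samples with two features each.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority, n_synthetic=6)
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays within the region already occupied by observed positives.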
4.2. Predicting mortality risk and severity
Early identification of high-risk COVID-19 patients is essential, as it can facilitate the establishment of more responsive health care systems and ensure instant intervention and intensive care, thereby improving patient outcomes. Moreover, early recognition of critical patients can help to mitigate the burden on health systems, enabling them to prioritize the allocation of limited resources during peaks and optimize decision-making [20].
Several prognostic scores that aim to improve clinical decision making were widely used for respiratory infections before COVID-19 and have been validated by national and international guidelines [63,64]. For example, the CURB-65 score (confusion, urea, respiratory rate, blood pressure, and age of 65 years or older) and the pneumonia severity index are widely used in predicting 30-day mortality. Another such score is A-DROP, a modified version of CURB-65.
The National Early Warning Score 2 predicts death or ICU admission within 24 h. The quick Sequential [sepsis-related] Organ Failure Assessment score predicts mortality and ICU admission among patients with suspected infection in emergency departments and in ward settings.
However, there is insufficient information available regarding the validity of these scores in the COVID-19 setting, and some of them have been found to underestimate mortality compared with their original validation in non-COVID-19 patients [63].
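To make the structure of such rule-based scores concrete, the sketch below computes CURB-65 and the quick SOFA (qSOFA) score. The thresholds are the commonly published cut-offs for these scores and are not taken from the studies reviewed here; the function names and the example patient are hypothetical.

```python
def curb65(confusion, urea_mmol_l, resp_rate, systolic_bp, diastolic_bp, age):
    """CURB-65 score (0 to 5): one point for each criterion met."""
    return sum([
        bool(confusion),                         # new-onset confusion
        urea_mmol_l > 7,                         # urea above 7 mmol/L
        resp_rate >= 30,                         # respiratory rate of 30/min or more
        systolic_bp < 90 or diastolic_bp <= 60,  # low blood pressure
        age >= 65,                               # age of 65 years or older
    ])


def qsofa(altered_mentation, resp_rate, systolic_bp):
    """qSOFA score (0 to 3): one point for each criterion met."""
    return sum([
        bool(altered_mentation),  # altered mental status
        resp_rate >= 22,          # respiratory rate of 22/min or more
        systolic_bp <= 100,       # systolic pressure of 100 mmHg or less
    ])


# Hypothetical patient, for illustration only.
print(curb65(confusion=False, urea_mmol_l=8.2, resp_rate=32,
             systolic_bp=105, diastolic_bp=70, age=71))  # prints 3
print(qsofa(altered_mentation=False, resp_rate=32, systolic_bp=105))  # prints 1
```

Unlike the ML models reviewed below, these scores use fixed thresholds and equal weights, which makes them easy to apply at the bedside but leaves no room to adapt them to COVID-19 cohorts.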
Fast and accurate prognostic tools are needed to stratify COVID-19 patients early, predict mortality and critical outcomes, and wisely direct limited resources to patients most in need; this is particularly crucial during pandemic peaks. According to Pollack [65], “severity of illness is defined as the extent of physiological decompensation or organ system loss of function; in contrast, risk of mortality refers to the likelihood of dying.” In ML, the task of predicting mortality risk and severity is usually formulated as a classification problem. ML algorithms are built to predict whether or not confirmed COVID-19 patients will develop critical complications. The following is a summary of the reported prognostic models classified by the selected ML approach.
4.2.1. XGBoost model
Vaid et al. [66] developed a model to predict mortality and critical events in hospitalized COVID-19 patients. The study used a data set of 4098 patients from five hospitals in New York City, USA. The ML algorithm XGBoost and baseline comparator models were used to build the prediction model. For mortality, the model achieved AUC-ROC values of 0.89 at 3 days, 0.85 at 5 and 7 days, and 0.84 at 10 days; for critical event prediction, it achieved AUC-ROC values of 0.80 at 3 days, 0.79 at 5 days, 0.80 at 7 days, and 0.81 at 10 days. The study found that the critical features for predicting mortality were older age, anion gap, and CRP, while the strongest predictors of a critical event at 7 days were acute kidney injury on admission, elevated LDH, tachypnea, and hyperglycemia.
Yan et al. [67] developed a model to predict criticality and mortality in COVID-19 patients. The study used a data set of 375 patients (201 survivors) from Tongji Hospital in Wuhan. The ML algorithm XGBoost was employed and achieved an accuracy of 93%. With this model, the key features for predicting mortality risk were LDH, lymphocytes, and high-sensitivity CRP (hs-CRP).
Yan et al. [68] developed a model to predict the survival of COVID-19 patients. A data set of 404 infected patients was used (213 survivors and 191 non-survivors) from Tongji Hospital in Wuhan, China. The study employed the XGBoost classifier, which recognized that the most discriminative features of patient survival were LDH, lymphocytes, and hs-CRP. The model achieved an accuracy exceeding 90%.
Wang et al. [69] developed two models to predict mortality in COVID-19 patients. The study used a data set from 296 patients (of whom 19 died) from the First People's Hospital of Jiangxia District in Wuhan, China. The ML algorithm XGBoost was employed to build the models. The clinical model was built using age, history of hypertension, and coronary heart disease and achieved an AUC of 83%. The laboratory model was established using age, hsCRP, oxygen saturation (SpO2), neutrophil and lymphocyte count, D-dimer, AST, and glomerular filtration rate. This model achieved a better performance, with an AUC of 88% in the validation cohort.
Rechtman et al. [70] developed a model to predict COVID-19 mortality in the New York hospital system. The study used a data set of 8770 patients (1114 non-survivors) and selected the XGBoost algorithm to build the prediction model, which achieved an AUC of 0.86. The study found that the risk factors for COVID-19 mortality were older age, male sex, higher body mass index (BMI), higher respiratory rate, higher heart rate, and chronic kidney disease (CKD).
Bertsimas et al. [71] developed the COVID-19 Mortality Risk tool using the XGBoost algorithm to predict mortality. The tool was built using a data set of 3927 COVID-19 positive patients from 33 different hospitals across Europe and the US. The model was validated using three validation cohorts and achieved AUC values ranging from 0.81 to 0.92. The study found that the primary risk factors for mortality were increased age, decreased oxygen saturation, elevated CRP levels, blood urea nitrogen (BUN), and blood creatinine.
Guan et al. [72] developed a model to predict COVID-19 mortality in a retrospective cohort study. A data set of 1270 patients was employed to build the model using the XGBoost ML algorithm. The model could accurately predict death risk, with a precision exceeding 90%, a sensitivity exceeding 85%, and an F1 score exceeding 0.90. The study found that disease severity, age, and serum levels of hs-CRP, LDH, ferritin, and IL-10 were the most significant predictors of death risk in patients with COVID-19.
4.2.2. SVM
Booth et al. [73] developed an ML model to predict mortality in COVID-19-positive patients using only a multiplex of serum biomarkers that could be quickly obtained from most clinical chemistry laboratories. The data set from the University of Texas Medical Branch was collected from 398 patients (355 survivors and 43 non-survivors from COVID-19) to predict death up to 48 h in advance. The study employed the ML techniques LR and SVM to build the prediction model. From the 26 parameters that were initially collected, the five highest-weighted laboratory values were then selected: CRP, BUN, serum calcium, serum albumin, and lactic acid. The SVM model achieved 91% sensitivity and 91% specificity (AUC 0.93) for predicting patient death. The study suggested that CRP, lactic acid, and serum calcium had the most substantial impact and made the greatest contribution to model results when considered over the entire data set.
Sun et al. [74] developed a model to predict severe symptoms in COVID-19 patients. A data set of 336 patients from Shanghai Public Health Clinical Center was used. The ML algorithm SVM was employed to build the prediction model. Out of 220 clinical and laboratory features, the model selected four features that made the greatest contribution—age, GSH, CD3 ratio, and total protein. The model achieved an AUC of 97.57%.
Yao et al. [75] developed a model to predict the severity of COVID-19 using blood or urine test data. The data set consisted of 137 patients (75 severely ill) from the Tongji Hospital Affiliated to Huazhong University of Science and Technology. The ML algorithm SVM was used to build the severity detection model, which achieved an accuracy of 81.48%. The highest-ranking features detected by the model were age, blood test values (neutrophil percentage, calcium, and monocyte percentage), and urine test values (urine protein, red blood cells (occult), and pH (urine)).
Zhao et al. [76] developed a model for prediction of severity in patients with moderate COVID-19. Six key features were eventually selected out of 22 features using univariate and multivariate LR models. The study employed the SVM algorithm to build the prediction model, which achieved an accuracy of 91.38%, a sensitivity of 0.90, and a specificity of 0.94. The top six features for predicting severity were: IL-6, high-sensitivity cardiac troponin I (cTnI), procalcitonin, hsCRP, chest distress, and calcium.
4.2.3. LR
Hu et al. [77] developed an ML model for early prediction of the mortality risk of COVID-19 patients. A data set of 183 patients (115 survivors and 68 non-survivors from COVID-19) from the Sino-French New City Branch of Tongji Hospital, Wuhan, was used to build the prediction model. In addition, a total of 64 patients (33 survivors and 31 non-survivors from COVID-19) from the Optical Valley Branch of Tongji Hospital, Wuhan, were used to externally validate the final predictive model. Demographic, clinical, and first laboratory data after admission were extracted from patients' medical records. The study initially attempted 10 methods and then selected five of them to report (LR, partial least squares (PLS) regression, elastic net (EN) model, RF, and bagged flexible discriminant analysis (FDA)) according to their performance and properties. The LR model, RF, and bagged FDA yielded similar performance, as measured by the AUC. LR was selected as the final model because of its simplicity and high interpretability. The four most essential variables selected by the models were age, hsCRP level, lymphocyte count, and D-dimer level. The performance of the model was evaluated using both 10-fold cross-validation on the training data set and independent testing using the external validation set. The AUC, sensitivity, and specificity reached 89.5%, 89.2%, and 68.7% during cross-validation and 88.1%, 83.9%, and 79.4% with independent testing, respectively. The study found that non-survivors were more likely to be male and older than survivors. Moreover, levels of all the inflammatory factors were higher in the non-survivors than in the survivors. In particular, levels of hsCRP and D-dimer were more than six times and almost three times higher in non-survivors than in survivors, respectively. Conversely, the lymphocyte count was almost twice as high in the survivors as in the non-survivors.
The study offers a web tool to calculate a risk score based on the four selected variables (age, hsCRP, lymphocyte count, and D-dimer), which could enable the adoption of more interventions at an early stage.
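A risk score of this kind is, in essence, a logistic function applied to a weighted sum of the selected variables. The sketch below shows the form of such a calculation. The intercept and coefficients are hypothetical placeholders chosen only so that their signs match the direction of the associations described above; they are not the values fitted by Hu et al. [77], and the units are likewise assumed.

```python
import math

# HYPOTHETICAL values for illustration; NOT the coefficients fitted by Hu et al. [77].
INTERCEPT = -4.0
COEFFICIENTS = {
    "age": 0.05,               # years; older age raises the risk
    "hs_crp": 0.01,            # assumed mg/L; higher hsCRP raises the risk
    "lymphocyte_count": -1.0,  # assumed 10^9/L; higher count lowers the risk
    "d_dimer": 0.1,            # assumed mg/L; higher D-dimer raises the risk
}


def mortality_risk(patient):
    """Map the four variables to a probability with the logistic function."""
    linear = INTERCEPT + sum(
        weight * patient[name] for name, weight in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Made-up patient records.
younger = {"age": 45, "hs_crp": 10.0, "lymphocyte_count": 1.5, "d_dimer": 0.5}
older = {"age": 78, "hs_crp": 120.0, "lymphocyte_count": 0.6, "d_dimer": 3.0}
print(mortality_risk(younger), mortality_risk(older))
```

The interpretability cited as the reason for choosing LR comes from this form: each coefficient states how one variable shifts the log-odds of death when the others are held fixed.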
Zhao et al. [78] developed a risk-score model to predict mortality and ICU admission. The study used a data set of 641 laboratory-confirmed COVID-19 patients (195 admitted to the ICU, 82 died) from Stony Brook University Hospital, USA. Symptoms, comorbidities, demographics, laboratory findings, vital signs, and imaging findings were all compared with those of non-critical COVID-19 patients to identify the most significant variables predicting the two outcomes. The study employed LR and achieved good accuracy, with an AUC of 0.83 for mortality prediction and 0.74 for ICU admission prediction on the testing data set. The study found that the common top predictors of mortality and ICU admission were elevated LDH, procalcitonin, and reduced SpO2. Moreover, a reduced lymphocyte count and smoking history were among the top predictors of ICU admission but were not associated with increased mortality in this study. On the other hand, cardiopulmonary parameters (i.e., history of heart failure, chronic obstructive pulmonary disease (COPD), elevated heart rate) were among the top predictors of mortality in COVID-19 patients, but not of ICU admission.
Huang et al. [79] developed a model to predict progression to severe symptoms among COVID-19 patients. The study used a data set of 125 COVID-19 patients (93 mild disease, 32 severe) from Guangzhou Eighth People's Hospital, China. The ML algorithm LR was employed to build the model, which achieved an AUC of 94.4%, a sensitivity of 94.1%, and a specificity of 90.2%. The authors explained that while as many as 17 features notably differed between the mild and severe groups at the time of admission, according to the model, only four factors were independently associated with progression to a severe condition: comorbidities, respiratory rate, CRP, and LDH.
Xie et al. [80] developed a model to predict mortality among COVID-19 patients. The study used a data set of 444 patients from two different hospitals (Tongji Hospital and Jinyintan Hospital, Wuhan, China). The LR algorithm was employed to build the prediction model, which achieved C-statistics of 0.89 and 0.98 for internal and external validation, respectively. The model identified four independent factors for predicting mortality: age, lymphocyte count, LDH, and SpO2. However, the authors explained that a few factors were not available in the study's cohorts and were thus not included in the model; these factors included D-dimer and organ-specific injury markers (including cTnI, ALT, and BUN). This might have affected the choice of the proposed prognostic factors.
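For a binary outcome, the C-statistic (concordance statistic) is equal to the AUC reported by most other studies in this section: it is the fraction of (non-survivor, survivor) pairs in which the non-survivor received the higher predicted risk, with ties counted as one half. A minimal sketch follows; the function name `c_statistic` and the toy scores are illustrative only.

```python
def c_statistic(scores, outcomes):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    concordant = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                concordant += 1.0
            elif pos == neg:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


# Toy example: two positives (1) and two negatives (0).
scores = [0.9, 0.8, 0.3, 0.1]
outcomes = [1, 0, 1, 0]
print(c_statistic(scores, outcomes))  # prints 0.75 (3 of 4 pairs ranked correctly)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why values are sometimes written as fractions (0.89) and sometimes as percentages (89%) across the reviewed studies.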
Zhou et al. [81] developed a model to predict the severity of infection in COVID-19 patients. The study used a data set of 377 patients (172 severe, 106 non-severe) from the Central Hospital of Wuhan, China. The LR model was employed to build the prediction model, which achieved an AUC of 87.9%, a specificity of 73.7%, and a sensitivity of 88.6%. The results suggested that three independent factors were associated with severity in patients with COVID-19: age, CRP, and D-dimer. Moreover, the product N/L × CRP × D-dimer was found to be a significant predictor of the severity of the disease.
Zhu et al. [82] developed a model to assess the severity of infection among COVID-19 patients. The study used a data set of 127 patients (16 severe) from Hwa Mei Hospital, University of Chinese Academy of Sciences, Ningbo, China. LR was employed to build the risk prediction model, which achieved an AUC of 90.0%. The study observed significant increases in neutrophil percentage, neutrophil-to-lymphocyte ratio, fibrinogen, SA, CRP, IL-6, IL-10, IFN-γ, pO2, and pCO2, and decreases in lymphocyte percentage, lymphocyte count, and platelet count in the severe group. The independent risk factors for assessing the severity were high levels of peripheral blood cytokine IL-6, CRP, and hypertension. The study emphasized the essential role of IL-6 in the severity of COVID-19.
Gong et al. [83] developed a model to predict the risk of progression to severe COVID-19. The study used a data set of 372 hospitalized patients from China. Five ML approaches were employed: LASSO regression, LR, DT, RF, and SVM. The LR model was selected for further analysis to build the model, which achieved an AUC of 0.853, sensitivity of 77.5%, and specificity of 78.4%. The study identified seven features associated with higher odds of severe COVID-19: older age, elevated serum LDH, CRP, coefficient of variation of RDW, BUN, direct bilirubin, and lower albumin on admission.
Aloisio et al. [84] developed a model to predict mortality and ICU admission among COVID-19 patients. A data set of 427 patients (89 deaths) was obtained from the ‘Luigi Sacco’ academic hospital in Milan, Italy. The study employed univariate and multivariate LR analysis. The univariate analysis revealed that the parameters cTnT, LDH, CRP, albumin, D-dimer, and ferritin were all associated with higher probabilities of death and intensive care. On the other hand, the multivariate analysis revealed that age, high serum concentration of LDH, and low serum concentration of albumin were significantly associated with death.
Liu et al. [85] developed a model to predict mortality in COVID-19 patients using a data set of 336 severely ill patients (34 of whom died) from China. The study employed multivariable LR to build the prediction model, which achieved an AUC of 99.4%, a sensitivity of 100.0%, and a specificity of 97.2%. The study found that decreased lymphocyte ratio, elevated BUN, and D-dimer were highly associated with death.
Miao et al. [86] developed a model for early prediction of in-hospital mortality of COVID-19 patients, using a data set of 1018 patients who were confirmed to have COVID-19. The study employed univariate and multivariate LR analyses to build prediction models and found that the model combining increased IL-6 and decreased CD8+ T cell count achieved the best performance, with an AUC of 0.907.
Han [87] developed a model to identify the risk factors that led to progression to severe COVID-19. A data set of 47 patients (24 severe) with confirmed COVID-19 from Renmin Hospital of Wuhan University was used. The study employed LR analysis and found that APACHE II, SOFA, lymphocytes, CRP, LDH, AST, cTnI, and BNP were significant independent risk factors for COVID-19 severity. Moreover, it concluded that LDH had tremendous potential as a prognostic factor for early recognition of lung injury and severe COVID-19 events, with an AUC of 0.9727, maximum sensitivity of 100.0%, and specificity of 86.67%. In addition, lymphocyte counts—particularly CD3, CD4, and CD8 T cells—were also closely associated with the severity of COVID-19, with an AUC of 0.9845, maximum specificity of 91.30%, and sensitivity of 95.24%. Further, the study recognized that LDH was positively correlated with CRP, AST, BNP, and cTnI but negatively correlated with lymphocyte cells and their subsets.
4.2.4. Other models or combinations of several models
Li et al. [88] developed a model to predict the mortality risk of COVID-19 patients, given the patient's underlying health conditions, age, sex, and other factors. Two publicly available data sets were used: the GitHub data set, consisting of 28,958 cases (530 deaths) after processing; and the Wolfram data set, consisting of 1448 records (123 deaths). Better results were achieved with the Wolfram data set than with the GitHub data set, which lacked precise information and contained mostly generalized information on each case, thereby limiting the prediction capability of models. Several ML and data learning approaches were employed: LR, RF, SVM, one-class SVM, isolation forest, local outlier factor, and autoencoder. The autoencoder model achieved the best results: approximately 73% AUC, 97% accuracy, 97% specificity, and 40% sensitivity. The results indicated that having a chronic disease or gastrointestinal, kidney, cardiac, or respiratory symptoms was markedly associated with patient death.
Terwangne et al. [89] developed a model named COVID-19 EPI-SCORE to predict the severity classification of patients hospitalized with COVID-19. The study also aimed to assess the WHO COVID-19 severity classification. The data set used was obtained from 295 RT-PCR-positive COVID-19 patients hospitalized in Epicura Hospital Center, Belgium. An ML approach, Bayesian network analysis, was employed to build the EPI-SCORE model and predict the accuracy of WHO severity classification. The six features that contributed most and that were automatically selected by the model were WHO severity classification, acute kidney injury, age, LDH, lymphocytes, and activated partial thromboplastin time (aPTT). The model obtained ROC curve indices of 83.8% and 91% for the models based on WHO classification only and for EPI-SCORE, respectively. The study demonstrated that the WHO severity classification is a reliable tool for predicting severe outcomes among COVID-19 patients. Moreover, adding a few clinical and laboratory features could remarkably enhance its performance, as per the COVID-19 EPI-SCORE model.
Izquierdo et al. [90] developed a model to predict ICU admission using an ML data-driven algorithm. The study used a data set of 10,504 COVID-19 patients (1353 hospitalized, 83 admitted to ICU) from the general population of the region of Castilla-La Mancha (Spain), which included clinical information regarding the diagnosis, progression, and outcome of the infection. A DT algorithm was employed. The model achieved accuracy, recall, and AUC values of 0.68, 0.71, and 0.76, respectively. The three variables that contributed most to predicting ICU admission were age, fever, and tachypnea with or without respiratory crackles.
Liang et al. [91] developed a risk score model to predict which patients are likely to develop a critical illness. A data set of 2300 patients from 575 hospitals was included (1590 patients for training the model and 710 other patients for validation). The study used LASSO regression to help with variable selection; 19 variables out of 72 were identified as important predictors and then used in LR models to build the risk score prediction model. Only the 10 most significant predictors were included in the risk score: CXR abnormality, age, hemoptysis, dyspnea, unconsciousness, number of comorbidities, cancer history, neutrophil-to-lymphocyte ratio, LDH, and direct bilirubin. The model achieved an AUC of 0.88 in both the training and validation cohorts. The risk score model is available at http://118.126.104.170/.
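LASSO performs variable selection because its L1 penalty shrinks weak coefficients exactly to zero. The sketch below illustrates this with coordinate descent on a squared-error loss. It is a simplified stand-in, not the procedure of Liang et al. [91], whose final model was an LR risk score; the function names, the penalty value, and the toy data are assumptions made for illustration.

```python
def soft_threshold(value, penalty):
    """Shrink `value` toward zero by `penalty`; return 0.0 inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=100):
    """Minimize (1/2n) * ||y - X b||^2 + penalty * ||b||_1 one coordinate at a time."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][m] * beta[m] for m in range(p) if m != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, penalty) / scale
    return beta


# Toy data: y = 2 * x1 + 0.1 * x2, with centered, mutually orthogonal features.
X = [[-2.0, 1.0], [-1.0, -2.0], [0.0, 2.0], [1.0, -2.0], [2.0, 1.0]]
y = [-3.9, -2.2, 0.2, 1.8, 4.1]
beta = lasso_coordinate_descent(X, y, penalty=0.5)
print(beta)  # the weak second feature is dropped (coefficient exactly 0.0)
```

In this toy case an unpenalized fit would keep a small coefficient for the second feature, whereas the penalized fit discards it and shrinks the first; the same mechanism reduces a large candidate set, such as the 72 variables above, to a short list.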
Levy et al. [92] developed the NOCOS calculator, which predicts 7-day survival in COVID-19 patients. A data set of 11,095 patients (2596 non-survivors) from the Northwell Health system facilities, USA, was used. The study employed LASSO regression to build the prediction model, which achieved AUCs of 0.86, 0.82, and 0.82, respectively, for internal, external, and validation. The model found that the optimal predictors for survival were: serum BUN, age, absolute neutrophil count, RDW, SpO2, and serum sodium.
Nemati et al. [93] built a prediction model to estimate the duration of hospital stay of patients with COVID-19 using patients’ clinical information. A data set of 1182 hospitalized patients was obtained from an open-access data set collected by a group of researchers from different universities and research laboratories. Several statistical analysis methods and ML approaches were used to implement different survival analysis models. According to the results, the stagewise gradient-boosting survival model delivered the most accurate discharge-time forecast, with a C-index of 71.47. The findings indicated that discharge probabilities were lower for males and older age groups.
Li et al. [94] developed a model to predict the mortality of COVID-19. Three ML algorithms (GBDT, LR, and simplified LR) were trained and validated using a data set of 2924 patients, including 257 non-survivors. The GBDT achieved the highest AUC in fivefold cross-validation, 0.941. The study found that leukomonocyte (%), urea, age, and SpO2 were the best predictors of mortality.
Gao et al. [20] developed a mortality risk prediction model for COVID-19 that uses patients’ clinical data in EHRs on admission to enable accurate and expeditious mortality risk stratification of patients up to 20 days in advance. The data set included 2520 consecutive COVID-19 patients with known outcomes (discharge or death) from two affiliated hospitals of Tongji Medical College, Huazhong University of Science and Technology, including Sino-French New City Campus of Tongji Hospital and Optical Valley Campus of Tongji Hospital, and The Central Hospital of Wuhan. The model was built using four ML methods—LR, SVM, GBDT, and NN. From 34 features, the study selected 14 for modeling, eight of which had a positive association with mortality (high risk: consciousness, male sex, sputum, BUN, respiratory rate, D-dimer, number of comorbidities, and age), whereas six features were negatively correlated with mortality (low risk: platelet count, fever, albumin, SpO2, lymphocyte, and CKD). The model was validated in an internal validation cohort and two external validation cohorts, where it achieved AUCs of 96.21%, 97.60%, and 92.46%, respectively.
Tables 3 and 4 summarize the reviewed studies on the diagnosis of COVID-19 and on the prediction of its severity and mortality risk, respectively.
Table 3.
Diagnosing COVID-19.
Study; outcome | Highest-weighted features | ML approaches | Sample size (no. of positive cases) | Performance |
---|---|---|---|---|
Li et al. [46]; diagnosis of COVID-19 and discrimination between influenza and COVID-19 | Age, CT scan result, temperature, lymphocyte, fever, coughing | XGBoost | 413 patients (−) | Sensitivity of 92.5% and specificity of 97.9% |
Kukar et al. [49]; diagnosis of COVID-19 | MCHC, eosinophils count, albumin, INR, prothrombin activity % | RF, DNN, and XGBoost (selected) | 5333 patients (160 positive) | AUC of 97%, sensitivity of 81.9%, specificity of 97.9% |
Bayat et al. [48]; diagnosis of COVID-19 | Ferritin, WBC, eosinophil, temperature, CRP, LDH, D-dimer, basophil count, monocyte %, AST (in descending order of importance) | XGBoost | 75,991 patients (7335 positive) | Accuracy of 86.4%, specificity of 86.8%, sensitivity of 82.4% |
Schwab et al. [95]; diagnosis of COVID-19 | Missing arterial lactic acid, age, leukocyte count, platelets, creatinine | LR, NN, RF, SVM, and XGBoost (selected) | 5644 patients (556 positive) | XGBoost model achieved AUC of 0.66, sensitivity of 75%, and specificity of 49% |
Wu et al. [50]; diagnosis of COVID-19 | Total bilirubin, glucose, creatinine, LDH, CK-MB, potassium, total protein, calcium, magnesium, PDW, basophils | RF | 253 samples from 169 suspected patients (105 samples from 27 patients confirmed positive) | AUC of 99.26%, a sensitivity of 100%, and a specificity of 94.44% with an independent test set |
Brinati et al. [51]; diagnosis of COVID-19 | AST, lymphocytes, LDH, WBC, eosinophils, ALT, age | DT, ET, KNN, LR, NB, SVM, TWRF, and RF (selected) | 279 patients (177 positive) | AUC of 84%, accuracy of 82%, sensitivity of 92%, PPV of 83%, and specificity of 65% |
Tschoellitsch et al. [52] | Leukocyte count, RDW, hemoglobin, serum calcium | RF | 1528 patients (65 positive) | Accuracy of 81%, area under the ROC curve of 0.74, sensitivity of 60%, and specificity of 82% |
Tordjman et al. [55]; diagnosis of COVID-19 | Lymphocyte, eosinophil, basophil, and neutrophil cell count | Binary LR | 400 total patients (258 positive) | AUC of 88.9%, sensitivity of 80.3%, and PPV of 92.3% |
Shoer et al. [54]; diagnosis of COVID-19 | Age, gender, prior medical conditions, smoking habits, fever, sore throat, cough, shortness of breath, loss of taste or smell | LR | 43,752 surveys (498 self-reported COVID-19 positive) | AUC of 0.737 |
Joshi et al. [53]; diagnosis of COVID-19 | Neutrophil count, absolute lymphocyte count, hematocrit, male sex | LR | 2777 patients (368 PCR positive) | C-statistic of 78%, sensitivity of 86–93%, and specificity of 35–55% |
Yang et al. [45]; diagnosis of COVID-19 | LDH, ferritin, CRP, calcium, lymphocytes | LR, DT, RF, and GBDT (selected) | 5893 patients (1402 positive) | AUC of 83.8%, sensitivity of 75.8%, and specificity of 74% with an independent data set |
Soltan et al. [56]; diagnosis of COVID-19 | Eosinophils, basophils, CRP, calcium, presentation oxygen requirement, respiratory rate | LR, RF, and XGBDT (selected) | 114,957 patients (437 positive) | Emergency department and admissions models: AUCs of 88.1% and 87.1%, and accuracies of 92.3% and 92.5%, respectively |
Alakus and Turkoglu [58]; diagnosis of COVID-19 | [not mentioned] | ANN, CNN, RNN, CNNLSTM, CNNRNN, and LSTM (selected) | 600 patients (80 positive) | AUC of 62.50%, accuracy of 86.66%, recall of 99.42% |
Cabitza et al. [19]; diagnosis of COVID-19 | Age, LDH, AST, CRP, calcium, fibrinogen, XDPs, WBC | RF, NB, LR, SVM, and KNN | 1624 patients (52% COVID-19 positive) | AUC ranged from 83% to 90% |
Goodman-Meza et al. [59]; diagnosis of COVID-19 | Inflammatory markers, especially LDH, CRP, and the combination of CRP, LDH, and ferritin | RF, LR, SVM, multilayer perceptron, stochastic gradient descent, XGBoost, and AdaBoost | 1455 records (182 positive) | AUC of 91%, sensitivity of 93%, specificity of 64% |
AlJame et al. [60]; diagnosis of COVID-19 | Monocytes, platelets, leukocytes, urea, potassium, eosinophils, hemoglobin, lymphocytes, CRP (from highest to lowest) | RF, extra trees, and LR as a first level, then XGBoost for the second level | 5644 patients (559 positive) | AUC of 99.38%, sensitivity of 98.72%, and specificity of 99.99% |
Feng et al. [61]; diagnosis of COVID-19 | Age, IL-6, systolic blood pressure, monocyte %, fever classification | LR, Ridge regularization, DT, AdaBoost, and LASSO regression (selected) | 132 patients (26 positive) | AUC of 84.1%, F1 score of 0.571, recall of 1.000, specificity of 0.727, and precision of 0.400 |
Soares et al. [62]; diagnosis of COVID-19 | [not mentioned]. All 16 features used: mean platelet volume, leukocytes, MCV, creatinine, red blood cells, basophils, monocytes, potassium, lymphocytes, MCHC, RDW, sodium, MCHC, eosinophils, CRP, urea | SVM, SMOTEBoost, and ensembling | 599 patients (81 positive) | Specificity of 92.16%, NPV of 95.29%, and sensitivity of 63.98% |
Table 4.
Mortality risk and severity of COVID-19 prediction.
Study | Outcome | Highest-weighted features | ML approaches | Sample size (no. of survivors and non-survivors) | Performance |
---|---|---|---|---|---|
Vaid et al. [66] | Prediction of mortality and critical events | Acute kidney injury, LDH, tachypnea, glucose, diastolic blood pressure, CRP | XGBoost | 4098 patients (−) | AUC of 80% at 3 days, 79% at 5 days, 80% at 7 days, and 81% at 10 days |
Yan et al. [67] | Prediction of mortality and critical COVID-19 | LDH, lymphocytes, hsCRP | XGBoost | 375 patients (201 survivors, 174 non-survivors) | Accuracy of 93% |
Yan et al. [68] | Prediction of mortality and critical COVID-19 | LDH, lymphocytes, hsCRP | XGBoost | 404 patients (213 survivors and 191 non-survivors) | Accuracy of 90% |
Wang et al. [69] | Prediction of mortality | Age, hsCRP, SpO2, neutrophil and lymphocyte count, D-dimer, AST, GFR | XGBoost | 296 patients (19 non-survivor) | AUC of 88% |
Rechtman et al. [70] | Prediction of mortality | Age, male sex, higher BMI, higher respiratory rate, higher heart rate, CKD | LR, XGBoost (selected) | 8770 patients (1114 non-survivors) | AUC of 86% |
Bertsimas et al. [71] | Prediction of mortality | Age, SpO2, CRP, BUN, blood creatinine | XGBoost | 3927 patients (−) | AUC ranged between 92% and 81% using three validation cohorts. |
Guan et al. [72] | Prediction of mortality | Severity, age, serum levels of hs-CRP, LDH, ferritin, IL-10 | XGBoost | 1270 patients (−) | Precision >90%, sensitivity >85%, and F1 scores >0.90 |
Booth et al. [73] | Prediction of mortality risk | CRP, lactic acid, calcium, BUN, serum albumin | LR and SVM (selected) | 398 patients (355 survivors and 43 non-survivors from COVID-19) | AUC 93%, 91% sensitivity, and 91% specificity |
Sun et al. [74] | Prediction of critical COVID-19 | Age, GSH, CD3 ratio, total protein | SVM | 336 patients (26 severe/critical) | AUC of 97.57% |
Yao et al. [75] | Prediction of critical COVID-19 | Age, neutrophil %, calcium, monocyte %, urine test values (urine protein, red blood cells (occult), and pH (urine)) | SVM | 137 patients (75 severe) | Accuracy of 81.48% |
Zhao et al. [76] | Prediction of critical COVID-19 | IL-6, high-sensitivity cTnI, procalcitonin, hsCRP, chest distress, calcium | SVM | 172 patients (60 severe) | SVM achieved accuracy of 91.38%, sensitivity of 90% and specificity of 94% |
Schwab et al. [95] | Prediction of ICU admission | pCO2, creatinine, pH | SVM | 556 patients (35 admitted, 16 ICU) | AUC of 98%, a sensitivity of 80%, and a specificity of 96% |
Hu et al. [77] | Prediction of mortality risk | Age, hsCRP, lymphocyte count, D-dimer | PLS regression, EN model, RF, FDA, and LR (selected) | 183 patients (115 survivors and 68 non-survivors from COVID-19) | AUC of 88.1%, sensitivity of 83.9%, and specificity of 79.4% |
Zhao et al. [78] | Prediction of mortality | Heart failure, procalcitonin, LDH, COPD, SpO2, heart rate, age | LR | 641 patients (195 admitted to the ICU, 82 non-survivors) | AUC of 82% |
Zhao et al. [78] | Prediction of ICU admission | LDH, procalcitonin, smoking history, SpO2, lymphocyte count | LR | 641 patients (195 admitted to the ICU, 82 non-survivors) | AUC of 74% |
Huang et al. [79] | Prediction of critical COVID-19 | Comorbidities, respiratory rate, CRP, LDH | LR | 125 patients (32 severe) | AUC of 94.4%, sensitivity of 94.1%, and specificity of 90.2%. |
Xie et al. [80] | Prediction of mortality risk | Age, lymphocyte count, LDH, SpO2 | LR | 444 patients [299 training, 145 validation, 155/299 and 69/145 non-survivors] | C-statistics of 0.89 and 0.98 for internal and external validation, respectively |
Zhou et al. [81] | Prediction of critical COVID-19 | Age, CRP, D-dimer, product of N/L × CRP × D-dimer | LR | 377 patients (172 severe, 106 non-severe) | AUC of 87.9%, specificity of 73.7% and sensitivity of 88.6% |
Zhu et al. [82] | Prediction of critical COVID-19 | IL-6, CRP, hypertension | LR | 127 patients (16 severe) | AUC of 90.0% |
Gong et al. [83] | Prediction of critical COVID-19 | Older age; higher LDH, CRP, RDW, BUN, and direct bilirubin; lower albumin | LASSO regression, DT, RF, and SVM, and LR (selected) | 372 patients (72 severe) | AUC of 85.3%, a sensitivity of 77.5%, and specificity of 78.4% |
Aloisio et al. [84] | Prediction of mortality and critical COVID-19 | cTnT | Univariate LR | 427 patients (89 non-survivors) | AUC of 94% |
Aloisio et al. [84] | Prediction of mortality and critical COVID-19 | LDH | Univariate LR | 427 patients (89 non-survivors) | AUC of 89% |
Aloisio et al. [84] | Prediction of mortality and critical COVID-19 | CRP | Univariate LR | 427 patients (89 non-survivors) | AUC of 87% |
Aloisio et al. [84] | Prediction of mortality and critical COVID-19 | Albumin | Univariate LR | 427 patients (89 non-survivors) | AUC of 87% |
Aloisio et al. [84] | Prediction of mortality and critical COVID-19 | D-dimer | Univariate LR | 427 patients (89 non-survivors) | AUC of 84% |
Aloisio et al. [84] | Prediction of mortality and critical COVID-19 | Ferritin | Univariate LR | 427 patients (89 non-survivors) | AUC of 77% |
Aloisio et al. [84] | Prediction of mortality and critical COVID-19 | Age, high LDH, low albumin | Multivariate LR | 427 patients (89 non-survivors) | AUC of 88%–89%. |
Liu et al. [85] | Prediction of mortality | Decreased lymphocyte ratio, elevated BUN, raised D-dimer | Multivariate LR | 336 severe patients (34 non-survivors) | AUC of 99.4%, sensitivity of 100.0% and specificity of 97.2% |
Miao et al. [86] | Prediction of mortality | IL-6 and lymphocyte subsets (CD8+ T cell) | LR | 1018 patients (−) | AUC of 90.7% |
Bai et al. [96] | Prediction of mortality and outcome | Creatine kinase | LR | 127 patients (36 non-survivors) | AUC of 86.4% |
Bai et al. [96] | Prediction of mortality and outcome | CRP | LR | 127 patients (36 non-survivors) | AUC of 87% |
Bai et al. [96] | Prediction of mortality and outcome | Ferritin | LR | 127 patients (36 non-survivors) | AUC of 83.3% |
Bai et al. [96] | Prediction of mortality and outcome | IL-6 | LR | 127 patients (36 non-survivors) | AUC of 78.1% |
Bai et al. [96] | Prediction of mortality and outcome | Lymphocyte CD3+ | LR | 127 patients (36 non-survivors) | AUC of 91.5% |
Bai et al. [96] | Prediction of mortality and outcome | LDH | LR | 127 patients (36 non-survivors) | AUC of 92.8% |
Bai et al. [96] | Prediction of mortality and outcome | Troponin I | LR | 127 patients (36 non-survivors) | AUC of 93.9% |
Bai et al. [96] | Prediction of mortality and outcome | Prothrombin time | LR | 127 patients (36 non-survivors) | AUC of 92% |
Bai et al. [96] | Prediction of mortality and outcome | Procalcitonin | LR | 127 patients (36 non-survivors) | AUC of 90% |
Han [87] | Prediction of critical COVID-19 | Gender, APACHE II, SOFA, lymphocytes (including subsets), CRP, LDH, AST, cTnT, BNP, WBC, neutrophil count, urea | LR | 47 patients (24 severe) | (not specified) |
Han [87] | Prediction of critical COVID-19 | LDH | LR | 47 patients (24 severe) | AUC of 97.27%, sensitivity 100.00% and specificity 86.67% |
Han [87] | Prediction of critical COVID-19 | AST | LR | 47 patients (24 severe) | AUC of 92.31% |
Han [87] | Prediction of critical COVID-19 | CRP | LR | 47 patients (24 severe) | AUC of 92.92% |
Han [87] | Prediction of critical COVID-19 | Lymphocyte counts (less than 1.045 × 10^9/L) | LR | 47 patients (24 severe) | AUC of 98.45%, specificity 91.30% and sensitivity 95.24% |
Han [87] | Prediction of critical COVID-19 | SOFA score | LR | 47 patients (24 severe) | AUC of 94.93% |
Han [87] | Prediction of critical COVID-19 | CT score | LR | 47 patients (24 severe) | AUC of 95.28% |
Das et al. [97] | Prediction of mortality risk | Sex, age | SVM, KNN, RF, GB, and LR (selected) | 3524 patients (74 non-survivors) | AUC of 83% |
Li et al. [88] | Prediction of mortality risk | Having a chronic disease; gastrointestinal, kidney, cardiac, respiratory symptoms | Autoencoder, LR, RF, SVM, one-class SVM, isolation forest, local outlier factor | Two data sets: A) 28,958 patients (530 non-survivors) B) 1448 patients (123 non-survivors) | Autoencoder model achieved around 73% AUC, and 97% accuracy. |
Terwangne et al. [89] | Prediction of severity | WHO severity classification, acute kidney injury, age, LDH, lymphocytes, aPTT | Bayesian network analysis | 295 patients (−) | ROC of 83.8% and 91% for the models based on WHO classification only, and EPI-SCORE, respectively. |
Izquierdo et al. [90] | Prediction of ICU admission | Age, fever, tachypnea with or without respiratory crackles | DT | 10,504 (1353 hospitalized, 83 ICU admission) | AUC of 76%, accuracy 68%, and recall 71% |
Liang et al. [91] | Prediction of critical COVID-19 | Age, hemoptysis, unconsciousness, comorbidities, cancer history, neutrophil-to-lymphocyte ratio, LDH, direct bilirubin | LASSO then LR | 2300 patients (−) | AUCs of 88% in both the training and validation cohorts |
Levy et al. [92] | Prediction of critical COVID-19 | BUN, age, absolute neutrophil count, RDW, SpO2, serum sodium | LASSO | 11,095 patients (8499 survivors, 2596 non-survivors) | AUCs of 86%, 82%, and 82%, respectively for internal and external validation |
Nemati et al. [93] | Survival analysis and discharge time | Age, sex | stagewise GB, IPCRidge, CoxPH, Coxnet, Componentwise GB, fast SVM, and fast Kernel SVM | 1182 patients (−) | C-index of stagewise GB: 71.47 |
Li et al. [94] | Prediction of mortality | Leukomonocyte %, urea, age, SpO2 | LR, simplified LR, and GBDT (selected) | 2924 patients (257 non-survivors) | AUC of 94.1% |
Gao et al. [20] | Prediction of mortality risk, up to 20 days in advance | Increased consciousness, male sex, sputum, BUN, respiratory rate, D-dimer, comorbidities, age. Also decreased platelet count, albumin, SpO2, lymphocytes, CKD | LR, SVM, GBDT, and NN | 2520 COVID-19 patients with known outcomes (survivors or non-survivors) | AUC ranging from 91.86% to 97.62% in an internal validation cohort and two external validation cohorts |
5. Discussion
Research efforts for the prediction of COVID-19 can be classified into statistical and data analytics methods [98]. Experimental results for various statistical and mathematical models reveal that they provide poor predictions as they are incapable of handling large amounts of data [98]. Regarding data analytics methods, the applications are mostly based on historical data only and do not consider external factors that affect the spread of the infection, such as population and median age index [98]. To construct prediction approaches for COVID-19, ML and AI techniques have been widely used.
The current paper demonstrates the great potential of ML in tackling the novel COVID-19 crisis by facilitating complex decision-making and fact interrogation. We note that the existing applications of ML methods for COVID-19 diagnosis and for severity and mortality risk prediction are based on supervised learning techniques. This is mainly because the tasks at hand are usually formulated as classification problems. Supervised learning is easy to understand and readily available on numerous data analytics platforms.
We observe that the most widely used algorithm for COVID-19 diagnostic and prognostic models is LR, followed by XGBoost, then SVM. Among prognostic models, LR was by far the most frequently selected, whereas LR, RF, SVM, and XGBoost were all popular among the reported diagnostic models. Fig. 3 shows a summary of the ML algorithms used in the reported studies.
Fig. 3.
Summary of the ML algorithms selected by the reported studies.
Most of the work in the literature has been experimental, and the resulting models have not been deployed in real-world applications. The reported ML models have exhibited promising predictive ability; however, they are impeded by several limitations. The available data sets may suffer from selection bias. The prognosis studies mostly encompass inpatients, who are usually sicker, whereas the diagnosis studies typically involve patients who already exhibit symptoms consistent with COVID-19. More data are needed on asymptomatic individuals and those with mild symptoms, who might not visit the hospital or report their suspicion. Such individuals are still contagious and may pose a risk to the community by spreading the disease. They might also deteriorate to a severe condition before receiving care that might otherwise have improved their outcome. Moreover, the majority of the studies reviewed in this paper employed imbalanced data sets, that is, those where most records in the training data set represent the negative class and the positive class is under-represented. Thus, the reported performance of various ML algorithms applied in the context of COVID-19 may have been affected by bias. A high accuracy value in such cases could reflect a model that correctly identifies negative samples while missing most or all of the positive COVID-19 cases. More effort is required to handle imbalanced data sets prior to the application of ML to COVID-19. The predictive performance of the models might also differ when using representative data that incorporates the targeted population; this merits further investigation.
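The effect of class imbalance on accuracy can be made concrete with a small sketch. The cohort below is hypothetical (1000 patients, 50 positive) and is not taken from any reviewed study; the function names are illustrative. A trivial classifier that labels every patient as negative reaches high accuracy while detecting no positive case, which is why sensitivity, specificity, and balanced accuracy are worth reporting alongside accuracy.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 = positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def summarize(y_true, y_pred):
    """Compute accuracy together with imbalance-aware metrics."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical imbalanced cohort: 50 positive and 950 negative patients.
y_true = [1] * 50 + [0] * 950

# A degenerate "model" that predicts the negative class for everyone.
always_negative = [0] * len(y_true)

metrics = summarize(y_true, always_negative)
print(metrics)
```

In this sketch the accuracy is 95% even though sensitivity is zero, and the balanced accuracy of 50% shows that the classifier is no better than chance.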
In this review, we focused on ML models that used only readily available clinical and laboratory data, which could be made available quickly, i.e., within an hour. This represents a faster and cheaper diagnostic alternative to the RT-PCR test, with comparable, although inferior, performance [19].
Early identification of COVID-19 patients has huge implications for controlling the spread of the disease by isolating potential patients at an early stage. Moreover, such identification can be used to detect asymptomatic COVID-19 patients [48], where clinical abnormalities might be mild [99,100].
Although these predictions alone cannot rule out COVID-19, they might provide some insights to health care providers in areas with limited resources (e.g., limited supplies of personal protective equipment) [101]. They can also be incredibly valuable during a pandemic peak to mitigate the shortage of reference tests. Moreover, they can be used to cross-check RT-PCR tests where false-negative results are well documented [39,47,48].
In addition, early prediction of severity and mortality helps to prioritize high-risk individuals, provide them with the best possible care, and hopefully improve their outcomes. It can also reduce pressure on health care systems, support decision-making, and enable limited resources to be wisely utilized.
In order to enhance the interpretability of prediction models, reduce their complexity, and improve their accuracy, numerous studies have employed feature selection, where only the subset of relevant features that contribute most to the prediction output is selected and used to build the final classification model [102]. Feature selection methods are usually divided into filter-based, wrapper-based, and hybrid methods. In filter-based feature selection methods, a feature-dependent score is calculated or a feature-searching algorithm is applied [103]. The most popular filter criteria are entropy, information gain, gain ratio, Gini index, and chi-squared statistics. Although simple and fast, these methods are known to have lower accuracy compared with wrapper methods. For example, only 70% accuracy was achieved with a DT model for prediction of the clinical severity of COVID-19 using the Gini index and gain ratio [104]. In wrapper-based feature selection methods, a subset of features is selected such that the classification error is minimized. An example of a wrapper-based method is forward selection, which begins with an empty feature subset and adds features iteratively until no further improvement is observed in the classification accuracy. Research suggests that wrapper-based methods perform better than filter-based methods, while being more complex and less time efficient. Hybrid feature selection combines filter- and wrapper-based ideas and integrates feature selection into the learning algorithm itself. LASSO is an example of a hybrid method for feature selection that has been used in the context of COVID-19 [61,83,91].
With the intention of highlighting commonly applied features, we explored the top-ranking features from the reviewed prediction models. Figs. 4 and 5 summarize the most frequently reported predictors of COVID-19 diagnosis and those of mortality and severity, respectively.
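The difference between a filter score and a wrapper search can be sketched on toy data. Everything below is an illustrative assumption rather than the procedure of any reviewed study: the binary features `ldh_high`, `crp_high`, and `noise` are hypothetical, the data are synthetic, and a simple majority-vote lookup classifier evaluated with leave-one-out accuracy stands in for whatever classifier a real study would use. The filter step scores each feature independently by information gain; the wrapper step performs greedy forward selection.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels):
    """Filter score: reduction in label entropy after splitting on one feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a majority-vote lookup on the chosen features."""
    correct = 0
    for i, (row, y) in enumerate(zip(rows, labels)):
        key = tuple(row[f] for f in features)
        votes = Counter(
            labels[j] for j, other in enumerate(rows)
            if j != i and tuple(other[f] for f in features) == key
        )
        if not votes:  # unseen feature combination: fall back to overall majority
            votes = Counter(labels[:i] + labels[i + 1:])
        correct += votes.most_common(1)[0][0] == y
    return correct / len(rows)


def forward_selection(rows, labels, candidates):
    """Wrapper: greedily add the feature that most improves accuracy; stop when none does."""
    selected = []
    best = loo_accuracy(rows, labels, selected)  # baseline with no features
    while True:
        scored = [(loo_accuracy(rows, labels, selected + [f]), f)
                  for f in candidates if f not in selected]
        if not scored:
            break
        score, feature = max(scored)
        if score <= best:
            break
        selected.append(feature)
        best = score
    return selected, best


def make_rows(ldh, crp, label, count):
    """Build `count` synthetic patients; `noise` alternates and carries no signal."""
    return [({"ldh_high": ldh, "crp_high": crp, "noise": i % 2}, label)
            for i in range(count)]


# Synthetic data: the outcome is positive only when both markers are high.
data = (make_rows(1, 1, 1, 6) + make_rows(1, 0, 0, 2)
        + make_rows(0, 1, 0, 2) + make_rows(0, 0, 0, 6))
rows = [r for r, _ in data]
labels = [y for _, y in data]
names = ["ldh_high", "crp_high", "noise"]

gains = {f: information_gain([r[f] for r in rows], labels) for f in names}
selected, accuracy = forward_selection(rows, labels, names)
print(gains)
print(selected, accuracy)
```

On this toy data the filter assigns the uninformative `noise` feature a gain of essentially zero, and the wrapper keeps only the two informative features. The wrapper retrains and re-evaluates the classifier for every candidate subset, which illustrates why such methods are costlier than filters.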
Fig. 4.
Frequently reported features for predicting COVID-19 diagnosis.
Fig. 5.
Frequently reported features for predicting mortality and severe COVID-19.
Predictors of mortality and severe COVID-19 that were reported repeatedly were older age; male sex; comorbidities, including CKD; decreased calcium, albumin, red blood cells, and oxygen saturation; lymphocytopenia; and increased BUN, creatinine, LDH, CRP, D-dimer, respiratory rate, neutrophil count, IL-6, procalcitonin, bilirubin, ferritin, AST, cTnI, and cTnT.
Predictors of COVID-19 diagnosis that were reported repeatedly were older age, male sex, fever, decreased calcium, hypokalemia, eosinophil, basophil, hemoglobin, WBC, platelets, lymphocytopenia, increased creatinine, LDH, CRP, CRP + LDH + ferritin, CD3, AST, leukocytes, neutrophil count, INR, and monocytes percentage.
These predictors have been shown to be closely associated with COVID-19 screening results and with severe conditions and mortality [105–112].
However, it is worth mentioning that a large number of these parameters often arise in other viral and bacterial infections, which occasionally makes it challenging to distinguish COVID-19 patients from those with other infectious diseases. This warrants further investigation.
Collecting a diverse representative data set is a principal challenge in building an ML model [21]. The lack of large-scale clinical data might negatively affect ongoing efforts to contain the novel virus. Overcoming data limitations necessitates a careful balance between data privacy and public health, as well as rigorous human–AI interaction [12]. Researchers worldwide are encouraged to release de-identified patient data to aid in data mining and ML efforts against COVID-19 [46].
Researchers have published a vast number of studies on COVID-19 in an attempt to understand and contain the disease. To rapidly survey the literature on the novel disease, Doanvo et al. [113] employed ML approaches to identify research differences between COVID-19 and non-COVID-19 coronaviruses. They aimed to recognize which areas the studies have focused on and which areas warrant further exploration. The results indicate an under-representation of laboratory-driven COVID-19 research. In particular, only a limited number of studies have focused on the basic microbiology of COVID-19, including its pathogenesis and transmission, compared with the distribution of previous research on other coronaviruses. COVID-19 publications have primarily focused on clinical care, public health, reporting, and testing. This suggests the need to support laboratory-driven research, including that on genetic and biomolecular topics.
6. Conclusion
In this article, we surveyed existing ML methods in the context of COVID-19. We focused on two applications that have gained much attention: the diagnosis of COVID-19 and the prediction of severity and mortality risks using readily available clinical and laboratory data. A few primary points have been highlighted by this study. First, the majority of ML algorithms used for these two applications fall under the category of supervised learning algorithms, as they are simple and easy to understand. Much of the relevant work has been experimental, and the models developed have not been implemented in real-world applications. It is worth mentioning that it is challenging at this stage to determine the best models for COVID-19 screening. Further research is required to address this point. More importantly, there is a need to create a benchmark data set for this purpose. Second, the diagnostic and prognostic features highlighted by the ML models were consistent with findings reported in the existing medical literature; this demonstrates the meaningfulness of using ML algorithms to comprehensively analyze patient data, recognize subtle patterns, and support fact interrogation and the decision-making process. Third, a clear limitation of existing studies is the use of data sets that are imbalanced and suffer from selection bias. Although ML algorithms have been applied to a wide range of data sets in different countries, imbalanced data sets were common in almost all of the studies. This indicates a need to investigate techniques for handling this issue and to re-evaluate the performance of state-of-the-art ML algorithms. In addition, the literature demonstrated the potential of integrating different types of data, such as demographic, symptom, and clinical information. However, integrating various data sets of different types (structured and unstructured) for the prediction of COVID-19 would be worth investigating further.
Declaration of competing interest
The authors declare no conflicts of interest.
Footnotes
Supplementary data related to this article can be found at https://doi.org/10.1016/j.imu.2021.100564.
References
- 1. Common human coronaviruses. 2020. https://www.cdc.gov/coronavirus/general-information.html
- 2. Paules C.I., Marston H.D., Fauci A.S. Coronavirus infections - more than just the common cold. J Am Med Assoc. 2020;323:707–708. doi: 10.1001/jama.2020.0757.
- 3. Zhong N., Zheng B., Li Y., Poon L., Xie Z., Chan K., Li P., Tan S., Chang Q., Xie J., et al. Epidemiology and cause of severe acute respiratory syndrome (SARS) in Guangdong, People's Republic of China, in February, 2003. Lancet. 2003;362:1353–1358. doi: 10.1016/S0140-6736(03)14630-2.
- 4. Assiri A., McGeer A., Perl T.M., Price C.S., Al Rabeeah A.A., Cummings D.A., Alabdullatif Z.N., Assad M., Almulhim A., Makhdoom H., et al. Hospital outbreak of Middle East respiratory syndrome coronavirus. N Engl J Med. 2013;369:407–416. doi: 10.1056/NEJMoa1306742.
- 5. Gandhi R.T., Lynch J.B., del Rio C. Mild or moderate COVID-19. N Engl J Med. 2020;383:1757–1766. doi: 10.1056/NEJMcp2009249.
- 6. WHO coronavirus disease (COVID-19) dashboard. 2021. https://covid19.who.int/
- 7. Dargan S., Kumar M., Ayyagari M.R., Kumar G. A survey of deep learning and its applications: a new paradigm to machine learning. Arch Comput Methods Eng. 2020;27:1071–1092.
- 8. Shishvan O.R., Zois D., Soyata T. Machine intelligence in healthcare and medical cyber physical systems: a survey. IEEE Access. 2018;6:46419–46494.
- 9. Ascent of machine learning in medicine. Nat Mater. 2019;18:407. doi: 10.1038/s41563-019-0360-1.
- 10. Agbehadji I.E., Awuzie B.O., Ngowi A.B., Millham R.C. Review of big data analytics, artificial intelligence and nature-inspired computing models towards accurate detection of COVID-19 pandemic cases and contact tracing. Int J Environ Res Publ Health. 2020;17:5330. doi: 10.3390/ijerph17155330.
- 11. Bullock J., Luccioni A., Pham K.H., Lam C.S.N., Luengo-Oroz M. Mapping the landscape of artificial intelligence applications against COVID-19. arXiv:2003.11336 [cs]. 2020. doi: 10.1613/jair.1.12162.
- 12. Naudé W. Artificial intelligence vs COVID-19: limitations, constraints and pitfalls. AI Soc. 2020:1. doi: 10.1007/s00146-020-00978-0.
- 13. Albahri A.S., Hamid R.A., Alwan J.K., Al-qays Z., Zaidan A.A., Zaidan B.B., Albahri A.O.S., AlAmoodi A.H., Khlaf J.M., Almahdi E.M., Thabet E., Hadi S.M., Mohammed K.I., Alsalem M.A., Al-Obaidi J.R., Madhloom H. Role of biological data mining and machine learning techniques in detecting and diagnosing the novel coronavirus (COVID-19): a systematic review. J Med Syst. 2020;44. doi: 10.1007/s10916-020-01582-x.
- 14. Swapnarekha H., Behera H.S., Nayak J., Naik B. Role of intelligent computing in COVID-19 prognosis: a state-of-the-art review. Chaos, Solit Fractals. 2020;138:109947. doi: 10.1016/j.chaos.2020.109947.
- 15. Lalmuanawma S., Hussain J., Chhakchhuak L. Applications of machine learning and artificial intelligence for COVID-19 (SARS-CoV-2) pandemic: a review. Chaos, Solit Fractals. 2020;139:110059. doi: 10.1016/j.chaos.2020.110059.
- 16. Tayarani-N M.-H. Applications of artificial intelligence in battling against COVID-19: a literature review. Chaos, Solit Fractals. 2020:110338. doi: 10.1016/j.chaos.2020.110338.
- 17. Wu J., Wang J., Nicholas S., Maitland E., Fan Q. Application of big data technology for COVID-19 prevention and control in China: lessons and recommendations. J Med Internet Res. 2020;22. doi: 10.2196/21980.
- 18. Frank E., Hall M.A., Witten I.H. The WEKA workbench. Morgan Kaufmann; 2016.
- 19. Cabitza F., Campagner A., Ferrari D., Di Resta C., Ceriotti D., Sabetta E., Colombini A., De Vecchi E., Banfi G., Locatelli M., et al. Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests. medRxiv; 2020.
- 20. Gao Y., Cai G.Y., Fang W., Li H.Y., Wang S.Y., Chen L., Yu Y., Liu D., Xu S., Cui P.F., Zeng S.Q., Feng X.X., Yu R.D., Wang Y., Yuan Y., Jiao X.F., Chi J.H., Liu J.H., Li R.Y., Zheng X., Song C.Y., Jin N., Gong W.J., Liu X.Y., Huang L., Tian X., Li L., Xing H., Ma D., Li C.R., Ye F., Gao Q.L. Machine learning based early warning system enables accurate mortality risk prediction for COVID-19. Nat Commun. 2020;11:5033. doi: 10.1038/s41467-020-18684-2.
- 21. Rajkomar A., Dean J., Kohane I. Machine learning in medicine. N Engl J Med. 2019;380:1347–1358. doi: 10.1056/NEJMra1814259.
- 22. Chen J.H., Asch S.M. Machine learning and prediction in medicine - beyond the peak of inflated expectations. N Engl J Med. 2017;376:2507. doi: 10.1056/NEJMp1702071.
- 23. Cortes C., Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–297.
- 24. Quinlan J.R. Simplifying decision trees. Int J Man Mach Stud. 1987;27:221–234.
- 25. Han J., Kamber M. Data mining: concepts and techniques. third ed. The Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann; 2011.
- 26. Ho T.K. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell. 1998;20:832–844.
- 27. Freund Y., Schapire R. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997;55:119–139.
- 28. Altman N.S. An introduction to kernel and nearest-neighbor nonparametric regression. Am Statistician. 1992;46:175–185.
- 29. Friedman J.H. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29:1189–1232.
- 30. Geurts P., Ernst D., Wehenkel L. Extremely randomized trees. Mach Learn. 2006;63:3–42.
- 31. Natekin A., Knoll A. Gradient boosting machines, a tutorial. Front Neurorob. 2013;7. doi: 10.3389/fnbot.2013.00021.
- 32. Geurts P., Louppe G. Learning to rank with extremely randomized trees. Volume 14 of Proceedings of Machine Learning Research. PMLR; Haifa, Israel: 2011. pp. 49–61.
- 33. Barboza F., Kimura H., Altman E. Machine learning models and bankruptcy prediction. Expert Syst Appl. 2017;83:405–417.
- 34. Saqlain M., Jargalsaikhan B., Lee J.Y. A voting ensemble classifier for wafer map defect patterns identification in semiconductor manufacturing. IEEE Trans Semicond Manuf. 2019;32:171–182.
- 35. Zhu L., Yu F.R., Wang Y., Ning B., Tang T. Big data analytics in intelligent transportation systems: a survey. IEEE Trans Intell Transport Syst. 2019;20:383–398.
- 36. Samin H., Azim T. Knowledge based recommender system for academia using machine learning: a case study on higher education landscape of Pakistan. IEEE Access. 2019;7:67081–67093.
- 37. Collins G.S., Moons K.G. Reporting of artificial intelligence prediction models. Lancet. 2019;393:1577–1579. doi: 10.1016/S0140-6736(19)30037-6.
- 38. Corman V.M., Landt O., Kaiser M., Molenkamp R., Meijer A., Chu D.K., Bleicker T., Brünink S., Schneider J., Schmidt M.L., Mulders D.G., Haagmans B.L., van der Veer B., van den Brink S., Wijsman L., Goderski G., Romette J.-L., Ellis J., Zambon M., Peiris M., Goossens H., Reusken C., Koopmans M.P., Drosten C. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Euro Surveill. 2020;25. doi: 10.2807/1560-7917.ES.2020.25.3.2000045.
- 39. Surkova E., Nikolayevskyy V., Drobniewski F. False-positive COVID-19 results: hidden problems and costs. Lancet Respir Med. 2020;8(12):1167–1168. doi: 10.1016/S2213-2600(20)30453-7.
- 40. Rezaeijo S.M., Ghorvei M., Alaei M. A machine learning method based on lesion segmentation for quantitative analysis of CT radiomics to detect COVID-19. 2020 6th Iranian conference on signal processing and intelligent systems (ICSPIS). IEEE; 2020. pp. 1–5.
- 41. Hao W., Li M. Clinical diagnostic value of CT imaging in COVID-19 with multiple negative RT-PCR testing. Travel Medicine and Infectious Disease. 2020.
- 42. Rezaeijo M., Abedi-Firouzjah R., Ghorvei M., Sarnameh S. Screening of COVID-19 based on the extracted radiomics features from chest CT images. J X Ray Sci Technol. 2021;29(2):229–243. doi: 10.3233/XST-200831.
- 43. Shaverdian N., Shepherd A.F., Rimner A., Wu A.J., Simone C.B., et al. Need for caution in the diagnosis of radiation pneumonitis during the COVID-19 pandemic. Advances in Radiation Oncology. 2020;5(4):617–620. doi: 10.1016/j.adro.2020.04.015.
- 44. ACR recommendations for the use of chest radiography and computed tomography (CT) for suspected COVID-19 infection. 2020. https://www.acr.org/Advocacy-and-Economics/ACR-Position-Statements/Recommendations-for-Chest-Radiography-and-CT-for-Suspected-COVID19-Infection
- 45. Yang H.S., Hou Y., Vasovic L.V., Steel P., Chadburn A., Racine-Brzostek S.E., et al. Routine laboratory blood tests predict SARS-CoV-2 infection using machine learning. Clin Chem. 2020;66(11):1396–1404. doi: 10.1093/clinchem/hvaa200.
- 46. Li W.T., Ma J., Shende N., Castaneda G., Chakladar J., Tsai J.C., Apostol L., Honda C.O., Xu J., Wong L.M., Zhang T., Lee A., Gnanasekar A., Honda T.K., Kuo S.Z., Yu M.A., Chang E.Y., Rajasekaran M.R., Ongkeko W.M. Using machine learning of clinical data to diagnose COVID-19: a systematic review and meta-analysis. BMC Med Inf Decis Making. 2020;20:247. doi: 10.1186/s12911-020-01266-z.
- 47. West C.P., Montori V.M., Sampathkumar P. COVID-19 testing: the threat of false-negative results. Mayo Clinic Proceedings. vol. 95. Elsevier; 2020. pp. 1127–1129.
- 48. Bayat V., Phelps S., Ryono R., Lee C., Parekh H., Mewton J., et al. A severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) prediction model from standard laboratory tests. Clin Infect Dis. 2020:ciaa1175. doi: 10.1093/cid/ciaa1175.
- 49. Kukar M., Gunčar G., Vovko T., Podnar S., Černelč P., Brvar M., Zalaznik M., Notar M., Moškon S., Notar M. COVID-19 diagnosis by routine blood tests using machine learning. 2020. arXiv preprint arXiv:2006.
- 50. Wu J., Zhang P., Zhang L., Meng W., Li J., Tong C., Li Y., Cai J., Yang Z., Zhu J., Zhao M., Huang H., Xie X., Li S. Rapid and accurate identification of COVID-19 infection through machine learning based on clinical available blood test results. medRxiv. 2020. doi: 10.1101/2020.04.02.20051136.
- 51. Brinati D., Campagner A., Ferrari D., Locatelli M., Banfi G., Cabitza F. Detection of COVID-19 infection from routine blood exams with machine learning: a feasibility study. J Med Syst. 2020;44:135. doi: 10.1007/s10916-020-01597-4.
- 52. Tschoellitsch T., Dünser M., Böck C., Schwarzbauer K., Meier J. Machine learning prediction of SARS-CoV-2 polymerase chain reaction results with routine blood tests. Lab Med. 2020;52(2):146–149. doi: 10.1093/labmed/lmaa111.
- 53. Joshi R.P., Pejaver V., Hammarlund N.E., Sung H., Lee S.K., Furmanchuk A., Lee H.-Y., Scott G., Gombar S., Shah N., Shen S., Nassiri A., Schneider D., Ahmad F.S., Liebovitz D., Kho A., Mooney S., Pinsky B.A., Banaei N. A predictive tool for identification of SARS-CoV-2 PCR-negative emergency department patients using routine test results. J Clin Virol. 2020;129:104502. doi: 10.1016/j.jcv.2020.104502.
- 54. Shoer S., Karady T., Keshet A., Shilo S., Rossman H., Gavrieli A., et al. A prediction model to prioritize individuals for SARS-CoV-2 test built from national symptom surveys. Med. 2020;2(2):196–208. doi: 10.1016/j.medj.2020.10.002.
- 55. Tordjman M., Mekki A., Mali R.D., Saab I., Chassagnon G., Guillo E., Burns R., Eshagh D., Beaune S., Madelin G., et al. Pre-test probability for SARS-CoV-2-related infection score: the PARIS score. medRxiv; 2020.
- 56.Soltan A.A., Kouchaki S., Zhu T., Kiyasseh D., Taylor T., Hussain Z.B., Peto T., Brent A.J., Eyre D.W., Clifton D. medRxiv; 2020. Artificial intelligence driven assessment of routinely collected healthcare data is an effective screening test for covid-19 in patients presenting to hospital. [Google Scholar]
- 57.de Moraes Batista A.F., Miraglia J.L., Donato T.H.R., Chiavegatto Filho A.D.P. medRxiv; 2020. Covid-19 diagnosis prediction in emergency care patients: a machine learning approach. [Google Scholar]
- 58.Alakus T.B., Turkoglu I. Comparison of deep learning approaches to predict covid-19 infection. Chaos, Solit Fractals. 2020;140:110120. doi: 10.1016/j.chaos.2020.110120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Goodman-Meza D., Rudas A., Chiang J.N., Adamson P.C., Ebinger J., Sun N., Botting P., Fulcher J.A., Saab F.G., Brook R., Eskin E., An U., Kordi M., Jew B., Balliu B., Chen Z., Hill B.L., Rahmani E., Halperin E., Manuel V. A machine learning algorithm to increase COVID-19 inpatient diagnostic capacity. PloS One. 2020;15 doi: 10.1371/journal.pone.0239474. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.AlJame M., Ahmad I., Imtiaz A., Mohammed A. Ensemble learning model for diagnosing COVID-19 from routine blood tests. Inform Med Unlocked. 2020;21:100449. doi: 10.1016/j.imu.2020.100449. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Feng C., Huang Z., Wang L., Chen X., Zhai Y., Zhu F., Chen H., Wang Y., Su X., Huang S., et al. 2020. A novel triage tool of artificial intelligence assisted diagnosis aid system for suspected covid-19 pneumonia in fever clinics. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Soares F., Villavicencio A., Anzanello M.J., Fogliatto F.S., Idiart M., Stevenson M. A novel high specificity covid-19 screening method based on simple blood exams and artificial intelligence. medRxiv. 2020 doi: 10.1101/2020.04.10.20061036. [DOI] [Google Scholar]
- 63.Frost F., Bradley P., Tharmaratnam K., Wootton D.G., et al. medRxiv; 2020. The utility of established prognostic scores in covid-19 hospital admissions: a multicentre prospective evaluation of curb-65, news2, and qsofa. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Satici C., Demirkol M.A., Sargin Altunok E., Gursoy B., Alkan M., Kamat S., Demirok B., Surmeli C.D., Calik M., Cavus Z., Esatoglu S.N. Performance of pneumonia severity index and curb-65 in predicting 30-day mortality in patients with covid-19. Int J Infect Dis. 2020;98:84–89. doi: 10.1016/j.ijid.2020.06.038. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Pollack M.M. Severity of illness confusion. Pediatric Critical Care Medicine: a journal of the Society of Critical Care Medicine and the World Federation of Pediatric Intensive and Critical Care Societies. 2016;17:583. doi: 10.1097/PCC.0000000000000732. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Vaid A., Somani S., Russak A.J., De Freitas J.K., Chaudhry F.F., Paranjpe I., et al. Machine learning to predict mortality and critical events in covid-19 positive New York city patients: a cohort study. J Med Internet Res. 2020;22(11):e24018. doi: 10.2196/24018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Yan L., Zhang H.-T., Xiao Y., Wang M., Sun C., Liang J., Li S., Zhang M., Guo Y., Xiao Y., et al. MedRxiv; 2020. Prediction of criticality in patients with severe covid-19 infection using three clinical features: a machine learning-based prognostic model with clinical data in wuhan. [Google Scholar]
- 68.Yan L., Zhang H.-T., Goncalves J., Xiao Y., Wang M., Guo Y., Sun C., Tang X., Jin L., Zhang M., et al. MedRxiv; 2020. A machine learning-based model for survival prediction in patients with severe covid-19 infection. [Google Scholar]
- 69.Wang K., Zuo P., Liu Y., Zhang M., Zhao X., Xie S., Zhang H., Chen X., Liu C. Clinical Infectious Diseases; 2020. Clinical and laboratory predictors of in-hospital mortality in patients with coronavirus disease-2019: a cohort study in wuhan, China. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Rechtman E., Curtin P., Navarro E., Nirenberg S., Horton M.K. Vital signs assessed in initial clinical encounters predict covid-19 mortality in an nyc hospital system. Sci Rep. 2020;10:1–6. doi: 10.1038/s41598-020-78392-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Bertsimas D., Lukin G., Mingardi L., Nohadani O., Orfanoudaki A., Stellato B., Wiberg H., Gonzalez-Garcia S., Parra-Calderon C.L., Robinson K., et al. Covid-19 mortality risk assessment: an international multi-center study. PloS One. 2020;15 doi: 10.1371/journal.pone.0243262. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Guan X., Zhang B., Fu M., Li M., Yuan X., Zhu Y., Peng J., Guo H., Lu Y. Clinical and inflammatory features based machine learning model for fatal risk prediction of hospitalized covid-19 patients: results from a retrospective cohort study. Ann Med. 2021;53:257–266. doi: 10.1080/07853890.2020.1868564. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Booth A.L., Abels E., McCaffrey P. Development of a prognostic model for mortality in covid-19 infection using machine learning. Mod Pathol. 2020:1–10. doi: 10.1038/s41379-020-00700-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Sun L., Song F., Shi N., Liu F., Li S., Li P., Zhang W., Jiang X., Zhang Y., Sun L., Chen X., Shi Y. Combination of four clinical indicators predicts the severe/critical symptom of patients infected covid-19. J Clin Virol. 2020;128:104431. doi: 10.1016/j.jcv.2020.104431. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Yao H., Zhang N., Zhang R., Duan M., Xie T., Pan J., Peng E., Huang J., Zhang Y., Xu X., et al. Severity detection for the coronavirus disease 2019 (covid-19) patients using a machine learning model based on the blood and urine tests. Frontiers in cell and developmental biology. 2020;8:683. doi: 10.3389/fcell.2020.00683. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Zhao C., Bai Y., Wang C., Zhong Y., Lu N., Tian L., Cai F., Jin R. Risk factors related to the severity of covid-19 in wuhan. Int J Med Sci. 2021;18:120–127. doi: 10.7150/ijms.47193. URL: https://www.medsci.org/v18p0120.htm [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Hu C., Liu Z., Jiang Y., Shi O., Zhang X., Xu K., et al. Early prediction of mortality risk among patients with severe COVID-19, using machine learning. Int J Epidemiol. 2020;49(6):1918–1929. doi: 10.1093/ije/dyaa171. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.Zhao Z., Chen A., Hou W., Graham J.M., Li H., Richman P.S., Thode H.C., Singer A.J., Duong T.Q. Prediction model and risk scores of icu admission and mortality in covid-19. PloS One. 2020;15 doi: 10.1371/journal.pone.0236618. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Huang H., Cai S., Li Y., Li Y., Fan Y., Li L., Lei C., Tang X., Hu F., Li F., Deng X. Prognostic factors for covid-19 pneumonia progression to severe symptoms based on earlier clinical features: a retrospective analysis. Front Med. 2020;7:643. doi: 10.3389/fmed.2020.557453. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80.Xie J., Hungerford D., Chen H., Abrams S.T., Li S., Wang G., Wang Y., Kang H., Bonnett L., Zheng R., et al. 2020. Development and external validation of a prognostic multivariable model on admission for hospitalized patients with covid-19. [Google Scholar]
- 81.Zhou Y., Yang Z., Guo Y., Geng S., Gao S., Ye S., Hu Y., Wang Y. medRxiv; 2020. A new predictor of disease severity in patients with covid-19 in wuhan, China. [DOI] [Google Scholar]
- 82.Zhu Z., Cai T., Fan L., Lou K., Hua X., Huang Z., Gao G. Clinical value of immune-inflammatory parameters to assess the severity of coronavirus disease 2019. Int J Infect Dis. 2020;95:332–339. doi: 10.1016/j.ijid.2020.04.041. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83.Gong J., Ou J., Qiu X., Jie Y., Chen Y., Yuan L., Cao J., Tan M., Xu W., Zheng F., et al. Clinical infectious diseases; 2020. A tool to early predict severe corona virus disease 2019 (covid-19): a multicenter study using the risk nomogram in wuhan and guangdong, China. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Aloisio E., Chibireva M., Serafini L., Pasqualetti S., Falvella F.S., Dolci A., Panteghini M. Archives of Pathology and Laboratory Medicine; 2020. A comprehensive appraisal of laboratory biochemistry tests as major predictors of COVID-19 severity. [DOI] [PubMed] [Google Scholar]
- 85.Liu Q., Song N.C., Zheng Z.K., Li J.S., Li S.K. Laboratory findings and a combined multifactorial approach to predict death in critically ill patients with covid-19: a retrospective study. Epidemiol Infect. 2020;148:e129. doi: 10.1017/S0950268820001442. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86.Luo M., Liu J., Jiang W., Yue S., Liu H., Wei S. Il-6 and cd8+ t cell counts combined are an early predictor of in-hospital mortality of patients with covid-19. JCI Insight. 2020;5 doi: 10.1172/jci.insight.139024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 87.Han Y., Zhang H., Mu S., Wei W., Jin C., Xue Y., Tong C., Zha Y., Song Z., Gu G. Lactate dehydrogenase, a risk factor of severe covid-19 patients. medRxiv. 2020 doi: 10.1101/2020.03.24.20040162. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88.Li Y., Horowitz M.A., Liu J., Chew A., Lan H., Liu Q., Sha D., Yang C. Individual-level fatality prediction of covid-19 patients using ai methods. Frontiers in Public Health. 2020;8:566. doi: 10.3389/fpubh.2020.587937. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 89.de Terwangne C., Laouni J., Jouffe L., Lechien J.R., Bouillon V., Place S., Capulzini L., Machayekhi S., Ceccarelli A., Saussez S., et al. Predictive accuracy of covid-19 world health organization (who) severity classification and comparison with a bayesian-method-based severity score (epi-score) Pathogens. 2020;9:880. doi: 10.3390/pathogens9110880. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 90.Izquierdo J.L., Ancochea J., Soriano J.B. Clinical characteristics and prognostic factors for intensive care unit admission of patients with covid-19: retrospective study using machine learning and natural language processing. J Med Internet Res. 2020;22 doi: 10.2196/21801. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 91.Liang W., Liang H., Ou L., Chen B., Chen A., Li C., et al. Development and validation of a clinical risk score to predict the occurrence of critical illness in hospitalized patients with covid-19. JAMA Internal Medicine. 2020;180(8):1081–1089. doi: 10.1001/jamainternmed.2020.2033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92.Levy T.J., Richardson S., Coppa K., Barnaby D.P., McGinn T., Becker L.B., Davidson K.W., Cohen S.L., Hirsch J.S., Zanos T. medRxiv; 2020. Development and validation of a survival calculator for hospitalized patients with covid-19. [DOI] [Google Scholar]
- 93.Nemati M., Ansary J., Nemati N. Machine-learning approaches in covid-19 survival analysis and discharge-time likelihood prediction using clinical data. Patterns. 2020;1:100074. doi: 10.1016/j.patter.2020.100074. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 94.Li S., Lin Y., Zhu T., Fan M., Xu S., Qiu W., Chen C., Li L., Wang Y., Yan J., et al. Development and external evaluation of predictions models for mortality of covid-19 patients using machine learning method. Neural Comput Appl. 2020:1–10. doi: 10.1007/s00521-020-05592-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 95.Schwab P., Schütte A.D., Dietz B., Bauer S. 2020. predcovid-19: a systematic study of clinical predictive models for coronavirus disease 2019. arXiv preprint arXiv:2005.08302. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 96.Bai T., Tu S., Wei Y., Xiao L., Jin Y., Zhang L., Song J., Liu W., Zhu Q., Yang L., et al. 2020. Clinical and laboratory factors predicting the prognosis of patients with covid-19: an analysis of 127 patients in wuhan, China. China (2/26/2020) [Google Scholar]
- 97.Das A.K., Mishra S., Gopalan S.S. Predicting covid-19 community mortality risk using machine learning and development of an online prognostic tool. PeerJ. 2020;8 doi: 10.7717/peerj.10083. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 98.Eltoukhy A.E.E., Shaban I.A., Chan F.T.S., Abdel-Aal M.A.M. Data analytics for predicting covid-19 cases in top affected countries: observations and recommendations. Int J Environ Res Publ Health. 2020;17:7080. doi: 10.3390/ijerph17197080. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 99.Hu Z., Song C., Xu C., Jin G., Chen Y., Xu X., Ma H., Chen W., Lin Y., Zheng Y., et al. Clinical characteristics of 24 asymptomatic infections with covid-19 screened among close contacts in nanjing, China. Sci China Life Sci. 2020;63:706–711. doi: 10.1007/s11427-020-1661-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 100.Wang Y., Liu Y., Liu L., Wang X., Luo N., Li L. Clinical outcomes in 55 patients with severe acute respiratory syndrome coronavirus 2 who were asymptomatic at hospital admission in shenzhen, China. J Infect Dis. 2020;221:1770–1774. doi: 10.1093/infdis/jiaa119. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 101.Cohen P., Blau J. UpToDate [Internet]; 2020. Coronavirus disease 2019 (covid-19): outpatient evaluation and management in adults. [Google Scholar]
- 102.Cai J., Luo J., Wang S., Yang S. Feature selection in machine learning: a new perspective. Neurocomputing. 2018;300:70–79. [Google Scholar]
- 103.Cai J., Luo J., Wang S., Yang S. Feature selection in machine learning: a new perspective. Neurocomputing. 2018;300:70–79. [Google Scholar]
- 104.Jiang X., Coffee M., Bari A., Wang J., Jiang X., Huang J., Shi J., Dai J., Cai J., Zhang T., Wu Z., He G., Huang Y. Towards an artificial intelligence framework for data-driven prediction of coronavirus clinical severity. Comput Mater Continua (CMC) 2020;63:537–551. [Google Scholar]
- 105.Guan W.-j., Ni Z.-y., Hu Y., Liang W.-h., Ou C.-q., He J.-x., Liu L., Shan H., Lei C.-l., Hui D.S., et al. Clinical characteristics of coronavirus disease 2019 in China. N Engl J Med. 2020;382:1708–1720. doi: 10.1056/NEJMoa2002032. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 106.Zhou F., Yu T., Du R., Fan G., Liu Y., Liu Z., et al. Clinical course and risk factors for mortality of adult inpatients with covid-19 in wuhan, China: a retrospective cohort study. The lancet. 2020 doi: 10.1016/S0140-6736(20)30566-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 107.Wang D., Hu B., Hu C., Zhu F., Liu X., Zhang J., Wang B., Xiang H., Cheng Z., Xiong Y., et al. Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus–infected pneumonia in wuhan, China. Jama. 2020;323:1061–1069. doi: 10.1001/jama.2020.1585. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 108.Merad M., Martin J.C. Pathological inflammation in patients with covid-19: a key role for monocytes and macrophages. Nat Rev Immunol. 2020:1–8. doi: 10.1038/s41577-020-0331-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 109.Huang C., Wang Y., Li X., Ren L., Zhao J., Hu Y., Zhang L., Fan G., Xu J., Gu X., et al. Clinical features of patients infected with 2019 novel coronavirus in wuhan, China. The lancet. 2020;395:497–506. doi: 10.1016/S0140-6736(20)30183-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 110.Wu Z., McGoogan J.M. Characteristics of and important lessons from the coronavirus disease 2019 (covid-19) outbreak in China: summary of a report of 72 314 cases from the Chinese center for disease control and prevention. Jama. 2020;323:1239–1242. doi: 10.1001/jama.2020.2648. [DOI] [PubMed] [Google Scholar]
- 111.Ruan Q., Yang K., Wang W., Jiang L., Song J. Clinical predictors of mortality due to covid-19 based on an analysis of data of 150 patients from wuhan, China. Intensive Care Med. 2020;46:846–848. doi: 10.1007/s00134-020-05991-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 112.Ji D., Zhang D., Xu J., Chen Z., Yang T., Zhao P., Chen G., Cheng G., Wang Y., Bi J., et al. Clinical Infectious Diseases; 2020. Prediction for progression risk in patients with covid-19 pneumonia: the call score. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 113.Doanvo A., Qian X., Ramjee D., Piontkivska H., Desai A., Majumder M. Patterns; 2020. Machine learning maps research needs in covid-19 literature; p. 100123. [DOI] [PMC free article] [PubMed] [Google Scholar]