Health Information Science and Systems. 2024 Mar 6;12(1):17. doi: 10.1007/s13755-024-00276-9

Efficient management of pulmonary embolism diagnosis using a two-step interconnected machine learning model based on electronic health records data

Soroor Laffafchi 1, Ahmad Ebrahimi 2, Samira Kafan 3
PMCID: PMC10917730  PMID: 38464464

Abstract

Pulmonary Embolism (PE) is a life-threatening clinical disease with no specific clinical symptoms and Computed Tomography Angiography (CTA) is used for diagnosis. Clinical decision support scoring systems like Wells and rGeneva based on PE risk factors have been developed to estimate the pre-test probability but are underused, leading to continuous overuse of CTA imaging. This diagnostic study aimed to propose a novel approach for efficient management of PE diagnosis using a two-step interconnected machine learning framework directly by analyzing patients' Electronic Health Records data. First, we performed feature importance analysis according to the result of LightGBM superiority for PE prediction, then four state-of-the-art machine learning methods were applied for PE prediction based on the feature importance results, enabling swift and accurate pre-test diagnosis. Throughout the study patients' data from different departments were collected from Sina educational hospital, affiliated with the Tehran University of medical sciences in Iran. Generally, the Ridge classification method obtained the best performance with an F1 score of 0.96. Extensive experimental findings showed the effectiveness and simplicity of this diagnostic process of PE in comparison with the existing scoring systems. The main strength of this approach centered on PE disease management procedures, which would reduce avoidable invasive CTA imaging and be applied as a primary prognosis of PE, hence assisting the healthcare system, clinicians, and patients by reducing costs and promoting treatment quality and patient satisfaction.

Keywords: Electronic health records data, Machine learning, Pulmonary embolism, Computed tomography angiography, Applications in healthcare system

Introduction

Pulmonary embolism (PE) is an emergency and life-threatening clinical problem [1, 2] caused by emboli lodging in the pulmonary arteries, resulting in poor oxygen exchange to vital organs [3]. PE is the second most common cause of sudden unexpected death, accounting for 15% of nosocomial deaths [4]. It is a highly fatal vascular disease, with approximately 100,000 deaths per year [5]. Studies show that a quarter of patients with PE die without warning [5]; the mortality rate is up to 30%, whereas with timely treatment it decreases to 8% [6]. Studies also demonstrate a higher risk of PE in COVID-19 patients [7], with a reported frequency of 16.5% in patients with severe COVID-19 infection, rising to 24.7% in cases hospitalized in the Intensive Care Unit (ICU) [8]. Recent studies have documented an increased risk of PE in COVID-19 patients, in one-third of critically ill cases requiring ICU care [9]. PE has been called “the great masquerader” [10] because its symptoms resemble those of other diseases and there are no specific clinical symptoms or laboratory diagnostic tests [6]. As a result, PE is among the most frequently delayed and missed diagnoses, with approximately 180,000 deaths per year in the United States [4]. This evidence indicates the importance of rapid and accurate identification of patients with PE, because prompt recognition and swift initiation of therapy substantially reduce mortality [11].

The clinical diagnosis of PE is complicated, time-consuming, and challenging [3, 7]. Physicians use scoring systems such as Wells and revised Geneva (rGeneva) to support diagnosis, but these are underused because of their low efficiency. Computed Tomography Angiography (CTA) is the gold standard for PE diagnosis [1, 7, 12]. In recent years, the number of imaging tests performed for diagnosis has increased significantly [1, 8]. However, reports indicate that only 10% or fewer of the imaging results are positive, and one-third of the examinations are avoidable, which imposes extra costs on the patient and the healthcare system [13]. Imaging carries risks due to intravenous contrast material, radiation exposure, contrast-induced nephropathy, and allergic reactions [1]. Additionally, hospitals are under considerable pressure to provide 24/7 services so that CTA can be performed quickly and the results delivered to the doctors [7].

To manage the use of CTA, several systematic risk scores, largely rule-based, have been developed for Clinical Decision Support (CDS) [1]. However, evaluations of these scoring systems have shown no overall improvement in the optimal use of CTA [1], and the use of CTA examinations has continued to increase [14, 15]. Therefore, modeling and predicting PE before performing CTA imaging can be crucial for improving the efficiency of the healthcare system, reducing unnecessary imaging, and better managing healthcare resources. To this end, Machine Learning (ML), a subset of artificial intelligence, plays a vital role in improving the quality of healthcare and medical diagnosis [16, 17]. In recent decades, researchers have rushed to develop ML algorithms to support clinicians. Learning from real-world health data has proven effective in many healthcare applications, resulting in improved quality of care [18]. Researchers and companies are also working assiduously on this issue; Google, for example, cooperates with healthcare systems to build prediction models from big data to warn of high-risk conditions such as cancer, sepsis, or heart failure [19]. Moreover, several investigations have proposed Computer-Aided Diagnosis (CAD) systems [20] for Alzheimer's disease [21–24], cancer [25, 26], pneumonia [8, 11, 27, 28], COVID-19 [7, 29–33], renal failure [34, 35], etc. Different ML methods have been applied to healthcare problems, such as neural networks and deep learning, natural language processing, rule-based expert systems, and image processing [6, 18, 30].

Several studies have been conducted on PE diagnosis. By comprehensive literature review, the articles on PE disease are divided into two parts: articles based on CT angiography images and articles based on clinical data.

For example, deep convolutional neural network models were applied to detect emboli from images without the help of radiologists; the system obtained a per-embolus sensitivity of 68% [3]. It was also found that a deep learning model can predict PE automatically on volumetric CTA scans [4]. Another study examined the effect of DL-assisted detection of PE in CTAs on temporal metrics of patient management; the results support the assumption that DL-based detection of PE has good diagnostic accuracy, with a sensitivity of 79% [36]. Deep learning was also applied to PE detection on computed tomography pulmonary angiograms (CTPA), providing an approach for identifying PE on CTPA with a sensitivity of 0.88 and a specificity of 0.86; the authors claim that a deep learning system can serve as a second interpreter [37]. In another study, researchers proposed a knowledge-based hybrid learning algorithm to classify PE using the modified criteria of the Prospective Investigation of PE Diagnosis, compared their method with other ML algorithms such as Bayes and decision tree, and reported successful results [38]. Banerjee et al. developed an ML model to generate an accurate patient-specific risk score for PE diagnosis using ElasticNet and a neural model [1]. In another study, ML was used to predict 30-day mortality in patients diagnosed with acute PE, and XGBoost was the best-performing model with an AUC of 0.92 [39]. In a further study, three machine learning models (gradient boosting, deep learning, and logistic regression) were developed and compared for predicting in-hospital adverse outcomes of PE. Of the three methods, gradient boosting achieved the highest AUC, and the authors claimed that their ML-based model, built on electronic medical records including echocardiographic parameters and laboratory data, gives better prediction of in-hospital adverse clinical outcomes in patients with PE [40]. A two-stage hierarchical ML model was also proposed for VTE risk prediction in patients from multiple departments [2]. As referenced, these works performed functional tasks in different frameworks. All the methods above have limitations. The research on PE diagnosis based on CTA imaging required a huge amount of image data to train the model, so imaging, including unnecessary imaging, still had to be performed, which imposed extra pressure on health systems and patients. On the other hand, the conducted research applied only a restricted set of features, limited to the criteria of current scoring systems. The results of previous studies may additionally be affected by bias due to the absence of further features, and no attempt has been made to extend the analysis to a comprehensive set of risk factors. All this evidence shows the necessity of conducting extensive research based on medical data stored in Electronic Health Records (EHR) and evaluating other risk factors [2, 30, 31, 33]. The survey of previous literature shows the potential of this machine learning approach for swift recognition and better management of hospital resources for PE diagnosis by reducing unneeded CTA imaging.

The purpose of this study is to use EHR data, including demographics, medications, laboratory test results, comorbidities, and past medical history to provide an accurate prediction of PE to manage PE diagnosis protocol. To these ends, in this paper, we propose an interconnected machine learning model based on feature analysis and selection according to the Light Gradient Boosting Machine (LightGBM) feature importance analysis due to its superiority, and then applied different ML methods based on feature importance results for PE prediction. As a result, it can better inform patients’ needs for imaging to prevent unnecessary CTA for PE diagnosis and can assist the healthcare system, clinicians, payers, patients, and regulatory bodies by reducing costs and decreasing the complicatedness of diagnosis. The experimental results illustrate the superiority of our interconnected ML model. The main contributions of our work are summarized as follows:

  • Developing a real dataset composed of EHR data of suspicious PE patients for classification experiments.

  • Evaluating more risk factors for broad analysis for PE diagnosis by conducting a large number of studies and extracting important features by feature importance machine learning analysis.

  • Proposing an interconnected machine learning model to facilitate and accelerate the PE diagnosis protocol based on EHR data, avoiding unnecessary CTA imaging and helping clinicians evaluate the risk of PE, improve therapeutic decisions, and prescribe supportive measures for rapid and timely diagnosis.

The remainder of this article is organized as follows. Section "Material and methods" explains the stages of the proposed model in detail and describes its overall framework. Section "Results" presents and analyzes the results of our interconnected models. The discussion and conclusion follow in Sections "Discussion" and "Conclusion".

Material and methods

Ethics approval

This research was approved by Sina educational hospital, affiliated with the Tehran University of Medical Sciences in Iran. All consultations were performed by pulmonologists and radiologists before the test, and informed consent was obtained from all patients.

Cohort characteristic

The cohort was defined as follows; all inclusion conditions had to be met. Inclusion criteria: (a) inpatients and outpatients with suspected PE between March 1, 2019 and March 1, 2021; (b) patients with valid hospital medical records; (c) patients who underwent CTA imaging and whose reports are clear and available; (d) patients for whom laboratory test results are available. Exclusion criteria: (a) lack of patient EHR data; (b) lack of access to laboratory test results or incomplete information; (c) unclear or unspecified CTA imaging results; (d) patients who were unable to undergo angiographic imaging for any reason.

We aggregated the data from the time of admission to the day on which CTA imaging was performed. For some features, however, data had to be collected at a specific time, since only values recorded before or on the day of CTA imaging were relevant. The appropriate time for data collection was specified by pulmonologists. Results of CTA imaging were classified and confirmed by radiologists as one of two class labels (PE absent or PE present).

Risk factors: A real dataset was created to predict PE according to diagnostic strategies. Medical records were extracted, including clinical data, demographic data, laboratory evaluations, patients’ past medical history, medications, and CTA reports. Moreover, CTA image results were reviewed and confirmed by the radiologist. The risk factors used in our modeling and forecasting method were extracted from guidelines and references such as UpToDate, Harrison, and Murray, or were reported as relevant risk factors in the literature, and were approved by pulmonologists.

Data collection

In this work, with the approval of the Department of Pulmonary Medicine of Tehran University of Medical Sciences, the clinical data of 925 patients at Sina educational hospital were analyzed. Sina Hospital has established a comprehensive clinical platform that integrates EHR databases from various departments such as surgery, ICU, COVID, outpatients, etc. EHRs are an essential part of healthcare technology [41] and include age, gender, smoking, dyspnea, fever, used prophylaxis, laboratory test results (D-Dimer, Na, Bun, Platelet, Alb, etc.), and patient's medical history, medications, etc.

Among the available parameters, many features change over time, so we call them dynamic [42, 43]; examples are some laboratory test results, saturation, heart rate, and body temperature (some of them are shown in Fig. 1). For these parameters, we monitored time-stamped data and extracted values at a specific time, since only data from before or on the day of CTA imaging were needed [1, 42]. Other characteristics have static baselines and do not change over time [42, 44]; for instance, age (at the time of CTA imaging), smoking status, gender, and past medical history (as shown in Fig. 2) were aggregated directly from the EHR.
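The extraction rule for dynamic parameters can be illustrated with a minimal sketch. It assumes, purely for illustration, that each dynamic parameter is stored as a list of (date, value) pairs and that the most recent value on or before the CTA day is the one used; the function name, the data layout, and the example readings are hypothetical and are not taken from the study.

```python
from datetime import date

def latest_before(measurements, cta_day):
    """Return the most recent value recorded on or before cta_day, or None."""
    eligible = [(day, value) for day, value in measurements if day <= cta_day]
    if not eligible:
        return None
    # max() on the date picks the latest eligible measurement
    return max(eligible, key=lambda pair: pair[0])[1]

# Hypothetical heart-rate readings for one patient: (date, beats per minute)
heart_rate = [
    (date(2020, 5, 1), 88),
    (date(2020, 5, 3), 104),
    (date(2020, 5, 5), 96),  # recorded after the CTA day, so it is ignored
]
cta_day = date(2020, 5, 4)
hr_feature = latest_before(heart_rate, cta_day)
```

Static parameters such as gender or smoking status need no such filtering and can be read directly from the record.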

Fig. 1. Dynamic parameters

Fig. 2. Static parameters

Extracting some parameters required expert knowledge and interpretation to obtain actionable information or a logical basis for the existing medical prognosis, for example reviewing CTA images to determine the presence or absence of PE, reviewing handwritten notes to retrieve vital signs, or examining how the patient's medical history relates to their current condition [45, 46].

In particular, the study cohort comprised data from 925 suspected PE cases. Because negative cases usually outnumber positive ones [47], stratified random sampling was applied to the negative cases to obtain a balanced dataset; the resulting label distribution was approximately equal (around 49% positive and 51% negative cases).
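A minimal sketch of stratified random sampling of negative cases is given below. The text does not state which variable defined the strata, so stratifying by a hypothetical "department" field, the pool sizes, and the target sample size are all assumptions made only for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_key, n_total, seed=0):
    """Draw about n_total records, keeping each stratum's share proportional."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec[stratum_key]].append(rec)
    sample = []
    for members in strata.values():
        # proportional allocation: stratum share of the pool times n_total
        k = round(n_total * len(members) / len(records))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# Hypothetical pool of negative cases: 600 from "ward" and 300 from "icu"
negatives = [{"id": i, "department": "ward" if i < 600 else "icu"} for i in range(900)]
kept = stratified_sample(negatives, "department", n_total=450)
```

With proportional allocation, the down-sampled negatives keep the same department mix as the original pool, which is the property that distinguishes stratified from simple random sampling.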

The collected dataset consists of structured data, whereas raw EHR data come in a variety of formats such as numbers, text, symbols, and images [48–50]. The CTA images were reviewed by radiologists to extract a binary label, positive or negative (the presence or absence of PE). The corpus of handwritten notes was manually reviewed by the physicians to extract structured data such as vital signs (body temperature, Pulse Rate (PR), Heart Rate (HR), etc.). The high-level architecture of components used in the development is shown in Fig. 3.

Fig. 3. The high-level architecture of components used in the development

Data preprocess

Data pre-processing is a significant and necessary step in any machine learning analysis: it converts raw data into usable formats to facilitate training and testing, which can enhance the performance of machine learning methods [4]. The dataset was created and split randomly into training and test sets (80:20). The 925 CTA imaging reports, each of which included a date, were retrieved and confirmed by radiologists and classified into two class labels (PE absent or present). The training dataset was used for prediction model training, while the test dataset was used for validation and testing [51]. The overall cohort workflow chart is illustrated in Fig. 4.
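The random 80:20 split can be sketched as follows. This is an illustrative stand-in rather than the authors' code: the function name and the seed are hypothetical, and only the cohort size (925) and the split ratio come from the text.

```python
import random

def train_test_split_indices(n_samples, test_percent=20, seed=42):
    """Shuffle sample indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = n_samples * test_percent // 100  # integer arithmetic avoids float rounding
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(925)
```

For 925 patients this gives 740 training and 185 test cases.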

Fig. 4. Cohort workflow

The data pre-processing steps are as follows:

Feature selection

Feature selection is about selecting the most relevant, non-redundant, and appropriate features for use in ML models [52, 53]. Performing feature selection before applying ML algorithms is a crucial step that improves the efficiency of the models and enhances performance [52]. We started with the raw EHR data of patients. Owing to the complexity and diversity of EHR data, many features were aggregated, but some were not useful. First, through a comprehensive survey of references and guidelines such as UpToDate [54], Harrison [55], and Murray [56], and of the literature [1, 9, 10], 60 risk factors were aggregated. The Delphi method was then applied to gather the pulmonologists' insight about the extracted features [57], and 42 relevant features were selected by the experts. The feature selection process is shown in Fig. 5.

Fig. 5. Feature selection process

Also, we performed feature engineering to make the dataset more insightful and prepare data for the machine learning model that involves: the creation of observations of raw data, and the transformation of them into features for use by ML models [53, 58]. We captured all information of patients (until the day of CTA imaging) with a time stamp. We designed a feature engineering pipeline in the observation window that calculates a vector representation of each patient's EHR with a time stamp. All laboratory test results, medications, vital signs, past medical history, and demographic data, were included in the data engineering pipeline.

In this paper, encoding and one-hot encoding techniques were applied to categorical data, converting each unique category into its own column [59]. The patients' medical history covers various conditions such as heart, kidney, internal, and neural diseases; we used encoding and one-hot encoding to represent these as numerical or binary values. We also standardized each feature separately, and the scaled values were used in both the training and test sets [1, 60].
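As a minimal sketch of one-hot encoding, the snippet below turns a categorical column into one binary column per unique category. The category names are hypothetical examples of past medical history values, not the coding scheme used in the study.

```python
def one_hot(values):
    """Return the sorted category list and one binary row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical past-medical-history values for four patients
history = ["heart", "kidney", "none", "heart"]
categories, encoded = one_hot(history)
```

Each row contains exactly one 1, in the column of the patient's category, so no artificial ordering is imposed on the categories.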

Data transformation

Transformation methods should be implemented in the following steps [61]:

(1) Data enrichment: Data enrichment is the process of enhancing existing information by supplementing missing parameters. It was the most time-consuming processing step and a key factor in building high-performing ML models. In this research, the data enrichment process was implemented as follows:

First, we combined multiple data sources to create a more comprehensive and precise dataset [50], appending data from the different sources. For example, we gathered laboratory results, medications, radiology tests, demographic information, vital signs, and past medical history to obtain a more informative dataset. Next, data segmentation (classification) was performed to categorize the entities; we separated and categorized the labels (positive or negative PE). We then focused on attribute extraction, that is, deriving attributes that were not in the raw dataset but can be derived from another field and that describe the main characteristics of the raw data [62]. For instance, diabetes was derived from the fasting blood sugar test, and fever was derived from the patient's body temperature. Finally, entity extraction was performed: unstructured data such as CTA images were converted into meaningful structured data (PE positive or negative) using expert knowledge [49].

(2) Imputation: We divided the features into four groups by missing rate: no missing values (‘‘0%’’), more than 0% but less than 15% missing (‘‘0–15%’’), 15–50% missing (‘‘15–50%’’), and more than 50% missing (‘‘> 50%’’). Features with a missing rate greater than 50% were removed. For features in the ‘‘15–50%’’ group, we analyzed the distribution of values of each feature and imputed missing values with the median; for features with few missing values (< 15%), missing values were imputed with the mean [63]. The initial missing rate of the features used in this study is illustrated in Fig. 6.
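The imputation rule can be written as a short sketch. It follows the thresholds stated above (drop above 50%, median for 15–50%, mean below 15%); representing missing values as `None`, the function name, and the example values are assumptions made for illustration.

```python
from statistics import mean, median

def impute_feature(values):
    """Impute one feature column; None marks a missing value.

    Returns the imputed list, or None if the feature should be dropped.
    """
    frac_missing = sum(v is None for v in values) / len(values)
    observed = [v for v in values if v is not None]
    if frac_missing > 0.50:
        return None                      # "> 50%" group: remove the feature
    if frac_missing == 0:
        return list(values)              # "0%" group: nothing to impute
    # "0-15%" group uses the mean, "15-50%" group uses the median
    fill = mean(observed) if frac_missing < 0.15 else median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical columns with 10%, 30%, and 60% missing values
few_missing = impute_feature([1.0, 2.0, 3.0, None, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
some_missing = impute_feature([1.0, None, None, None, 2.0, 3.0, 100.0, 4.0, 5.0, 6.0])
mostly_missing = impute_feature([None] * 6 + [1.0] * 4)
```

In the second column the outlier 100.0 would pull the mean to about 17.3, whereas the median stays at 4.0, which is one reason the median is a common choice when a larger share of values is missing.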

Fig. 6. Missing degree of initial features

(3) Scaling: Each feature is scaled independently, using statistics (mean and standard deviation) computed on the training set. Each feature was standardized to zero mean and unit standard deviation, as shown in Eq. 1.

$$x_{\text{scaled}} = \frac{x - x_{\text{mean}}}{x_{\text{std}}} \tag{1}$$
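Eq. 1 can be illustrated with a small sketch in which the mean and standard deviation are fitted on the training values only and then reused for the test values. The use of the population standard deviation and the example numbers are assumptions; the text does not state which estimator was used.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Compute the scaling statistics on the training set only."""
    return mean(train_values), pstdev(train_values)

def transform(values, mu, sigma):
    """Apply Eq. 1 with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]

# Hypothetical values of one feature
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [5.0, 11.0]

mu, sigma = fit_scaler(train)            # mean 5.0, standard deviation 2.0
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)
```

Reusing the training statistics for the test set keeps information about the test distribution from leaking into the model.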

Model framework

In this section, a novel interconnected PE prediction methodology based on a real dataset is proposed. The method involves the following steps:

First stage of the proposed machine learning framework

After extracting the data from the EHR, we had 42 risk factors. In this step, the dataset with 42 features is fed to different machine learning classification methods, which output the PE diagnosis. We found that the LightGBM model had the best accuracy and overall performance, so we applied the LightGBM feature importance technique to identify the most influential features. Our clinicians approved the resulting importance ranking. Given the high validity and speed of the LightGBM method, we identified the top features, removed the trivial ones [37], and tested the LightGBM model again with only the top selected features; the accuracy of PE prediction was approximately the same as with all 42 features. This indicates that the removed features contribute little to PE detection.

LightGBM. LightGBM is a gradient boosting framework [64] designed by Microsoft [65] that aims to enhance computational performance so that the prediction problem can be solved quickly and effectively. The algorithm is based on decision trees and can be used for both classification and ranking. LightGBM adopts a histogram-based algorithm and a leaf-wise tree growth strategy with a maximum depth limit to speed up training and decrease memory consumption. Leaf-wise growth is an effective strategy for growing trees. Suppose we want to build a LightGBM model with T trees; for a given dataset with n samples, the additive training process can be described as in Eq. 2 [66]:

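$$
\begin{aligned}
\hat{y}_i^{(0)} &= 0 \\
\hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i) \\
\hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i) \\
\hat{y}_i^{(t)} &= \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)
\end{aligned} \tag{2}
$$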

where $\hat{y}_i^{(t)}$ is the prediction for the i-th example at the t-th iteration and $f_t$ is the function learned for the t-th decision tree. In each iteration, we keep the current model $\hat{y}_i^{(t-1)}$ and add a new function $f_t$ to it. The functions $f_t$ of all iterations are learned by minimizing Eq. 3.

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{t=1}^{T} \Omega(f_t) \tag{3}$$

The first term is the loss function measuring the difference between the prediction $\hat{y}_i^{(t)}$ and the target $y_i$, and the second is the regularization term, which penalizes the complexity of the model [67]. Information gain indicates the expected reduction in entropy caused by splitting a node on an attribute. At each step, the leaf-wise strategy finds the leaf with the largest splitting gain among all current leaves, splits it, and repeats this process; in other words, it grows the leaf with the maximum delta loss. Compared with the level-wise growth strategy, the leaf-wise strategy can reduce more error and obtain better accuracy for the same number of splits. LightGBM therefore adds a maximum depth limit to leaf-wise growth to maintain high performance while preventing overfitting.
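The additive update in Eq. 2 can be traced with a toy sketch. The three "trees" below are hypothetical one-split stumps with made-up thresholds and leaf values; they only illustrate how each iteration adds one function to the running prediction and do not show how LightGBM learns the trees.

```python
def boosted_predictions(trees, x):
    """Return [y_hat^(0), y_hat^(1), ..., y_hat^(T)] for a single sample x."""
    y_hat = 0.0                 # y_hat^(0) = 0
    history = [y_hat]
    for f_t in trees:
        y_hat = y_hat + f_t(x)  # y_hat^(t) = y_hat^(t-1) + f_t(x)
        history.append(y_hat)
    return history

# Hypothetical stumps on one numeric feature
trees = [
    lambda x: 0.5 if x > 1.0 else -0.5,
    lambda x: 0.25 if x > 2.0 else -0.25,
    lambda x: 0.125 if x > 0.0 else -0.125,
]
history = boosted_predictions(trees, x=1.5)
```

For x = 1.5 the running prediction moves from 0.0 to 0.5, then 0.25, then 0.375, which equals the sum of the three tree outputs, as the closed form in Eq. 2 states.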

Feature importance. Feature importance refers to techniques that compute a score for each input feature of a specific model; the scores represent the “importance” of each feature. A higher score means that the feature has more effect on the model being used [68]. The feature importance ranking here is based on the importance type ‘‘split’’ (in the ‘‘feature_importance’’ function), which counts the number of times the feature is used for splitting in LightGBM.
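What the ‘‘split’’ importance type measures can be shown without the library by counting split nodes in a toy ensemble. The nested-dictionary tree format and the feature names below are hypothetical and chosen only for illustration; in LightGBM itself the count is returned by `Booster.feature_importance(importance_type="split")`.

```python
from collections import Counter

def count_splits(node, counts):
    """Recursively count how often each feature is used as a split."""
    if "feature" in node:            # internal node; leaves have no "feature" key
        counts[node["feature"]] += 1
        count_splits(node["left"], counts)
        count_splits(node["right"], counts)

def split_importance(trees):
    """Return (feature, split count) pairs, most frequently used first."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return counts.most_common()

leaf = {"leaf": 0.0}
# Hypothetical three-tree ensemble
trees = [
    {"feature": "d_dimer", "left": {"feature": "icu", "left": leaf, "right": leaf}, "right": leaf},
    {"feature": "d_dimer", "left": leaf, "right": {"feature": "age", "left": leaf, "right": leaf}},
    {"feature": "icu", "left": leaf, "right": {"feature": "d_dimer", "left": leaf, "right": leaf}},
]
ranking = split_importance(trees)
top_2 = [name for name, _ in ranking[:2]]
```

Keeping the first k entries of such a ranking corresponds to the top-feature selection performed in the first stage.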

The machine learning models were applied to the input data; since LightGBM performed best, the configuration of this model is explained here.

Parameter setting consists of the following steps:

  1. Set parameter ‘‘num_leaves’’ to control the complexity of the tree model.

  2. Set ‘‘max_depth’’ to limit the tree depth explicitly and avoid constructing trees that are too deep.

  3. Set ‘‘min_data_in_leaf’’ to prevent over-fitting in a leaf-wise tree; this parameter specifies the minimum number of samples per leaf node and is important for controlling over-fitting of leaf-wise grown trees.

  4. Set the ‘‘learning_rate’’ [69].

  5. Refit the classifier, repeating the process with k-fold cross-validation [70].

  6. Perform feature importance analysis and identify the top effective features for PE prediction (Fig. 7).
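The four parameters named in the steps above could be collected in a configuration dictionary such as the one sketched here. The numeric values are placeholders, not the settings used in the study, which are not reported in this section; the small check encodes the commonly cited LightGBM tuning guideline that ‘‘num_leaves’’ should not exceed 2^max_depth.

```python
# Hypothetical LightGBM parameter set; every value is an illustrative placeholder
lgbm_params = {
    "objective": "binary",     # PE present vs. PE absent
    "num_leaves": 15,          # controls tree complexity
    "max_depth": 4,            # explicit depth limit for leaf-wise growth
    "min_data_in_leaf": 20,    # minimum samples per leaf, guards against over-fitting
    "learning_rate": 0.05,
}

def leaves_fit_depth(params):
    """A tree of depth d has at most 2**d leaves, so num_leaves should not exceed it."""
    return params["num_leaves"] <= 2 ** params["max_depth"]

params_consistent = leaves_fit_depth(lgbm_params)
```

A dictionary of this form is what would be tuned and re-evaluated in each cross-validation round.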

Fig. 7. Feature importance process

Second stage of the proposed machine learning framework

Following the previous step, in this stage, ineffective features as well as features that have a low impact were eliminated, and only the most influential ones remained.

A brief description of the second stage of the proposed framework is provided below:

  1. A set of different classification algorithms was applied: Ridge classification, Catboost, Decision tree, and Nearest Neighbor.

  2. The hyperparameters of each algorithm were optimized using an annealing algorithm provided by the Python hyperopt package [44].

  3. Model selection was repeated with K-fold cross-validation (k = 3).

  4. Performance was measured with the F1-score, a reliable classification metric that combines precision and recall.
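The threefold cross-validation used in step 3 can be sketched as an index generator. The contiguous fold layout and the sample count of 740 (80% of 925, assuming cross-validation is run on the training portion) are assumptions for illustration; the text does not specify either.

```python
def k_fold_indices(n_samples, k=3):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # the first (n_samples % k) folds receive one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(k_fold_indices(740, k=3))
fold_sizes = [len(validation) for _, validation in folds]
```

Each sample is used for validation exactly once, and the scores of the three rounds are averaged to compare models and hyperparameter settings.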

The overall workflow of the stages of this paradigm from data collection to the application of machine learning algorithms to predict PE is illustrated in Fig. 8. In this section, a brief explanation of the ML algorithms used in the second stage is described.

Fig. 8. Overview of interconnected ML framework for PE prediction

Ridge Classification. The Ridge classifier treats classification as a regression problem. This classification model is based on the Ridge regression method and converts the label data into {−1, 1}. The class with the highest predicted value is accepted as the target class [71].
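The idea can be illustrated with a deliberately small sketch: a ridge regression with a single feature and an unpenalized intercept, fitted on labels mapped to {−1, +1}, with the sign of the output giving the class. The single-feature restriction, the value of the penalty `alpha`, and the data are simplifying assumptions; the study used 24 features and a library implementation.

```python
def fit_ridge_classifier(x, labels, alpha=1.0):
    """Fit y = w*x + b on targets in {-1, +1}; labels are 0 (absent) or 1 (present)."""
    y = [1.0 if label == 1 else -1.0 for label in labels]
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    w = s_xy / (s_xx + alpha)   # alpha shrinks the weight toward zero
    b = y_mean - w * x_mean     # the intercept is not penalized
    return w, b

def predict(x_new, w, b):
    """Class 1 if the regression output is positive, else class 0."""
    return 1 if w * x_new + b > 0 else 0

# Hypothetical standardized feature values and labels
x = [-2.0, -1.0, 1.0, 2.0]
labels = [0, 0, 1, 1]
w, b = fit_ridge_classifier(x, labels, alpha=2.0)
```

Here the unpenalized slope would be 6/10 = 0.6, and the penalty shrinks it to 6/12 = 0.5, while the decision boundary stays at x = 0.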

Catboost. The Catboost classifier is an algorithm for gradient boosting on decision trees, developed by Yandex researchers and engineers. CatBoost uses symmetric (balanced) trees, in which the same splitting condition is applied across all nodes at the same depth of the tree. The classifier can work with different data types, usually provides best-in-class accuracy and state-of-the-art results with high performance, handles categorical features automatically, and is easy to use [72].

Decision tree. A decision tree classifier is a tree-structured classifier in which internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents an outcome. A decision tree has two types of nodes: decision nodes and leaf nodes. Creating sub-nodes increases the homogeneity of the resulting sub-nodes [38, 70].

Nearest neighbor. The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point and to predict the label from these. Nearest neighbor classification is an ML method that labels previously unseen query objects, distinguishing between two or more classes. Like any classifier, it requires training data with given labels and is thus an instance of supervised learning [70].
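The nearest neighbor principle can be sketched in a few lines. Euclidean distance, k = 3, majority voting, and the two-dimensional example points are assumptions chosen for illustration; the distance metric and k used in the study are not stated here.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set with two well-separated classes
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = [0, 0, 0, 1, 1, 1]
pred_a = knn_predict(points, labels, (0.5, 0.5))
pred_b = knn_predict(points, labels, (5.5, 5.5))
```

Because the prediction depends directly on distances, features on very different scales would dominate the vote, which is one reason standardization matters for this method.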

We compared the results of classification models of different types (e.g. tree-based, distance-based, and boosting). All models achieved acceptable results; however, Ridge classification had the best F1-score and accuracy.

Performance metrics

We used Classification accuracy (CA), Precision, Recall, and F1-score to evaluate the performance of the model.

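$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{5}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{6}$$

$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7}$$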

We also use the Confusion Matrix (CM) to evaluate the performance of a classification model. The CM summarizes the numbers of correct and incorrect outputs as count values [73].

True positives (TP) and true negatives (TN) occur when the predicted value matches the actual value. A false positive (FP) occurs when the actual value is negative but the model predicts a positive value, and a false negative (FN) occurs when the actual value is positive but the model predicts a negative value [17].
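The definitions above translate directly into code. The sketch below counts the four confusion matrix cells from lists of actual and predicted labels (1 = PE present, 0 = PE absent) and computes the four metrics; the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 score from the four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical actual and predicted labels for eight patients
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
scores = metrics(tp, tn, fp, fn)
```

In this example one positive case is missed and one negative case is flagged, so all four metrics equal 0.75.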

Results

In this section, we implement our proposed framework on the data collected from different departments of Sina educational hospital, a major tertiary hospital in Tehran, Iran. Sina University Hospital is the first Iranian hospital, established in 1837 in the heart of Tehran's historical district. In this study, patients from different departments of the hospital, as well as outpatients, were screened. After follow-up, 925 patients were suspected of PE. After feature extraction and selection, 42 features were pre-selected.

Implementing the first stage

As a first step, we implemented different machine learning classification methods for PE diagnosis. We found that the LightGBM model had better accuracy, so we followed up by using the LightGBM feature importance attribute to highlight the most important variables and eliminate redundant and irrelevant features. The hyper-parameters tuned to improve model efficiency were the learning rate, maximal tree depth, and subsample rate. The performance of LightGBM and the result of the feature importance analysis are shown in Figs. 9 and 10. We first applied LightGBM to the input data with 42 features and then performed feature importance analysis. As shown in Fig. 9, we extracted the top 24 features for further analysis.

Fig. 9. Feature importance

Fig. 10. Feature importance rates calculated by LightGBM

Our clinicians approved and confirmed the LightGBM feature importance results. We then tested the LightGBM model with the 24 selected features to evaluate its performance and found approximately similar results for 42 and 24 features, which indicates that the removed features are not effective in PE prediction. The F1-scores of the LightGBM model with 42 features and with the top 24 selected features are shown in Table 1.

Table 1. F1-score of LightGBM model with 42 features and top 24 selected features

| | 42 features | Top 24 selected features |
|---|---|---|
| F1-score (LightGBM model) | 90.02 | 90.89 |

The most highly related features were early intubation at the hospital, hospitalization in the ICU, and the presence of malignancy; COVID-19 status, chest pain, and leg edema ranked lower but still retained high importance. Chief complaints of accident and surgery also showed greater importance. Laboratory test parameters such as D-dimer are likewise crucial for PE prediction. The top 24 features were extracted for further analysis.

Implementing the second stage

In step 2, we took the features obtained from step 1 and applied four ML classification methods: Ridge classification, CatBoost, Decision Tree, and Nearest Neighbor. The PE prediction results of these methods are listed in Table 2 and compared in Fig. 11. We calculated the averaged precision, recall, F1 score, and accuracy using threefold cross-validation. The results indicate that these methods generally perform well in F1 score, precision, and accuracy, with Nearest Neighbor lagging in recall and F1 score, and that Ridge classification is superior with an F1 score of 0.968.
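Averaging a metric over threefold cross-validation, as described above, can be illustrated with a short pure-Python sketch. It rests on several assumptions: the folds are contiguous and unshuffled (real pipelines usually shuffle or stratify), the data are toy values, and `threshold_rule` is a hypothetical stand-in for a real classifier such as Ridge classification, not the model used in the study.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]
FitPredict = Callable[[List[Row], List[int], List[Row]], List[int]]


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous folds of (train, test) indices."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    folds = []
    start = 0
    for fold in range(k):
        # The first (n_samples % k) folds receive one extra sample.
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if not start <= i < stop]
        folds.append((train, test))
        start = stop
    return folds


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1); returns 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cross_validated_f1(X: List[Row], y: List[int], fit_predict: FitPredict, k: int = 3) -> float:
    """Train and evaluate on each fold, then average the per-fold F1 scores."""
    scores = []
    for train, test in k_fold_indices(len(y), k):
        y_pred = fit_predict([X[i] for i in train], [y[i] for i in train], [X[i] for i in test])
        scores.append(f1_score([y[i] for i in test], y_pred))
    return mean(scores)


def threshold_rule(train_X: List[Row], train_y: List[int], test_X: List[Row]) -> List[int]:
    """Hypothetical stand-in model: threshold halfway between the class means.

    Assumes both classes are present in the training fold.
    """
    pos = [row[0] for row, label in zip(train_X, train_y) if label == 1]
    neg = [row[0] for row, label in zip(train_X, train_y) if label == 0]
    threshold = (mean(pos) + mean(neg)) / 2
    return [1 if row[0] > threshold else 0 for row in test_X]


# Toy data: one feature, perfectly separable.
X = [[0.1], [0.9], [0.2], [0.8], [0.3], [0.7]]
y = [0, 1, 0, 1, 0, 1]
avg_f1 = cross_validated_f1(X, y, threshold_rule, k=3)
print(f"mean F1 over 3 folds: {avg_f1:.3f}")
```

The same loop can average precision, recall, and accuracy by swapping the metric function; each sample is used for testing exactly once.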

Table 2. The results of different algorithms on test set

Method                Precision  Recall  F1 score  Accuracy
Ridge classification  0.957      0.978   0.968     0.967
CatBoost              0.956      0.956   0.945     0.956
Decision Tree         0.893      0.875   0.891     0.881
Nearest Neighbor      0.936      0.709   0.68      0.774

Fig. 11. F1 score

The confusion matrix of the Ridge classification model on the test set is shown in Fig. 12. It is a tabular representation of the model's predictions against the actual values of the test set, and it consists of four combinations of predicted and actual values: True Positive, True Negative, False Positive, and False Negative. True Positive: the model correctly predicted 91 times that the patient has PE. True Negative: the model correctly predicted 88 times that the patient does not have the disease. False Negative: the model predicted 2 times that the patient does not have the disease when the patient actually has it. False Positive: the model predicted 4 times that the patient has the disease when the patient does not.

                    Actually Positive  Actually Negative
Predicted Positive  True Positives     False Positives
Predicted Negative  False Negatives    True Negatives

Fig. 12. Confusion matrix

As shown in Fig. 12, the low share of false negatives (approximately < 1.5% of the test cases) indicates that this interconnected ML model can perform well. Such a low probability of misdiagnosis may be considered acceptable when weighed against a costly diagnostic workup across a population.
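The confusion-matrix counts reported for the Ridge classification model (TP = 91, TN = 88, FP = 4, FN = 2) can be converted into the evaluation metrics with the standard formulas. The sketch below is a worked check rather than part of the original pipeline; the function name and dictionary keys are introduced here for illustration. The resulting precision, recall, F1 score, and accuracy agree with the Ridge row of Table 2 up to rounding in the last digit.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        # Equivalent to the harmonic mean of precision and recall.
        "f1": 2 * tp / (2 * tp + fp + fn),
        "accuracy": (tp + tn) / total,
        # False negatives as a share of all test cases.
        "fn_share_of_all_cases": fn / total,
        # False negatives as a share of the actually positive cases.
        "false_negative_rate": fn / (fn + tp),
    }


# Counts reported for the Ridge classification confusion matrix.
metrics = classification_metrics(tp=91, tn=88, fp=4, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that the two false-negative figures differ: 2 of 185 test cases is about 1.1%, whereas 2 of the 93 actually positive cases is about 2.2%.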

Discussion

This research proposed a two-stage interconnected ML model for predicting PE, which was carried out with the aim of better and more efficient management of PE diagnosis. This approach supports a proper and prompt diagnosis of PE and reduces unnecessary CTA imaging, which places an exorbitant burden and cost on the healthcare system [74].

Despite the existence of definitive PE recognition strategies, diagnosing PE remains challenging. Existing methods based on scoring systems and CTA [1] lead to unneeded imaging, which increases costs for the healthcare system and insurance companies. Scoring systems such as the Wells and revised Geneva scores are the most commonly used for PE prognosis [42]. The sensitivity of the Wells score has been reported as 63.8 to 79.3% [15], and that of the revised Geneva score as around 55.3% [75]. Because of the low yield of the scoring systems, CTA imaging has been used extensively [76]. A comparison between the performance of the scoring systems and this machine learning PE diagnosis method, which reached an F1 score of 96.8% (accuracy 96.7%, Table 2), is shown in Table 3.

Table 3. Comparison between accuracy of the scoring systems and this machine learning model

PE prognosis method          Performance
Wells                        63.8-79.3% [15, 75, 77]
rGeneva                      55.3-60% [75, 78]
CTA                          One-third of imaging tests are avoidable [76, 79]
This machine learning model  96.8%

Reports indicate that one-third of the imaging tests performed to diagnose pulmonary embolism are avoidable, costing the healthcare system more than $100 million annually [76, 79]. Our collected dataset is consistent with this, since only about 25% of the suspected cases were PE-positive; hence, about three-quarters of the performed imaging can be considered avoidable. The yield of the current methods must be weighed against redundant costs for patients and healthcare systems and against harm from ionizing radiation, intravenous contrast material, and allergic reactions.

Limitations of the existing diagnosis methods are shown in Table 4.

Table 4. Limitation of the existing diagnosis methods

Current diagnostic methods        Limitations
Scoring systems (Wells, rGeneva)  Lack of standardization [1, 76]
                                  Not reliable [38, 76]
CTA imaging                       Risk of radiation [3, 48, 80]
                                  Risk of intravenous contrast material [1, 3]
                                  Low incidental findings [10]
                                  Allergic reactions [80]
                                  Costly [25, 73, 76, 81]
                                  Time-consuming [4, 36]

From the literature survey, we found that previous studies were restricted to image analysis or to the specific characteristics covered by scoring-system criteria, and that they obtained accuracies ranging between 0.63 and 0.9 [1]. The motivation of this study was to use EHR data drawn from many aspects of care, such as real-world treatment and follow-up data, real-time data, medical history, and clinical laboratory test results, rather than focusing only on scoring-system criteria such as Wells or rGeneva. This was associated with improved data integrity and diversity and therefore with better sensitivity in the prediction results. For this work, we applied a two-step interconnected ML technique to retrospective structured EHR data (feature importance analysis based on LightGBM, the most accurate model in the first stage, followed by further ML models) and achieved an F1 score of 0.96. Various strategies were used to enhance dataset quality, including ML algorithms to handle missing data, outliers, and scaling. Missing data in a dataset are often informative [82], and hence the availability of some variables can itself be a predictive factor associated with clinical states. In the quality assessment of features, data quality increased over the first stage of this paradigm and was associated with high-performance PE prognosis from clinical states.
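The idea that missingness itself can carry information can be made concrete with a small preprocessing sketch. It is only an illustration under stated assumptions: the text does not specify which imputation or scaling method was used, so median imputation with a missing-value indicator and min-max scaling are swapped in here as simple stand-ins, and the D-dimer values are hypothetical.

```python
from statistics import median
from typing import List, Optional, Tuple


def impute_with_indicator(values: List[Optional[float]]) -> Tuple[List[float], List[int]]:
    """Fill missing entries with the median and flag where a value was missing."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    # The indicator keeps the information that the measurement was absent.
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical D-dimer measurements; None marks a test that was not ordered.
d_dimer = [0.4, None, 2.0, 1.2, None]
imputed, was_missing = impute_with_indicator(d_dimer)
scaled = min_max_scale(imputed)
print(imputed, was_missing, scaled)
```

Feeding both the scaled value and the indicator column to a classifier lets the model use the availability of a variable as a predictor in its own right.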

Due to privacy concerns, data collection in medical research is time-consuming, costly, and limited by medical regulations. The empirical results of this model indicate the clinical advantages of this research. A method that uses routinely collected patient healthcare data to arrive at a patient-specific prognosis could, if it works successfully, better inform care decisions for patients with suspected PE. The novelty of this ML paradigm is that it provides an accurate and timely prognosis for PE diagnosis based on EHR data.

From this viewpoint, this interconnected ML model offers a disruptive innovation in PE diagnosis by decreasing unnecessary imaging and reducing costs and complications for patients and medical systems. We therefore believe it has the potential to improve the use of CTA imaging in a cost-effective way. The model predicts PE from patients' raw clinical data and is noninvasive, simple, and precise, which makes it well suited to PE diagnosis. Table 5 summarizes the benefits of this two-step interconnected ML model for PE diagnosis for different parts of society.

Table 5. Benefits of interconnected machine learning model for PE diagnosis

Different parts of society  Benefits of a two-step interconnected ML prediction model for PE detection
Clinicians                  Accurate and rapid diagnosis of PE
                            Prompt therapeutic decisions
Healthcare system           Eliminate unneeded CTA imaging
                            Proper resource allocation management
                            Reduce costs
Patients                    Decrease risk of intravenous contrast material and radiation
                            Receive proper treatment measures in a timely manner
                            Reduce costs
Insurance company           Reduce costs
AI developers               Develop PE detection apps

This study may augment various findings on prognosis. For example, this cohort may help provide a better understanding of the association between specific medical history and improved survival outcomes, and may help model clinical development with comprehensive variables collected in routine clinical care.

There are several limitations to this retrospective study. First, creating a dataset is a very time-consuming and difficult process, and the lack of data is a challenging issue because of medical privacy regulations. When not enough data are available, researchers turn to techniques such as data augmentation, transfer learning, and model fine-tuning to enhance prediction accuracy, and access to a wider range of cases can improve performance. In addition, some variables are hand-selected (heart rate, saturation, etc.), so the probability of errors is high, and some variables had missing values for which we used imputation methods. Second, the CTA-imaged cases used for training and the test set come from the same hospital, which might bias the results positively; evaluation on external datasets remains for future work. Third, the proposed method only predicts PE; it does not localize the embolus or classify it as obstructive or non-obstructive, which may not be possible with this dataset.

Conclusion

In this paper, a two-step interconnected ML algorithm is applied to predict PE and to manage the PE diagnosis protocol. Many studies have shown that CTA imaging comes with challenges, such as the availability of radiologists for a quick response and human error; on the other hand, reliance on the restricted features of scoring-system criteria has been a gap in this field.

In this diagnostic study, 925 suspected PE patients' EHR data were extracted from different departments of Sina educational hospital affiliated with the Tehran University of medical sciences, Iran.

We performed a comprehensive feature importance analysis with the LightGBM method, chosen for its best performance in PE prediction in the first stage (F1-score = 0.90), to identify the top features. When training LightGBM, we first used all 42 features as input and then selected the top features according to the feature importance analysis. Our clinicians also approved the results. We then applied different ML models to the 24 selected features. Experimental results demonstrated that combining LightGBM feature importance analysis with other ML methods such as Ridge classification, CatBoost, Decision Tree, and Nearest Neighbour is practical for PE prediction. Ridge classification had the best results (F1 score = 0.96). The findings of this study suggest that this two-step interconnected ML model, using retrospective temporal patient EHR data, can serve as an automated tool to improve the use of CTA imaging and appears to be valuable and feasible for providing accurate information for clinical decision-making. In addition, it can help the healthcare system by reducing costs and improving the diagnosis process. The stated aim of this research was to predict PE based on EHR data, so common machine learning methods were applied. Developing new ML models to achieve higher performance is proposed for future research.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships between the authors and any organization that could have appeared to influence the work reported in this paper.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Banerjee I, Sofela M, Yang J, et al. Development and performance of the pulmonary embolism result forecast model (PERFORM) for computed tomography clinical decision support. JAMA Netw Open. 2019. 10.1001/jamanetworkopen.2019.8719.
  • 2. Ma H, Sheng W, Li J, et al. A novel hierarchical machine learning model for hospital-acquired venous thromboembolism risk assessment among multiple-departments. J Biomed Inform. 2021;122:103892. 10.1016/j.jbi.2021.103892.
  • 3. Cano-Espinosa C, Cazorla M, González G. Computer aided detection of pulmonary embolism using multi-slice multi-axial segmentation. Appl Sci. 2020. 10.3390/APP10082945.
  • 4. Huang SC, Kothari T, Banerjee I, et al. PENet - a scalable deep-learning model for automated diagnosis of pulmonary embolism using volumetric CT imaging. NPJ Digit Med. 2020. 10.1038/s41746-020-0266-y.
  • 5. Shi L, Rajan D, Abedin S, et al (2020) Automatic diagnosis of pulmonary embolism using an attention-guided framework: a large-scale study. In Medical imaging with deep learning, pp 743–754. PMLR
  • 6. Shi L, Dehghan E (2020) Automatic diagnosis of pulmonary embolism using an attention-guided framework: a large-scale study. 1–12
  • 7. Kiourt C, Feretzakis G, Dalamarinis K, Kalles D (2021) Pulmonary embolism identification in computerized tomography pulmonary angiography scans with deep learning technologies in COVID-19 patients. arXiv:2105.11187
  • 8. Valle C, Bonaffini PA, Dal Corso M, et al. Association between pulmonary embolism and COVID-19 severe pneumonia: experience from two centers in the core of the infection Italian peak. Eur J Radiol. 2021. 10.1016/j.ejrad.2021.109613.
  • 9. Sakr Y, Giovini M, Leone M, et al. Pulmonary embolism in patients with coronavirus disease-2019 (COVID-19) pneumonia: a narrative review. Ann Intensive Care. 2020;10:1–13.
  • 10. Thachil R, Nagraj S, Kharawala A, Sokol SI. Pulmonary embolism in women: a systematic review of the current literature. J Cardiovasc Dev Dis. 2022. 10.3390/jcdd9080234.
  • 11. Morís DI, de Moura Ramos JJ, Buján JN, Hortas MO. Data augmentation approaches using cycle-consistent adversarial networks for improving COVID-19 screening in portable chest X-ray images. Expert Syst Appl. 2021;185:115681. 10.1016/j.eswa.2021.115681.
  • 12. Kiourt C, Feretzakis G, Dalamarinis K, et al (2021) Pulmonary embolism identification in computerized tomography pulmonary angiography scans with deep learning technologies in COVID-19 patients. arXiv:2105.11187
  • 13. Mountain D, Keijzers G, Chu K, et al. Correction: RESPECT-ED: rates of pulmonary emboli (PE) and sub-segmental PE with modern computed tomographic pulmonary angiograms in emergency departments: a multi-center observational study finds significant yield variation, uncorrelated with use or smal. PLoS ONE. 2017;12:2015–8. 10.1371/journal.pone.0184219.
  • 14. Kocher KE, Meurer WJ, Fazel R, Scott PA. National trends in use of computed tomography in the emergency department. YMEM. 2011;58:452-462.e3. 10.1016/j.annemergmed.2011.05.020.
  • 15. Wang RC, Bent S, Weber E, et al. The impact of clinical decision rules on computed tomography use and yield for pulmonary embolism: a systematic review and meta-analysis. Ann Emerg Med. 2016;67:693-701.e3. 10.1016/j.annemergmed.2015.11.005.
  • 16. Shahid O, Nasajpour M, Pouriyeh S, et al. Machine learning research towards combating COVID-19: virus detection, spread prevention, and medical assistance. J Biomed Inform. 2021;117:103751. 10.1016/j.jbi.2021.103751.
  • 17. Rucco M, Rodrigues DS, Merelli E, et al. Neural hypernetwork approach for pulmonary embolism diagnosis. BMC Res Notes. 2015. 10.1186/s13104-015-1554-5.
  • 18. Puaschunder JM. The potential for artificial intelligence in healthcare. SSRN Electron J. 2020;6:94–8. 10.2139/ssrn.3525037.
  • 19. Rysavy M. Evidence-based medicine: a science of uncertainty and an art of probability. Virtual Mentor. 2013;15:4–8. 10.1001/virtualmentor.2013.15.1.fred1-1301.
  • 20. Menegotto AB, Becker CDL, Cazella SC. Computer-aided diagnosis of hepatocellular carcinoma fusing imaging and structured health data. Heal Inf Sci Syst. 2021. 10.1007/s13755-021-00151-x.
  • 21. Wu C, Guo S, Hong Y, et al. Discrimination and conversion prediction of mild cognitive impairment using convolutional neural networks. Quant Imaging Med Surg. 2018;8:992–1003.
  • 22. Fisher CK, Smith AM, Walsh JR, et al. Machine learning for comprehensive forecasting of Alzheimer’s disease progression. Sci Rep. 2019. 10.1038/s41598-019-49656-2.
  • 23. Arco JE, Ramírez J, Górriz JM, Ruz M. Data fusion based on Searchlight analysis for the prediction of Alzheimer’s disease. Expert Syst Appl. 2021. 10.1016/j.eswa.2021.115549.
  • 24. Thabtah F, Spencer R, Ye Y. The correlation of everyday cognition test scores and the progression of Alzheimer’s disease: a data analytics study. Heal Inf Sci Syst. 2020. 10.1007/s13755-020-00114-8.
  • 25. Ryan L, Mataraso S, Siefkas A, et al. A machine learning approach to predict deep venous thrombosis among hospitalized patients. Clin Appl Thromb. 2021. 10.1177/1076029621991185.
  • 26. Wiener RS, Gould MK, Arenberg DA, et al. An official American Thoracic Society/American College of Chest Physicians policy statement: implementation of low-dose computed tomography lung cancer screening programs in clinical practice. Am J Respir Crit Care Med. 2015;192:881–91. 10.1164/rccm.201508-1671ST.
  • 27. Danzi GB, Loffi M, Galeazzi G, Gherbesi E. Acute pulmonary embolism and COVID-19 pneumonia: a random association? Eur Heart J. 2020;41:1858. 10.1093/eurheartj/ehaa254.
  • 28. Sadik F, Dastider AG, Subah MR, et al. A dual-stage deep convolutional neural network for automatic diagnosis of COVID-19 and pneumonia from chest CT images. Comput Biol Med. 2022;149:105806. 10.1016/j.compbiomed.2022.105806.
  • 29. Feki I, Ammar S, Kessentini Y, Muhammad K. Federated learning for COVID-19 screening from Chest X-ray images. Appl Soft Comput. 2021;106:107330. 10.1016/j.asoc.2021.107330.
  • 30. Shorten C, Khoshgoftaar TM, Furht B. Deep learning applications for COVID-19. J Big Data. 2021. 10.1186/s40537-020-00392-9.
  • 31. Goel K, Sindhgatta R, Kalra S, et al. The effect of machine learning explanations on user trust for automated diagnosis of COVID-19. Comput Biol Med. 2022;146:105587. 10.1016/j.compbiomed.2022.105587.
  • 32. Bertsimas D, Borenstein A, Mingardi L, et al. Personalized prescription of ACEI/ARBs for hypertensive COVID-19 patients. Health Care Manag Sci. 2021;24:339–55. 10.1007/s10729-021-09545-5.
  • 33. Liu Y, Qin J, Fan Y, et al. Estimation of infection density and epidemic size of COVID-19 using the back-calculation algorithm. Heal Inf Sci Syst. 2020. 10.1007/s13755-020-00122-8.
  • 34. Yang Y, Li Y, Chen R, et al. Risk prediction of renal failure for chronic disease population based on electronic health record big data. Big Data Res. 2021. 10.1016/j.bdr.2021.100234.
  • 35. Bertsimas D, Orfanoudaki A, Weiner RB. Personalized treatment for coronary artery disease patients: a machine learning approach. Health Care Manag Sci. 2020;23:482–506. 10.1007/s10729-020-09522-4.
  • 36. Schmuelling L, Franzeck FC, Nickel CH, et al. Deep learning-based automated detection of pulmonary embolism on CT pulmonary angiograms: no significant effects on report communication times and patient turnaround in the emergency department nine months after technical implementation. Eur J Radiol. 2021;141:109816. 10.1016/j.ejrad.2021.109816.
  • 37. Soffer S, Klang E, Shimon O, et al. Deep learning for pulmonary embolism detection on computed tomography pulmonary angiogram: a systematic review and meta-analysis. Sci Rep. 2021;11:1–8. 10.1038/s41598-021-95249-3.
  • 38. Serpen G, Tekkedil DK, Orra M. A knowledge-based artificial neural network classifier for pulmonary embolism diagnosis. Comput Biol Med. 2008;38:204–20. 10.1016/j.compbiomed.2007.10.001.
  • 39. Manshad A, Akbilgic O, Brailovsky Y, et al. Machine learning-based prediction of 30-day all-cause mortality in patients hospitalized with acute pulmonary embolism. Chest. 2020;158:A2213–4. 10.1016/j.chest.2020.08.1892.
  • 40. Jenab Y, Hosseini K, Esmaeili Z, et al. Prediction of in-hospital adverse clinical outcomes in patients with pulmonary thromboembolism, machine learning based models. Front Cardiovasc Med. 2023;10:1–10. 10.3389/fcvm.2023.1087702.
  • 41. Arbet J, Brokamp C, Meinzen-derr J, et al. Lessons and tips for designing a machine learning study using EHR data. J Clin Transl Sci. 2020. 10.1017/cts.2020.513.
  • 42. Ma L, Zhang C, Wang Y, et al (2020) ConCare: personalized clinical feature embedding via capturing the healthcare context. In: AAAI 2020 - 34th AAAI conference on artificial intelligence, pp. 833–40. 10.1609/aaai.v34i01.5428
  • 43. Leontjeva A, Kuzovkin I (2016) Combining static and dynamic features for multivariate sequence classification. In: Proceedings of 3rd IEEE international conference on data science and advanced analytics DSAA 2016, pp. 21–30. 10.1109/DSAA.2016.10
  • 44. Kumar A (2018) A framework for malware detection with static features using machine learning algorithms. A thesis submitted by Ajit Kumar in partial fulfillment of the requirements for the award of the degree. 10.13140/RG.2.2.35593.90723
  • 45. Li Z, Zhao S, Chen Y, et al. A deep-learning-based framework for severity assessment of COVID-19 with CT images. Expert Syst Appl. 2021;185:115616. 10.1016/j.eswa.2021.115616.
  • 46. Lucas PJF. Logic engineering in medicine. Knowl Eng Rev. 1995;10:153–79. 10.1017/S0269888900008134.
  • 47. Scudiero F, Silverio A, Di Maio M, et al. Pulmonary embolism in COVID-19 patients: prevalence, predictors and clinical outcome. Thromb Res. 2021;198:34–9.
  • 48. Weikert T, Nesic I, Cyriac J, et al. Towards automated generation of curated datasets in radiology: application of natural language processing to unstructured reports exemplified on CT for pulmonary embolism. Eur J Radiol. 2020;125:108862. 10.1016/j.ejrad.2020.108862.
  • 49. Tayefi M, Ngo P, Chomutare T. Challenges and opportunities beyond structured data in analysis of electronic health records. Wiley Interdiscip Rev. 2021;13(6):e1549. 10.1002/wics.1549.
  • 50. Indexed S. Conversion of unstructured data to structured data with a profile. Int J Mech Eng Technol. 2017;8:623–30.
  • 51. Schiaffino S, Codari M, Cozzi A, et al. Machine learning to predict in-hospital mortality in covid-19 patients using computed tomography-derived pulmonary and vascular features. J Pers Med. 2021. 10.3390/jpm11060501.
  • 52. Datia N. Data mining algorithms for computer aided detection of pulmonary embolism: a comparative study. 2014
  • 53. Nargesian F, Samulowitz H, Khurana U, et al. Learning feature engineering for classification. Int Jt Conf Artif Intell 2017. 10.24963/ijcai.2017/352
  • 54. Card QR UpToDate ® Advanced
  • 55. Harrison TR, Resnick WR. Harrison’s principles of internal medicine. 618. 2022
  • 56. Watson KL. Medical microbiology. 2. 1978
  • 57. Shang Z. Use of Delphi in health sciences research: a narrative review. Medicine. 2023. 10.1097/MD.0000000000032829.
  • 58. Chicco D, Oneto L, Tavazzi E. Eleven quick tips for data cleaning and feature engineering. PLoS Comput Biol. 2022;18:1–21. 10.1371/journal.pcbi.1010718.
  • 59. Erjavac I, Kalafatovic D, Mau G. Artificial intelligence in the life sciences coupled encoding methods for antimicrobial peptide prediction: how sensitive is a highly accurate model? Artif Intell Life Sci. 2022. 10.1016/j.ailsci.2022.100034.
  • 60. Sahoo SS, Kobow K, Zhang J, et al. Ontology-based feature engineering in machine learning workflows for heterogeneous epilepsy patient records. Sci Rep. 2022;12:1–11. 10.1038/s41598-022-23101-3.
  • 61. Ebinger J, Wells M, Ouyang D, et al. A machine learning algorithm predicts duration of hospitalization in COVID-19 patients. Intell Med. 2021;5:100035. 10.1016/j.ibmed.2021.100035.
  • 62. Andres M, Amell N, Awais M, et al. MethodsX attribute value extraction mechanism of constructed wetlands information. MethodsX. 2019;6:1054–67. 10.1016/j.mex.2019.04.017.
  • 63. Jakobsen JC, Gluud C, Wetterslev J, Winkel P. When and how should multiple imputation be used for handling missing data in randomised clinical trials - a practical guide with flowcharts. BMC Med Res Methodol. 2017. 10.1186/s12874-017-0442-1.
  • 64. Ke G, Meng Q, Finley T, et al. LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30:3147–55.
  • 65. Liang W, Luo S, Zhao G, Wu H. Predicting hard rock pillar stability using GBDT, XGBoost, and LightGBM algorithms. Mathematics. 2020;8:1–17. 10.3390/MATH8050765.
  • 66. Fang X, Gao H, Wu J. Prediction of extubation failure for intensive care unit patients using light gradient boosting machine. IEEE Access. 2019;7:150960–8. 10.1109/ACCESS.2019.2946980.
  • 67. Yu B. Fertility-LightGBM: a fertility-related protein prediction model by multi-information fusion and light gradient boosting machine. Biomed Signal Process Control. 2020;68:1–17.
  • 68. Tariq A, Celi LA, Newsome JM, et al. Patient-specific COVID-19 resource utilization prediction using fusion AI model. NPJ Digit Med. 2021. 10.1038/s41746-021-00461-0.
  • 69. Fayed HA, Atiya AF. Speed up grid-search for parameter selection of support vector machines. Appl Soft Comput J. 2019;80:202–10. 10.1016/j.asoc.2019.03.037.
  • 70. Darapureddy N, Karatapu N, Battula TK. Research of machine learning algorithms using K-fold cross validation. Int J Eng Adv Technol. 2019. 10.35940/ijeat.F1043.0886S19.
  • 71. Grüning M, Kropf S. A ridge classification method for high-dimensional observations. Data Inf Anal Knowl Eng. 2006. 10.1007/3-540-31314-1_84.
  • 72. Hancock JT, Khoshgoftaar TM. CatBoost for big data: an interdisciplinary review. J Big Data. 2020. 10.1186/s40537-020-00369-8.
  • 73. Moreno-Ibarra MA, Villuendas-Rey Y, Lytras MD, et al. Classification of diseases using machine learning algorithms: a comparative study. Mathematics. 2021;9:1–21. 10.3390/math9151817.
  • 74. Zhang C, Ding Y, Peng Q. Who determines United States Healthcare out-of-pocket costs? Factor ranking and selection using ensemble learning. Heal Inf Sci Syst. 2021. 10.1007/s13755-021-00153-9.
  • 75. Zhang NJ, Rameau P, Julemis M, et al. Automated pulmonary embolism risk assessment using the wells criteria: validation study. JMIR Formative Res. 2022;6:1–9. 10.2196/32230.
  • 76. Case-study E, Banerjee I, Ph D, et al. Prediction of imaging outcomes from electronic health records : pulmonary prediction of imaging outcomes from electronic health records: pulmonary embolism case-study. In AMIA, 3–5. 2019
  • 77. van Es N, Kraaijpoel N, Klok FA, et al. The original and simplified Wells rules and age-adjusted D-dimer testing to rule out pulmonary embolism: an individual patient data meta-analysis. J Thromb Haemost. 2017;15:678–84. 10.1111/jth.13630.
  • 78. Simon MA, Tan C, Hilden P, et al. Effectiveness of clinical decision tools in predicting pulmonary embolism. Pulm Med. 2021;2021:1–5.
  • 79. Elliott CG. Evaluation of suspected pulmonary embolism in pregnancy. J Thorac Imaging. 2012;27:3–4. 10.1097/RTI.0b013e31823ba521.
  • 80. Zhao F, Zheng L, Shan F, et al. Evaluation of pulmonary ventilation in COVID-19 patients using oxygen-enhanced three-dimensional ultrashort echo time MRI: a preliminary study. Clin Radiol. 2021;76:391.e33-391.e41. 10.1016/j.crad.2021.02.008.
  • 81. Waring J, Lindvall C, Umeton R. Automated machine learning: review of the state-of-the-art and opportunities for healthcare. Artif Intell Med. 2020;104:101822. 10.1016/j.artmed.2020.101822.
  • 82. Hong S, Lynn HS. Accuracy of random-forest-based imputation of missing data in the presence interaction. BMC Med Res Methodol. 2020;1:1–12.

Articles from Health Information Science and Systems are provided here courtesy of Springer
