Abstract
Background
Although the use of unstructured data extracted from medical records is widely advocated as important for enhancing machine learning models, few studies have evaluated whether such enhancement actually occurs. A retrospective, head-to-head comparative study was conducted to evaluate machine learning models for in-hospital mortality prediction, assessing and quantifying the potential performance improvement resulting from the inclusion of unstructured data.
Methods
Hospitalizations of patients with a confirmed COVID-19 diagnosis at a tertiary teaching hospital specialized in emergency care were selected (n = 844). For the models with structured data, 21 variables were selected from laboratory tests and patient monitoring. For the hybrid models, an additional 21 clinical assertions (e.g., “has_symptom affirmed dyspnea”) were included. Six models with the best discriminative performance out of 11 trained and validated were selected for the testing phase. The most representative variables were evaluated using an explainable artificial intelligence model.
Results
The random forest model demonstrated the highest performance, achieving an area under the receiver operating characteristic curve (AUC ROC) of 0.9260, an increase from 0.9170 when using only structured data. The inclusion of unstructured data also improved sensitivity from 0.8108 to 0.8378, while specificity was maintained at 0.8667. However, these performance improvements were not statistically significantly different from those of models using only structured data.
Conclusion
The study concluded that the inclusion of unstructured data did not increase the predictive power of machine learning models for COVID-19 mortality. It was also determined that human involvement is crucial for implementation, specifically for validating natural language processing (NLP) outputs and tailoring the selection of unstructured features, given the inherent challenges in processing such data.
Clinical trial number
Not applicable.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12911-025-03178-2.
Keywords: Mortality prediction, Machine learning, Natural language processing, COVID-19, Electronic medical records
Introduction
Natural Language Processing (NLP) has been extensively applied in health for various tasks, including clinical dialogue [1], text summarization [2], text classification [3, 4], question answering [5], anonymization [6], document retrieval [7], topic modeling [8], and the generation of medical knowledge graphs [9].
These tasks serve diverse objectives, such as enhancing medical knowledge [10], pediatric emergency classification [11], extraction of functional poststroke outcomes [12], suicide prevention [13], depression identification [14], cancer detection [15], sepsis prediction in the emergency department [16], prediction of postoperative complications [17], analysis of information quality [18], and mortality prediction [19]. All hold promise of significant transformation in medical care [20]. Nevertheless, a gap exists in the application of more sophisticated methods [21, 22]. Another gap is the prioritization of structured data from electronic health records (EHRs) [23, 24].
Although natural language processing (NLP) is widely applied to numerous health tasks, the subsequent integration and comprehensive evaluation of predictive models that incorporate NLP-derived data remain limited. Bedi et al., in a review of the use of large language models (LLMs) in health, found no studies that evaluated predictive models among 519 articles [10]. The underrepresentation of NLP-based predictive models persists across health subdomains, including cancer [25] and mortality prediction as an outcome [26]. Despite the application of both traditional methods, such as logistic regression [27], and novel approaches, such as attention mechanisms [23], few studies have systematically conducted direct comparisons to evaluate the incremental impact of incorporating NLP-extracted unstructured data into these predictive frameworks.
Indeed, predictive models for mortality that exclusively use structured data in machine learning frameworks are more common [28–31], despite the prevailing emphasis on the importance of using data from EHRs [9]. The difficulty in extracting data from EHRs partially explains this gap in their application for predictive modeling [32].
In this context, studies evaluating the impact of using unstructured data in predictive models are scarce. One identified study utilized data from the Medical Information Mart for Intensive Care, third edition (MIMIC-III), employing clinical notes for predicting in-hospital mortality [19]. The authors demonstrated that the inclusion of unstructured data enhanced the model’s discriminative power when comparing different modeling approaches.
This study aimed to address this gap by evaluating the impact of incorporating unstructured data into a mortality prediction model. We conducted a head-to-head comparative analysis of identical machine learning models, with the 'intervention' being the inclusion of unstructured data extracted from physician-authored clinical notes using a few-shot learning method.
The focus of this study is on COVID-19 infection, as it continues to pose a significant public health challenge. The disease can become chronic, resulting in significant health and socioeconomic repercussions [33, 34]. Patients with multiple comorbidities face an elevated risk of mortality from COVID-19 infections [35, 36].
We hypothesize that models incorporating unstructured data will exhibit improved discriminative and sensitivity power for in-hospital mortality. We aim to contribute to the advancement of using unstructured data in mortality prediction models by providing essential metrics for comparison.
This study makes the following contributions:
This study provides a direct, head-to-head, and quantitative comparison of the incremental predictive value of unstructured clinical data extracted via NLP from physician-authored notes for predicting in-hospital mortality in patients with COVID-19.
Feature importance analysis using explainable artificial intelligence (XAI), specifically SHAP, is presented to identify and compare the key predictive elements arising from both structured and unstructured methods, offering insights into the factors driving predictions in models.
It evaluates the performance of multiple machine learning models across both structured-only and unstructured data scenarios, thus elucidating how different models respond to the incorporation of textual information.
Methods
The study reporting follows the TRIPOD + AI statement [37]. This updated guideline provides comprehensive recommendations to ensure the transparent and complete reporting of clinical prediction models, particularly those developed using regression or machine learning techniques.
Study population
This study was a retrospective, observational, analytical, single-centre study conducted by analysing routine care data from EHRs. The study population comprised individuals treated at the Emergency Department of a tertiary hospital. This 190-bed institution, which specializes in emergency care, is part of a teaching hospital complex and is exclusively dedicated to referred emergency cases.
The study population consisted of individuals with a laboratory-confirmed diagnosis of COVID-19 who were admitted to the Emergency Department of the hospital. Confirmation was made either by a reverse transcription-polymerase chain reaction (RT-PCR) test for SARS-CoV-2 or by a rapid immunochromatographic antigen test for the virus, performed within one day of hospital admission. The study considered admissions occurring between March 1, 2020, and February 21, 2024. No exclusion criteria were applied. All patients received the same primary diagnosis (COVID-19) and were managed according to standardized hospital clinical protocols, with treatment adjustments made primarily in response to disease severity.
Outcome definition
The primary outcome was all-cause in-hospital mortality, defined as death occurring during the index hospital admission for COVID-19.
Data collection
The date of death was used as a temporal reference; the variables used in this study were recorded before the death was registered. To extract structured variables, we analysed the laboratory tests performed on these patients. A set of 16 candidate laboratory tests was selected on the basis of being the most frequently performed and having less than 15% missing values. This choice was made to ensure data availability and reduce bias from extensive missing data. For urine and blood cultures, a missing value was taken to mean that the test was not performed. We selected these two tests because of their potential as markers of case severity [38]. We also selected monitoring data entered into the EHRs by nursing professionals, comprising an additional five parameters. To account for the high variability of laboratory and monitoring data during extended hospital stays, the median value of each variable over the entire admission period was used. This method was chosen to provide a stable, representative measure that mitigates outliers and missing values. It was preferred over more complex metrics for its simplicity and clear clinical interpretability, ensuring practicality for health providers.
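For illustration, the per-admission median aggregation could be implemented as in the following minimal sketch; the long-format layout and the column names (seq_atendimento, variable, value) are illustrative assumptions, not the actual hospital schema.

```python
import pandas as pd

# Toy long-format laboratory/monitoring records: one row per measurement
labs = pd.DataFrame({
    "seq_atendimento": [1, 1, 1, 2, 2],
    "variable": ["creatinine", "creatinine", "urea", "creatinine", "urea"],
    "value": [0.9, 1.1, 48.0, 1.6, 110.0],
})

# One row per admission, one column per variable, median over the whole stay
medians = labs.pivot_table(index="seq_atendimento", columns="variable",
                           values="value", aggfunc="median")
print(medians)
```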
The history of the present illness note is a section where physicians enter a synopsis of the current history, personal background, current medications, and a summary of examinations and procedures of particular interest for that admission. This record is written in Brazilian Portuguese and is updated daily with information pertinent to daily occurrences. The first history of the present illness recorded by a physician during that specific admission was utilized for this study; this section was selected as the source of the unstructured variables, with the aim of assessing the predictive power of this initial record. The texts were processed by removing special characters, converting all words to lowercase, and expanding clinical term abbreviations using a purpose-built dictionary. The texts were then segmented into sentences using the PUNKT package from the Natural Language Toolkit (NLTK) library for subsequent natural language processing (NLP) extraction, as described below.
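A minimal sketch of this preprocessing step is shown below; the abbreviation entries are illustrative placeholders for the purpose-built dictionary used in the study.

```python
import re
import nltk

nltk.download("punkt", quiet=True)      # PUNKT sentence tokenizer models
nltk.download("punkt_tab", quiet=True)  # required by newer NLTK versions

# Illustrative entries only; the study used a purpose-built dictionary
ABBREV = {"dpoc": "doenca pulmonar obstrutiva cronica",
          "fc": "frequencia cardiaca"}

def preprocess(note: str) -> list[str]:
    text = re.sub(r"[^\w\s.]", " ", note.lower())  # drop special characters
    # Expand abbreviations word by word, preserving sentence-final periods
    words = []
    for w in text.split():
        stem, dot = w.rstrip("."), "." if w.endswith(".") else ""
        words.append(ABBREV.get(stem, stem) + dot)
    # Portuguese PUNKT model for sentence segmentation
    return nltk.tokenize.sent_tokenize(" ".join(words), language="portuguese")

print(preprocess("Paciente com DPOC. Nega febre."))
```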
NLP model for feature extraction
The objective of this process was to extract clinical assertions (CAs) from the processed history of the present illness notes. These CAs aimed to capture clinical entities (as objects, e.g., fever) together with their associated relationships (e.g., denies_symptom) and statuses (e.g., denied). Data extraction was performed using the Llama-3.1-8B-Instruct model. The selection of a specific 4-bit quantized version was motivated by findings from an iterative development phase, during which various LLMs, tokenization approaches, and prompt configurations were explored, revealing significant computational and memory constraints. The quantized model was therefore adopted to optimize memory usage and ensure the secure processing of sensitive medical data in a controlled environment. The effectiveness of the data extraction pipeline was validated by a physician through manual assessment.
A few-shot learning paradigm was employed, guided by a specifically engineered prompt. This prompt was developed to clearly define the extraction task and to specify the desired output structure, comprising a subject (implicitly the patient), a relation type (e.g., has_symptom, denies_symptom), an object (the clinical entity of interest, such as fever or back pain), a status qualifier (e.g., affirmed, denied, suspected), and the evidential text snippet (full prompt details are available in the supplementary material). A maximum of 512 tokens was allocated for each extraction attempt.
The outputs, typically multiple CAs (e.g., has_symptom affirmed: dyspnea) per hospitalization, were stored in pickle files.
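The extraction step could be sketched as follows, assuming the Hugging Face transformers and bitsandbytes libraries for 4-bit loading; the prompt shown is a simplified placeholder for the full few-shot prompt in the supplementary material, and using plain completion rather than the model's chat template is a further simplification.

```python
import pickle
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit quantization at load time, to fit a constrained local environment
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, quantization_config=bnb,
                                             device_map="auto")

# Placeholder few-shot prompt; the actual prompt is in the supplementary material
PROMPT = """Extract clinical assertions as (relation, status, object, evidence).
Example: "nega febre" -> (denies_symptom, denied, fever, "nega febre")
Sentence: {sentence}
Assertions:"""

def extract(sentence: str) -> str:
    inputs = tok(PROMPT.format(sentence=sentence), return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

assertions = [extract(s) for s in ["paciente refere dispneia", "nega febre"]]
with open("clinical_assertions.pkl", "wb") as f:  # one file per hospitalization
    pickle.dump(assertions, f)
```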
Data processing
Structured data
A qSOFA variable was created to represent the Quick Sequential Organ Failure Assessment clinical criteria for organ failure [39]. This dichotomous variable was assigned a value of one if at least two of the following criteria were met: respiratory rate ≥ 22 breaths per minute, systolic blood pressure ≤ 100 mmHg, and Glasgow Coma Scale score ≤ 13. This calculation was performed only for patients aged 18 years or older.
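A minimal sketch of this derivation, with illustrative column names, follows.

```python
import pandas as pd

# Toy monitoring data; column names are illustrative assumptions
df = pd.DataFrame({
    "age": [70, 45, 16],
    "respiratory_rate": [24, 18, 30],
    "systolic_bp": [95, 120, 98],
    "glasgow_coma_scale": [14, 15, 12],
})

criteria = (
    (df["respiratory_rate"] >= 22).astype(int)
    + (df["systolic_bp"] <= 100).astype(int)
    + (df["glasgow_coma_scale"] <= 13).astype(int)
)
# Dichotomous flag: 1 if at least two criteria are met; adults only
# (how minors were coded is not stated, so zero is assumed here)
df["qsofa"] = ((criteria >= 2) & (df["age"] >= 18)).astype(int)
print(df)
```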
For the selected candidate variables (16 laboratory tests and five monitoring parameters), median imputation was employed to address the remaining missing values. We chose this method because the proportion of data requiring imputation for each of these variables was low (ranging from 0 to 15%) and because median imputation is a conservative approach that preserves the central tendency of the observed data. The impact of this approach on univariate distributions was subsequently validated using the Kolmogorov-Smirnov and Mann-Whitney U tests. Following imputation, these tests revealed no statistically significant distributional alterations, except for three of the 19 variables tested. These three variables - glutamic-pyruvic transaminase, prothrombin time, and activated partial thromboplastin time - were excluded (Table S1, Supplementary material).
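The imputation and its distributional check could be sketched as follows, on synthetic data standing in for one laboratory variable.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp, mannwhitneyu

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(1.0, 0.3, 500))        # toy lab variable
x[rng.choice(500, 40, replace=False)] = np.nan  # ~8% missing, within 0-15%

observed = x.dropna()
imputed = x.fillna(observed.median())           # median imputation

# Compare pre- vs. post-imputation univariate distributions
print(ks_2samp(observed, imputed))
print(mannwhitneyu(observed, imputed))
```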
Outlier treatment was performed using the z-score method, with outliers defined as values exceeding three standard deviations from the mean. A physician reviewed the outliers for each measure to ascertain their potential clinical veracity. Notably, several of the outlying tests had been repeated, confirming the results. No laboratory tests were excluded on the basis of the outlier review. The same outlier review process was applied to the monitoring data.
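A sketch of the z-score flagging on toy data:

```python
import numpy as np
import pandas as pd

# 19 plausible values plus one extreme entry (toy data)
values = pd.Series([1.0] * 19 + [10.0])
z = (values - values.mean()) / values.std()

# Flag |z| > 3 for physician review rather than automatic exclusion
print(values[np.abs(z) > 3])
```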
The structured data processing resulted in a selection of 21 structured parameters, in addition to gender and age. No further transformation was performed on these data. This process resulted in a dataset containing only structured information (structured dataset).
Unstructured data
Here, we describe the processing of the extracted clinical assertions. Twenty-nine different relation types (e.g., has_symptom) accounted for 95% of all occurrences. These relations were manually reviewed to identify and consolidate similar ones (e.g., worsening_symptom and symptom_worsens were mapped to has_symptom). In a second manual review, similar words were consolidated (e.g., respiratorio, respiratorios, and respiratoria to respiratorio, equivalent in English to respiratory).
Following this consolidation, 15 relation types accounted for 95% of all relations and were selected for a third manual review, in which similar CAs were consolidated (e.g., the Brazilian Portuguese equivalents of history_of affirmed diabetes, history_of affirmed diabetes mellitus, and history_of affirmed diabetes type 2 were consolidated into history_of affirmed diabetes mellitus). These consolidations resulted in a total of 22,939 unique CAs, which arose from the different object types within the CAs (e.g., has_symptom affirmed pain, has_history affirmed vascular arterial disease, and deny_symptom denied fever).
To reduce the number of CAs and use them as features in the predictive models, an analysis employing a chi-squared test was applied to identify CAs that had significant potential to discriminate between deaths and discharges.
CAs were selected as features if they met the following criteria: frequency > 5, chi-squared test p-value ≤ 0.01, odds ratio > 3, and lower bound of the odds ratio confidence interval > 1. This selection resulted in 21 CAs. The feature selection process was principally designed to reduce noise, increase model interpretability by focusing on highly discriminative predictors, and mitigate the risk of overfitting, particularly given the sparse nature of many unstructured features. This process was laborious owing to the characteristics of the underlying language model.
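The selection criteria could be implemented as in the following sketch; the Wald-type confidence interval for the odds ratio, and reading the CI criterion as its lower bound exceeding 1, are our assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency

def ca_passes(a, b, c, d, min_freq=5, alpha=0.01, min_or=3.0):
    """2x2 table: a = CA & death, b = CA & discharge,
    c = no CA & death, d = no CA & discharge."""
    if (a + b) <= min_freq:                  # frequency criterion
        return False
    _, p, _, _ = chi2_contingency([[a, b], [c, d]])
    odds_ratio = (a * d) / (b * c)
    se = np.sqrt(1/a + 1/b + 1/c + 1/d)      # standard error of log(OR)
    ci_low = np.exp(np.log(odds_ratio) - 1.96 * se)
    return p <= alpha and odds_ratio > min_or and ci_low > 1

# e.g., the furosemide CA: 26 of 244 deaths, 16 of 600 discharges (Table 2)
print(ca_passes(26, 16, 244 - 26, 600 - 16))
```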
These selected CAs were transformed by one-hot encoding, where a value of 1 indicated the presence of the CA for the hospitalization and 0 indicated its absence. These data were merged with the structured dataset using the admission sequence number (seq_atendimento) as the index. All structured dataset records were retained (left merge), and the selected unstructured features were added. Where information for an unstructured feature was absent, it was set to zero, indicating that the CA was not present for that patient. This process resulted in a dataset containing both structured and unstructured information (the hybrid dataset), which was used to construct the hybrid models.
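A minimal sketch of the one-hot encoding and left merge, with toy admissions:

```python
import pandas as pd

structured = pd.DataFrame({"seq_atendimento": [1, 2, 3],
                           "age": [65, 53, 70]}).set_index("seq_atendimento")

# Selected CAs per admission (toy); binarized via a crosstab-style pivot
cas = pd.DataFrame({"seq_atendimento": [1, 1, 3],
                    "ca": ["uses_medication_affirmed_furosemide",
                           "has_symptom_affirmed_hypotension",
                           "uses_medication_affirmed_furosemide"]})
onehot = pd.crosstab(cas["seq_atendimento"], cas["ca"]).clip(upper=1)

# Left merge keeps every structured record; absent CAs become 0
hybrid = structured.join(onehot, how="left").fillna(0)
print(hybrid)
```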
Mortality prediction model
The modeling process was applied equally to each of the datasets to facilitate model comparisons. Figure 1 summarizes the process.
Fig. 1.
Overview of the study process from selected patients (hospitalizations) to modeling
For mortality prediction, patient transfers were considered hospital discharges. We made this decision because the primary objective was to predict in-hospital mortality, outcome information for transferred patients was unavailable, and, according to the hospital’s operational profile, patients may be transferred to support units for completion of treatment.
The resulting datasets, structured and hybrid, were divided into a training dataset and a test dataset (15% of the data) using stratified sampling. To prevent data leakage, we used the test dataset only at the final stage to evaluate the performance of the selected best models without any prior processing.
The training dataset was balanced to address class imbalance, as the mortality class represented 25% of the total cases. Random undersampling of the majority class (discharge) was employed. We chose this method to reduce the risk of model overfitting [40] and to increase the model’s sensitivity to the minority class (mortality), compelling models to learn patterns more effectively on the critical outcome.
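These two steps could look like the following sketch; the use of imbalanced-learn's RandomUnderSampler is an assumption, as the implementation is not named in the text.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

# Toy stand-ins for the feature matrix and mortality labels (25% deaths)
X = pd.DataFrame({"urea": [48, 105, 60, 110, 45, 90, 50, 100]})
y = pd.Series([0, 1, 0, 0, 0, 1, 0, 0])

# 15% held-out test set, stratified on the outcome
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)

# Random undersampling of the majority (discharge) class, training data only
X_bal, y_bal = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
print(pd.Series(y_bal).value_counts())
```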
This balanced training dataset was used for the initial screening of the following machine learning models: logistic regression, random forest, gradient boosting, AdaBoost classifier, extra trees classifier, support vector machine, k-neighbors classifier, Gaussian naive Bayes, decision tree, XGB classifier, and LGBM classifier. The rationale for exploring this diverse set of algorithms was to evaluate a range of linear and nonlinear modeling approaches. A broad selection of models was chosen with the primary objective of comprehensively capturing diverse underlying data patterns and identifying the most suitable model for mortality prediction without a priori selection bias. This approach facilitated a rigorous head-to-head comparison to specifically assess how different model architectures respond to the incorporation of unstructured data.
For this initial algorithm screening, a 10-fold cross-validation procedure was applied directly to the balanced training dataset. Within this process, each algorithm was trained and evaluated ten times (once per fold), and the average area under the receiver operating characteristic curve (AUC ROC) across these folds was calculated. The average cross-validated AUC ROC served as the primary metric to rank the algorithms.
Six algorithms that demonstrated superior AUC ROC values (greater than 0.90) during the cross-validated screening phase were selected for subsequent, more intensive training and hyperparameter optimization. This involved a grid search approach, with specific hyperparameter grids tailored to each model type. The tuning process was again conducted using 10-fold cross-validation on the training dataset, with the AUC ROC as the primary metric for selecting the optimal hyperparameter combinations.
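A sketch of the screening and tuning loop for one candidate algorithm, on synthetic data and with an assumed (illustrative) hyperparameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the balanced training set
X_bal, y_bal = make_classification(n_samples=300, n_features=23, random_state=42)

# Screening: mean 10-fold cross-validated AUC ROC for a candidate algorithm
rf = RandomForestClassifier(random_state=42)
print(cross_val_score(rf, X_bal, y_bal, cv=10, scoring="roc_auc").mean())

# Tuning: grid search with the same 10-fold CV (actual grids were model-specific)
grid = GridSearchCV(rf,
                    param_grid={"n_estimators": [200, 500],
                                "max_depth": [None, 10]},
                    cv=10, scoring="roc_auc")
grid.fit(X_bal, y_bal)
print(grid.best_params_, grid.best_score_)
```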
The finalized, tuned six models were then evaluated on the held-out test dataset. Model validation metrics on the test dataset included AUC ROC, sensitivity, specificity, precision, F1-score, accuracy, and Brier score loss. The metrics are reported with 95% confidence intervals (CI). The AUC ROC results for the structured versus hybrid models were compared using the DeLong method. McNemar's test was employed to assess the comparative performance in overall classification. A significance level of 5% was applied to both statistical tests.
For each model, in addition to these metrics, ROC curves, confusion matrices, and calibration curves were plotted based on the test set performance. Each model generated a predicted probability of in-hospital mortality. While AUC ROC and Brier score loss were assessed directly from these probabilities, classification metrics such as sensitivity and specificity were calculated by applying a standard 0.5 decision threshold, as no specific clinical actionability threshold was predefined for this exploratory study.
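The test-set evaluation and the McNemar comparison could be sketched as below on toy paired predictions; DeLong's test has no standard SciPy implementation and is omitted here, although third-party implementations exist.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, confusion_matrix, roc_auc_score
from statsmodels.stats.contingency_tables import mcnemar

# Toy paired test-set predictions from a structured and a hybrid model
y_true = np.array([1, 0, 1, 0, 1, 0, 0, 1])
p_struct = np.array([0.8, 0.3, 0.4, 0.2, 0.9, 0.6, 0.1, 0.7])
p_hybrid = np.array([0.9, 0.2, 0.6, 0.3, 0.8, 0.4, 0.2, 0.8])

for p in (p_struct, p_hybrid):
    yhat = (p >= 0.5).astype(int)            # standard 0.5 decision threshold
    tn, fp, fn, tp = confusion_matrix(y_true, yhat).ravel()
    print(roc_auc_score(y_true, p),          # AUC ROC
          tp / (tp + fn), tn / (tn + fp),    # sensitivity, specificity
          brier_score_loss(y_true, p))       # Brier score loss

# McNemar's test on paired correct/incorrect classifications
ok_s = (p_struct >= 0.5).astype(int) == y_true
ok_h = (p_hybrid >= 0.5).astype(int) == y_true
table = [[np.sum(ok_s & ok_h), np.sum(ok_s & ~ok_h)],
         [np.sum(~ok_s & ok_h), np.sum(~ok_s & ~ok_h)]]
print(mcnemar(table, exact=True).pvalue)
```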
Finally, SHAP (Shapley Additive Explanations) was applied to assess the feature importance and explain the predictions of the best-performing machine learning model on the test dataset.
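A minimal SHAP sketch for a tree ensemble on synthetic data follows; note that the layout of the returned SHAP values varies across shap versions.

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the test dataset
X, y = make_classification(n_samples=300, n_features=23, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X, y)

explainer = shap.TreeExplainer(model)   # exact SHAP values for tree ensembles
shap_values = explainer.shap_values(X)
# Older shap returns a list per class; newer versions return a 3D array
sv = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]

shap.summary_plot(sv, X, plot_type="bar")  # mean |SHAP| ranking (cf. Fig. 4)
shap.summary_plot(sv, X)                   # beeswarm plot (cf. Fig. 5)
```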
This mortality prediction modeling process resulted in six structured and six hybrid models.
The following software was used: Tableau Desktop version 2024.3.4 for data visualization, Python version 3.11.5 for data processing and modeling, and IBM SPSS version 29.0.2.0. The final model hyperparameter details are presented in Table S4 in the Supplementary material.
Ethical considerations
This study was approved by the Hospital das Clínicas Research Ethics committee, Ribeirão Preto Faculty of Medicine (CAAE:75360523.7.0000.5440) and was conducted in accordance with the principles of the Declaration of Helsinki. The need for individual informed consent was waived by the ethics committee because of the retrospective nature of the study and the use of de-identified data.
Results
A total of 844 hospitalizations were included in the study, with 600 (71.1%) patients discharged and 244 (28.9%) with in-hospital mortality (Fig. 2). There were 140 hospitalizations of patients under 18 years of age, with two deaths. The baseline characteristics of the patients, stratified by outcome, are presented in Tables 1 and 2.
Fig. 2.
Flowchart of study patients diagnosed with COVID-19
Table 1.
Characteristics of the variables used for training, validating, and testing the structured dataset for the prediction of COVID-19 mortality. Continuous variables (median), categorical variables (n)
| Variable | Discharge (n = 600) | Death (n = 244) | P value |
|---|---|---|---|
| Continuous | |||
| Median (interquartile range) a | |||
| Age | 53 (24–65) | 65 (57–74) | < 0.001 |
| C-reactive protein | 5.10 (2.50–8.75) | 9.65 (6.34–16.25) | < 0.001 |
| Creatinine | 0.92 (0.67–1.08) | 1.62 (1.06–2.73) | < 0.001 |
| Lactate | 2.05 (1.60–2.25) | 2.30 (1.90–2.85) | < 0.001 |
| Glutamic-oxaloacetic transaminase | 44.00 (31.00–53.00) | 57.45 (37.89–93.05) | < 0.001 |
| Urea | 48.41 (31.84–66.16) | 104.53 (74.74-135.76) | < 0.001 |
| Blood gas analysis BE | -2.12 (-3.36-0.20) | -2.90 (-6.81–1.30) | < 0.001 |
| Blood gas analysis HCO3 | 22.90 (21.20-24.85) | 22.62 (19.27–24.60) | 0.026 |
| Blood gas analysis PCO2 | 39.65 (34.39-42.00) | 42.30 (36.74–49.67) | < 0.001 |
| Blood gas analysis PH | 7.38 (7.36–7.42) | 7.34 (7.26–7.39) | < 0.001 |
| Blood gas analysis PO2 | 73.90 (66.79–83.50) | 73.80 (66.04–82.30) | 0.436 |
| Blood gas analysis SaO2 | 93.95 (92.55–95.80) | 93.38 (90.75-95.00) | < 0.001 |
| Heart rate | 88 (79–103) | 90 (82–101) | 0.226 |
| Respiratory rate | 20.00 (20.00–24.00) | 25.00 (22.00–30.00) | < 0.001 |
| Diastolic blood pressure | 70.50 (66.00–78.00) | 66.00 (61.00-71.62) | < 0.001 |
| Systolic blood pressure | 120.00 (111.00-129.00) | 119.00 (110.00-127.12) | 0.326 |
| Body temperature | 36.00 (36.00–36.00) | 36.00 (36.00–36.00) | 0.293 |
| Categorical N (%) b | |||
| Gender female | 269 (44.8) | 100 (41.0) | 0.344 |
| Blood culture performed | 161 (26.8) | 138 (56.6) | < 0.001 |
| Urine culture performed | 141 (23.5) | 119 (48.8) | < 0.001 |
| qSOFA ≥ 2 | 146 (24.33) | 174 (71.31) | < 0.001 |
The Mann-Whitney U test (a) was performed for continuous variables and the chi-square test (b) for categorical variables, except for cells with fewer than 5 observations, for which Fisher's exact test was performed. Significance level = 5%
Table 2.
Characteristics of the variables used for training, validating, and testing the unstructured dataset for the prediction of COVID-19 mortality. Number of clinical assertions (N)
| Unstructured variables | Discharge (n = 600) | Death (n = 244) | P value |
|---|---|---|---|
| Categorical | |||
| n(%) | |||
| Diagnosis | |||
| to_clarify_diagnosis_to_clarify_ acute_pulmonary_edema | 1 (0.2%) | 7 (2.9%) | < 0.001 |
| to_clarify_diagnosis_to_clarify_shock | 2 (0.3%) | 9 (3.7%) | < 0.001 |
| has_diagnosis_affirmed_heart_failure | 2 (0.3%) | 8 (3.3%) | < 0.001 |
| to_clarify_diagnosis_to_clarify_ chronic_kidney_disease | 3 (0.5%) | 8 (3.3%) | 0.003 |
| to_clarify_diagnosis_to_clarify_copd | 6 (1.0%) | 11 (4.5%) | 0.002 |
| to_clarify_diagnosis_to_clarify_ congestive_heart_failure | 5 (0.8%) | 9 (3.7%) | 0.006 |
| uses_medication_affirmed_sedation | 2 (0.3%) | 8 (3.3%) | < 0.001 |
| uses_medication_affirmed_atorvastatin | 2 (0.3%) | 7 (2.9%) | 0.003 |
| uses_medication_affirmed_ vasoactive_drug | 16 (2.7%) | 39 (16.0%) | < 0.001 |
| uses_medication_affirmed_furosemide | 16 (2.7%) | 26 (10.7%) | < 0.001 |
| uses_medication_affirmed_spironolactone | 6 (1.0%) | 10 (4.1%) | 0.005 |
| uses_medication_affirmed_mask_reservoir | 1 (0.2%) | 9 (3.7%) | < 0.001 |
| Symptom, Procedures and laboratory tests | |||
| history_of_affirmed_angioplasty | 6 (1.0%) | 11 (4.5%) | 0.002 |
| history_of_affirmed_bronchospasm | 8 (1.3%) | 13 (5.3%) | 0.002 |
| has_symptom_affirmed_ hemodynamically_unstable | 8 (1.3%) | 11 (5.7%) | 0.001 |
| has_symptom_affirmed_hypotension | 11 (1.8%) | 18 (7.4%) | < 0.001 |
| performed_procedure_affirmed_ orotracheal_intubation | 60 (10.0%) | 70 (28.7%) | < 0.001 |
| performed_procedure_affirmed_ start_mechanical_ventilation | 52 (8.7%) | 55 (22.5%) | < 0.001 |
| performed_procedure_affirmed_ cardiorespiratory_arrest | 1 (0.2%) | 6 (2.5%) | 0.003 |
| has_exam_finding_affirmed_clcr (3) | 12 (2.0%) | 15 (6.1%) | 0.004 |
| Demographics | |||
| has_age_affirmed_73_years | 4 (0.7%) | 8 (3.3%) | 0.007 |
Fisher’s Exact test was performed for all variables. (1) Chronic Obstructive Pulmonary Disease. (2) the reservoir mask object was incorrectly assigned to the medication relationship. (3) creatinine clearance
Significant differences were observed in the baseline continuous variables. Patients who died were notably older (median age: 65 vs. 53 years) and presented higher median levels of C-reactive protein (9.65 vs. 5.10), creatinine (1.62 vs. 0.92), glutamic-oxaloacetic transaminase (57.45 vs. 44.00), and urea (104.53 vs. 48.41) than those who were discharged. Arterial blood gas analysis revealed greater physiological derangement in the death group, with a significantly lower pH (median 7.34 vs. 7.38) and significantly higher PCO2 (median 42.30 vs. 39.65 mmHg). These findings were accompanied by evidence of more pronounced metabolic acidosis in the death group than in the discharged group, as indicated by a more negative base excess (BE) (median −2.90 vs. −2.12 mEq/L) and lower bicarbonate (HCO3) levels (median 22.62 vs. 22.90 mEq/L) (Table 1).
Among the categorical structured variables, a qSOFA score of 2 or higher was more prevalent in patients who died (71.31% vs. 24.33%). Blood cultures (56.6% vs. 26.8%) and urine cultures (48.8% vs. 23.5%) were also significantly more common in the nonsurvivor group. There was no difference in sex between the two groups (Table 1).
Analysis of the unstructured data revealed significant associations with mortality. The documented use of vasoactive drugs (16.0% of deaths vs. 2.7% of discharges) and furosemide (10.7% vs. 2.7%) was significantly greater in patients who died. Suspected diagnoses of chronic obstructive pulmonary disease (4.5% vs. 1.0%), heart failure (3.3% vs. 0.3%), and kidney disease (3.3% vs. 0.5%), as well as procedures such as orotracheal intubation (28.7% vs. 10.0%) and mechanical ventilation (22.5% vs. 8.7%), were likewise more frequent among nonsurvivors (Table 2).
The models trained with structured dataset features as predictors for in-hospital mortality demonstrated high discrimination in the training phase, with the random forest model achieving the highest training AUC ROC (0.9344, 95% CI 0.9111–0.9576), extra trees achieving the highest sensitivity (0.8652, 95% CI 0.8208–0.9097), and K-nearest neighbors (KNN) algorithm achieving the highest specificity (0.8750, 95% CI 0.8194–0.9306) and precision (0.8658, 95% CI 0.8190–0.9127).
For the test dataset, the extra trees model achieved the highest AUC ROC (0.9291, 95% CI 0.8805–0.9716), with a sensitivity of 0.8108 (95% CI 0.6750–0.9334), a specificity of 0.8000 (95% CI 0.7159–0.8778), and a precision of 0.6250 (95% CI 0.4864–0.7551). The highest sensitivity was achieved by XGBoost and LightGBM (0.8378, 95% CI 0.7143–0.9474), and the highest specificity and precision were achieved by random forest and gradient boosting (0.8667, 95% CI 0.7957–0.9348, and 0.7143, 95% CI 0.5750–0.8462, respectively). Among these, the extra trees model presented the lowest Brier score loss (0.1162, 95% CI 0.0900–0.1459), indicating superior calibration. Despite better specificity and precision, the random forest model performed worse in sensitivity (Table 3).
Table 3.
Performance metrics using structured variables to predict COVID-19 mortality
| Models | ROC AUC | Sensitivity (Recall) | Specificity | Precision (PPV) | F1-Score | Brier Score Loss (↓) | Accuracy |
|---|---|---|---|---|---|---|---|
| Training | | | | | | | |
| Extra Trees | 0.9244 (0.9036–0.9452) | 0.8652 (0.8208–0.9097) | 0.8062 (0.7299–0.8825) | 0.8264 (0.7649–0.8880) | 0.8412 (0.8123–0.8700) | 0.1183 (0.1070–0.1297) | 0.8355 (0.8028–0.8682) |
| Gradient Boosting | 0.9270 (0.8992–0.9547) | 0.8652 (0.8081–0.9224) | 0.8300 (0.7718–0.8882) | 0.8410 (0.7967–0.8853) | 0.8498 (0.8141–0.8855) | 0.1098 (0.0886–0.1310) | 0.8477 (0.8139–0.8814) |
| Random Forest | 0.9344 (0.9111–0.9576) | 0.8650 (0.7975–0.9325) | 0.8260 (0.7433–0.9086) | 0.8438 (0.7812–0.9064) | 0.8475 (0.8079–0.8872) | 0.1135 (0.1003–0.1267) | 0.8454 (0.8070–0.8837) |
| Support Vector Machine (SVC) | 0.9337 (0.9149–0.9526) | 0.8552 (0.7975–0.9130) | 0.8357 (0.7691–0.9024) | 0.8482 (0.7929–0.9035) | 0.8462 (0.8197–0.8726) | 0.1081 (0.0940–0.1222) | 0.8454 (0.8190–0.8717) |
| LightGBM | 0.9314 (0.9098–0.9529) | 0.8457 (0.7834–0.9080) | 0.8450 (0.7914–0.8986) | 0.8513 (0.8130–0.8897) | 0.8441 (0.8172–0.8709) | 0.1147 (0.0959–0.1335) | 0.8455 (0.8240–0.8669) |
| XGBoost | 0.9321 (0.9066–0.9577) | 0.8410 (0.7878–0.8941) | 0.8400 (0.7671–0.9129) | 0.8511 (0.7884–0.9137) | 0.8406 (0.8103–0.8709) | 0.1161 (0.0916–0.1407) | 0.8404 (0.8085–0.8724) |
| AdaBoost | 0.9085 (0.8814–0.9356) | 0.8267 (0.7668–0.8865) | 0.8395 (0.7731–0.9059) | 0.8454 (0.7892–0.9016) | 0.8316 (0.7912–0.8720) | 0.1937 (0.1876–0.1998) | 0.8331 (0.7940–0.8722) |
| Logistic regression | 0.9017 (0.8812–0.9223) | 0.8019 (0.7172–0.8866) | 0.8405 (0.7738–0.9071) | 0.8427 (0.7885–0.8969) | 0.8141 (0.7635–0.8647) | 0.1276 (0.1118–0.1434) | 0.8211 (0.7810–0.8613) |
| Decision Tree | 0.8068 (0.7740–0.8396) | 0.7829 (0.7103–0.8554) | 0.8307 (0.7839–0.8776) | 0.8265 (0.7856–0.8673) | 0.7994 (0.7592–0.8396) | 0.1933 (0.1605–0.2260) | 0.8067 (0.7740–0.8395) |
| K-Nearest Neighbors (KNN) | 0.8999 (0.8729–0.9269) | 0.7686 (0.7241–0.8130) | 0.8750 (0.8194–0.9306) | 0.8658 (0.8190–0.9127) | 0.8108 (0.7870–0.8347) | 0.1309 (0.1157–0.1461) | 0.8211 (0.7990–0.8433) |
| Gaussian Naive Bayes | 0.8780 (0.8446–0.9113) | 0.7395 (0.6947–0.7843) | 0.8162 (0.7560–0.8764) | 0.8080 (0.7565–0.8596) | 0.7688 (0.7397–0.7978) | 0.1815 (0.1590–0.2039) | 0.7777 (0.7475–0.8079) |
| Testing | | | | | | | |
| Support Vector Machine (SVC) | 0.8955 (0.8388–0.9496) | 0.7838 (0.6486–0.9062) | 0.8444 (0.7634–0.9158) | 0.6744 (0.5319–0.8044) | 0.7250 (0.6087–0.8205) | 0.1253 (0.0869–0.1643) | 0.8268 (0.7638–0.8898) |
| Extra Trees | 0.9291 (0.8805–0.9716) | 0.8108 (0.6750–0.9334) | 0.8000 (0.7159–0.8778) | 0.6250 (0.4864–0.7551) | 0.7059 (0.5897–0.8039) | 0.1162 (0.0900–0.1459) | 0.8031 (0.7323–0.8740) |
| Random Forest | 0.9170 (0.8672–0.9636) | 0.8108 (0.6750–0.9334) | 0.8667 (0.7957–0.9348) | 0.7143 (0.5750–0.8462) | 0.7595 (0.6470–0.8537) | 0.1194 (0.0907–0.1475) | 0.8504 (0.7953–0.9134) |
| XGBoost | 0.9018 (0.8398–0.9600) | 0.8378 (0.7143–0.9474) | 0.8444 (0.7674–0.9192) | 0.6889 (0.5455–0.8223) | 0.7561 (0.6440–0.8500) | 0.1272 (0.0818–0.1715) | 0.8425 (0.7795–0.9055) |
| Gradient Boosting | 0.8856 (0.8159–0.9518) | 0.8108 (0.6756–0.9311) | 0.8667 (0.7931–0.9375) | 0.7143 (0.5814–0.8485) | 0.7595 (0.6506–0.8533) | 0.1332 (0.0845–0.1835) | 0.8504 (0.7874–0.9134) |
| LightGBM | 0.9078 (0.8498–0.9615) | 0.8378 (0.7143–0.9474) | 0.8444 (0.7674–0.9140) | 0.6889 (0.5581–0.8223) | 0.7561 (0.6486–0.8511) | 0.1219 (0.0811–0.1627) | 0.8425 (0.7795–0.9055) |
Bold – best value for each metric
Considering only adult hospitalizations (patients aged 18 years or older) and the test dataset, there was an improvement in the AUC ROC for all the models, with gradient boosting performing best (AUC ROC 0.9603). With respect to sensitivity and specificity, we observed an increase in sensitivity and a reduction in specificity for the models; the extra trees model was an exception, with increases in both metrics (0.9722 and 0.8143, respectively) compared with the general model, which considered all hospitalizations. There was an increase in model precision (LightGBM: 0.7949) and a reduction in the Brier score (gradient boosting: 0.0836) (Table S2, Supplementary material).
In the training dataset, among the hybrid models for predicting in-hospital mortality (Table 4), the gradient boosting model had the highest AUC ROC (0.9421, 95% CI 0.9207–0.9634), the highest sensitivity (0.8938, 95% CI 0.8281–0.9595), and the lowest Brier score loss (0.1011, 95% CI 0.0773–0.1248). Gaussian naïve Bayes demonstrated the highest specificity (0.9179, 95% CI 0.8711–0.9647) but the lowest sensitivity (0.4879, 95% CI 0.3938–0.5819).
Table 4.
Performance metrics of hybrid model (structured and unstructured variables) to predict COVID-19 mortality
| Models | ROC AUC | Sensitivity (Recall) | Specificity | Precision (PPV) | F1-Score | Brier Score Loss (↓) | Accuracy |
|---|---|---|---|---|---|---|---|
| Training | | | | | | | |
| Extra Trees | 0.9380 (0.9196–0.9565) | 0.8507 (0.7818–0.9197) | 0.8457 (0.7710–0.9204) | 0.8551 (0.7944–0.9158) | 0.8473 (0.8058–0.8887) | 0.1122 (0.1017–0.1227) | 0.8480 (0.8084–0.8875) |
| Gradient Boosting | 0.9421 (0.9207–0.9634) | 0.8938 (0.8281–0.9595) | 0.8448 (0.7797–0.9098) | 0.8595 (0.8108–0.9081) | 0.8714 (0.8363–0.9065) | 0.1011 (0.0773–0.1248) | 0.8695 (0.8367–0.9024) |
| Random Forest | 0.9338 (0.9109–0.9568) | 0.8357 (0.7819–0.8895) | 0.8360 (0.7567–0.9152) | 0.8456 (0.7846–0.9065) | 0.8357 (0.8001–0.8713) | 0.1139 (0.1015–0.1262) | 0.8357 (0.7983–0.8730) |
| Support Vector Machine (SVC) | 0.9361 (0.9069–0.9653) | 0.8798 (0.8217–0.9378) | 0.8357 (0.7583–0.9131) | 0.8537 (0.7907–0.9168) | 0.8608 (0.8302–0.8914) | 0.1064 (0.0857–0.1271) | 0.8575 (0.8246–0.8905) |
| LightGBM | 0.9392 (0.9147–0.9637) | 0.8745 (0.8316–0.9175) | 0.8498 (0.7945–0.9050) | 0.8591 (0.8147–0.9034) | 0.8638 (0.8391–0.8884) | 0.1052 (0.0793–0.1312) | 0.8621 (0.8372–0.8870) |
| XGBoost | 0.9329 (0.9088–0.9571) | 0.8505 (0.7986–0.9023) | 0.8448 (0.7817–0.9078) | 0.8536 (0.8037–0.9035) | 0.8478 (0.8205–0.8751) | 0.1127 (0.0890–0.1364) | 0.8477 (0.8216–0.8738) |
| AdaBoost | 0.9190 (0.8923–0.9456) | 0.8310 (0.7746–0.8873) | 0.8450 (0.7825–0.9075) | 0.8496 (0.7999–0.8994) | 0.8362 (0.8028–0.8695) | 0.1944 (0.1884–0.2004) | 0.8379 (0.8058–0.8701) |
| Logistic regression | 0.9216 (0.9002–0.9431) | 0.8169 (0.7182–0.9156) | 0.8500 (0.7826–0.9174) | 0.8569 (0.8031–0.9107) | 0.8254 (0.7727–0.8781) | 0.1204 (0.1029–0.1378) | 0.8334 (0.7992–0.8677) |
| Decision Tree | 0.7994 (0.7502–0.8486) | 0.7729 (0.6883–0.8574) | 0.8260 (0.7819–0.8700) | 0.8163 (0.7725–0.8601) | 0.7906 (0.7334–0.8478) | 0.2005 (0.1514–0.2497) | 0.7995 (0.7503–0.8486) |
| K-Nearest Neighbors (KNN) | 0.8777 (0.8457–0.9097) | 0.7257 (0.6697–0.7817) | 0.8838 (0.8271–0.9405) | 0.8690 (0.8079–0.9302) | 0.7871 (0.7471–0.8270) | 0.1476 (0.1292–0.1660) | 0.8042 (0.7676–0.8408) |
| Gaussian Naive Bayes | 0.8533 (0.8078–0.8988) | 0.4879 (0.3938–0.5819) | 0.9179 (0.8711–0.9647) | 0.8555 (0.7687–0.9422) | 0.6148 (0.5275–0.7022) | 0.2950 (0.2349–0.3550) | 0.7029 (0.6429–0.7629) |
| Testing | | | | | | | |
| Support Vector Machine (SVC) | 0.8808 (0.8089–0.9412) | 0.8108 (0.6667–0.9333) | 0.8556 (0.7821–0.9256) | 0.6977 (0.5500–0.8298) | 0.7500 (0.6400–0.8500) | 0.1393 (0.0969–0.1854) | 0.8425 (0.7795–0.9055) |
| Extra Trees | 0.9153 (0.8598–0.9642) | 0.8378 (0.7143–0.9474) | 0.8333 (0.7561–0.9136) | 0.6739 (0.5370–0.8182) | 0.7470 (0.6364–0.8500) | 0.1191 (0.0900–0.1491) | 0.8346 (0.7717–0.9055) |
| Random Forest | 0.9260 (0.8794–0.9687) | 0.8378 (0.7143–0.9474) | 0.8667 (0.8023–0.9368) | 0.7209 (0.5833–0.8537) | 0.7750 (0.6667–0.8657) | 0.1181 (0.0905–0.1454) | 0.8583 (0.7953–0.9213) |
| XGBoost | 0.8949 (0.8300–0.9548) | 0.8378 (0.7143–0.9474) | 0.8556 (0.7805–0.9260) | 0.7045 (0.5625–0.8286) | 0.7654 (0.6557–0.8571) | 0.1258 (0.0821–0.1699) | 0.8504 (0.7874–0.9134) |
| Gradient Boosting | 0.8826 (0.8167–0.9452) | 0.8108 (0.6756–0.9311) | 0.8667 (0.7957–0.9341) | 0.7143 (0.5833–0.8500) | 0.7595 (0.6505–0.8600) | 0.1335 (0.0823–0.1863) | 0.8504 (0.7874–0.9134) |
| LightGBM | 0.8913 (0.8293–0.9507) | 0.8108 (0.6765–0.9355) | 0.8444 (0.7677–0.9186) | 0.6818 (0.5365–0.8182) | 0.7407 (0.6197–0.8355) | 0.1351 (0.0884–0.1849) | 0.8346 (0.7717–0.8976) |
Bold – best value for each metric
Upon evaluation with the test dataset, the random forest model achieved the best overall performance among the hybrid models: the highest AUC ROC (0.9260, 95% CI 0.8794–0.9687), a high sensitivity of 0.8378 (95% CI 0.7143–0.9474), the highest specificity (0.8667, 95% CI 0.8023–0.9368) and precision (0.7209, 95% CI 0.5833–0.8537), and the lowest Brier score (0.1181, 95% CI 0.0905–0.1454). Figure 3a and b show the receiver operating characteristic (ROC) and precision-recall curves, with the corresponding areas under each curve, for the structured and hybrid random forest models. The calibration plot (Fig. 3c) for this model, while generally well calibrated across most probability ranges, tended to underestimate the probability of death for the highest-risk patients, a behaviour similar to that observed in the structured models.
Fig. 3.
Prediction ability of the structured variables-based and hybrid machine learning models for in-hospital COVID-19 death. (a) Receiver operating characteristic curves. (b) Precision-recall curves. (c) Calibration plot
The inclusion of unstructured data improved the sensitivity of the SVC, extra trees, and random forest models. Specificity improved in the SVC, extra trees, and XGBoost models. All the models except random forest showed a decrease in AUC ROC. However, none of the observed AUC ROC improvements were statistically significant (Table 5).
Table 5.
Comparison of the structured and hybrid models for COVID-19 mortality prediction
| Models | ROC AUC (structured) | ROC AUC (hybrid) | DeLong test (p value) | Sensitivity (structured) | Sensitivity (hybrid) | Specificity (structured) | Specificity (hybrid) | McNemar test (p value) |
|---|---|---|---|---|---|---|---|---|
| Support Vector Machine (SVC) | 0.8955 | 0.881 | 0.4684 | 0.7838 | 0.8108 | 0.8444 | 0.8556 | 0.7518 |
| Extra Trees | 0.9291 | 0.915 | 0.1201 | 0.8108 | 0.8378 | 0.8000 | 0.8333 | 0.3428 |
| Random Forest | 0.9170 | 0.9260 | 0.0924 | 0.8108 | 0.8378 | 0.8667 | 0.8667 | 1 |
| XGBoost | 0.9018 | 0.895 | 0.4601 | 0.8378 | 0.8378 | 0.8444 | 0.8556 | 1 |
| Gradient Boosting | 0.8856 | 0.883 | 0.7084 | 0.8108 | 0.8108 | 0.8667 | 0.8667 | 1 |
| LightGBM | 0.9078 | 0.891 | 0.0307 | 0.8378 | 0.8108 | 0.8444 | 0.8444 | 1 |
Analysing the hybrid models considering only adult hospitalizations and the test dataset, we observed that all the models showed an increase in AUC ROC, with gradient boosting reaching an AUC ROC of 0.9591. There was also an increase in sensitivity, except for the SVM model. Of the six models tested, three had a reduction in specificity. The best sensitivity occurred in the extra trees model (0.9722, with a specificity of 0.8000), and the best specificity occurred in the LightGBM model (0.8857, with a sensitivity of 0.8611). This model also presented the best precision (0.7949) (Table S3, Supplementary material).
The inclusion of unstructured data altered the performance landscape: whereas the leading metrics in the structured scenario were dispersed among multiple models, a single model, random forest, emerged as the top performer across most critical metrics, with gains in sensitivity (0.8378 vs. 0.8108), precision (0.7209 vs. 0.7143), and AUC ROC (0.9260 vs. 0.9170), and a Brier score (0.1181) only slightly above the best structured value (0.1162). However, despite these numerical improvements, statistical tests (DeLong's test for AUC ROC and McNemar's test for classification) showed that these gains were not statistically significant (Table 5). The same effect was not observed in the models considering only adult hospitalizations.
SHAP values provided insight into the interpretability of the random forest model. The mean absolute SHAP values, which indicate the average impact of each feature on the magnitude of the random forest model's output, are presented in Figs. 4 and 5. The analysis revealed that urea had the most significant impact on model predictions, followed by respiratory rate and C-reactive protein level. Other important features included aspartate aminotransferase, creatinine, age, diastolic blood pressure, and qSOFA. Features such as blood culture, lactate, and urine culture also contributed to the predictions, albeit with a lesser magnitude than the top-ranked variables. The only unstructured parameter identified as having a positive, although minor, mean impact on the model output was the use of furosemide. Higher values of urea, respiratory rate, C-reactive protein, and age were associated with an increased likelihood of COVID-19 death, as were lower diastolic blood pressure and having undergone blood and urine cultures.
Fig. 4.
SHAP variable importance chart, displaying the included features arranged in descending order based on their average absolute SHAP values
Fig. 5.
SHAP Beeswarm plot of variable importance chart, displaying the included features arranged in descending order based on their average absolute SHAP values
Discussion
We evaluated the impact of incorporating unstructured data into mortality prediction models, comparing them head-to-head with models that use only structured data. The structured data comprised laboratory tests and patient monitoring parameters, whereas the unstructured data were extracted from the history of present illness notes from the first encounter during hospitalization using a few-shot learning strategy. Patients diagnosed with COVID-19 were included, given the continuing significance of long COVID-19 cases. We observed an improvement in metrics with the inclusion of unstructured data, with these improvements particularly concentrated in a single model. However, the explainability analysis identified only one of the unstructured variables as important, and the improvements were not statistically significant.
Mortality prediction is a critical element, especially in severely ill patients and those in intensive care. Various risk scores that solely utilize structured data exist for mortality prediction. For example, the Sequential Organ Failure Assessment (SOFA) score is used to predict organ failure, sepsis, and mortality [39], and the Simplified Acute Physiology Score (SAPS) is used for mortality prediction in intensive care patients [41]. These scores are commonly criticized for not leveraging unstructured data from EHRs and for exhibiting moderate performance [28]. For example, the AUC ROC of the SOFA score for the prediction of death was 0.74 (95% CI 0.73–0.76) for patients with suspected infection who were admitted to the ICU, whereas the SAPS score had a reported sensitivity and specificity of 0.69 for ICU mortality. Our structured model showed an AUC ROC of 0.93 (95% CI 0.88–0.97), which supports the view that machine learning models can exceed this moderate performance.
Data generation from electronic health records (EHRs) is increasing [42], and the importance of their content has justified their use in artificial intelligence models. However, even when data are extracted from EHRs, the structured portion is predominantly utilized [24]. Although advocated as important, the use of unstructured data, and the evaluation of the impact of incorporating such data into baseline models with structured data, remain rare. A notable exception is the work of Sung et al., who extracted data from the history of the present illness and from computed tomography (CT) scan reports. Using two methods, Bag-of-Words and Word2Vec, the authors demonstrated that the hybrid model improved the prediction of 90-day functional outcomes for patients after acute ischemic stroke (AIS); they used AIS-specific risk scores as a baseline [43]. Our work presents methodological distinctions from the study by Sung et al. We concentrated on a hard clinical endpoint (in-hospital mortality) and employed clinical assertions (CAs) for data extraction, a method that improves interpretability by preserving semantic relationships. Furthermore, our research is distinguished by its head-to-head comparison framework, which was designed to evaluate whether hybrid models with unstructured data offer a viable replacement for traditional predictive approaches.
Although the inclusion of unstructured data led to improved model metrics, particularly in terms of sensitivity for four of the six models tested, the head-to-head comparison revealed that none of these performance gains were statistically significant, a finding that contradicts results presented by other authors [19, 44]. Kuo et al. reported significant variations in metrics in an emergency room disposition prediction model using structured and unstructured data compared with only structured data; however, the authors did not perform a direct comparison as we did in this study or report the model specificity or statistical tests to validate the difference in model performance [44].
Zhang et al. conducted a direct comparison of in-hospital mortality prediction by combining structured and unstructured data across three different models [19]. Our findings diverge from theirs: Zhang et al. reported statistically significant improvements when unstructured data were incorporated. However, a direct comparison between the studies is challenging, primarily due to methodological differences. First, Zhang et al. did not report sensitivity and specificity, metrics that are critical for evaluating performance on imbalanced clinical data and for assessing real-world clinical applicability. Second, their work is based on the MIMIC-III ICU dataset (data up to 2012), whereas our study uses more contemporary data (up to 2024) from a broader population across all hospital settings; the recency of our data captures modern treatment protocols and technologies registered in EHRs. Finally, our method of extracting CAs improves the explainability of the models.
Despite this, a direct comparison of the best-performing models reveals that our hybrid random forest model surpassed the authors' Fusion-LSTM model for in-hospital mortality across the available metrics: AUROC (0.926 vs. 0.871), AUPRC (0.860 vs. 0.250), and F1-score (0.775 vs. 0.424) [19].
Our results demonstrate that the inclusion of unstructured data does not significantly alter the models' performance. However, this finding must be nuanced by several factors. First, while widely used, DeLong's test for statistical comparison can have limitations. Second, unstructured data, by their very nature, are free to incorporate technological and therapeutic novelties not always captured in the structured fields of a medical record. Third, the lack of statistical significance also indicates that the hybrid model is not inferior to the model based solely on structured data, while retaining the advantage of capturing information that structured inputs may miss. These factors, combined with the observation that the use of unstructured data concentrated the best performance metrics within a single model, suggest a high potential for these hybrid approaches. This is especially true in a context where new language models are continuously emerging that could improve the extraction method while retaining the crucial advantage of explainability from the extracted clinical assertions.
The random forest model showed balanced performance, correctly predicting 83.8% of death cases (sensitivity) and 86.7% of discharge cases (specificity), with a precision of 72.1%. This level of performance was not achieved by the models based solely on structured data. Random forest is an ensemble-type model that builds multiple individual trees and aggregates their outputs. One of the premises for its creation was to leverage the predictive power of complex individual trees while mitigating their tendency to overfit the training data through ensemble averaging. To this end, Ho proposed that each tree be trained on a random subspace of the features, stating that "the vast number of subspaces in high dimensional feature spaces provides more choices" [45]. This initial randomness may enhance the model's capacity to capture the nuances within unstructured data: we observed that the unstructured variables had low prevalence, resulting in a sparse matrix, and the random subspace selection for each tree may be better suited to this characteristic. Another possible contributing factor is the Gini impurity criterion used for split selection.
We did not find recent studies performing head-to-head comparisons of models using both structured and unstructured data for mortality prediction, thus precluding direct comparisons.
However, we can make some comparisons with structured models. The structured models we created performed better than those presented by Dong et al. [46]. Their study analysed critically ill patients infected with COVID-19 and reported a slightly lower mortality rate than ours (26.5% vs. 28.9%). Considering the AUC ROC metric, our extra trees model outperformed theirs (0.93 vs. 0.85), with lower sensitivity (0.81 vs. 0.82) but higher specificity (0.80 vs. 0.77). This difference can be attributed to the method of selecting laboratory test values: while we used the median values per patient, Dong et al. used the first measurement. Another explanation is that we were more restrictive regarding missing values, excluding variables with more than 15% missing values, whereas they excluded those with more than 30%. Their final model had only five features, whereas ours had 21. Dong et al. identified five tests as explanatory for their results, none of which matched ours. Our leading indicators identified by SHAP, such as urea and creatinine (kidney biomarkers), aspartate aminotransferase, and C-reactive protein [47], are consistent with the literature as markers of severity for COVID-19. Age is also a factor that may be associated with the presence of comorbidities [48].
Despite promising results, a random forest-type model for mortality prediction faces challenges in simply replacing the traditional risk scores used in daily practice. This challenge stems primarily from the difficulty of clearly identifying the main explanatory variables for mortality prediction and of explaining such a "black box" model to healthcare professionals, who are more familiar with current scores based on laboratory tests. While traditional scores provide precise cutoff values for laboratory variables, explainable AI (XAI) models offer importance metrics without such distinct thresholds. The use of CAs could improve the explainability of the prediction models.
The Brier scores, despite indicating generally good calibration for the random forest model, revealed a tendency to underestimate the probability of mortality, particularly for patients at the highest risk, which could hinder the model's implementation in clinical practice. Improving calibration, especially at the extremes, is essential for clinical decision-making, enabling healthcare professionals to rely more confidently on the predicted probabilities for patient stratification and resource allocation. However, as explored in this study, new data sources may be utilized in the effort to enhance mortality prediction metrics.
One way to overcome this adoption barrier is to adopt a human-in-the-loop approach at all stages of model deployment. The involvement of information professionals is important for curating and validating the NLP extraction pipeline, including prompt refinement and clinical assertion consolidation, overseeing the initial extraction of CAs, and performing critical consolidation tasks such as mapping similar relations and standardizing clinical terms. The engagement of healthcare professionals who deal directly with patients is also critical, both to address their concerns regarding the applicability of a model they may perceive as a "black box" and to provide feedback on the model's predictions and explanatory outputs in real-world or simulated scenarios. Such feedback can be invaluable for iterative model improvement, recalibration, and fine-tuning, as well as for the co-development of clinical pathways or decision support tools that integrate the model's outputs, ensuring practical utility, seamless integration into existing workflows, and, ultimately, trust in the system. Our hybrid model holds potential for practical application, particularly integration into emergency department triage systems. By delivering a timely and accurate prediction of in-hospital mortality, it could function as a proactive clinical decision support tool, enabling the early flagging of high-risk COVID-19 patients, informing resource allocation by helping to prioritize individuals for higher levels of care, and guiding clinicians in anticipating disease progression to facilitate earlier and potentially more effective treatment adjustments.
Furthermore, a continuous feedback loop, leveraging real-world or simulated scenarios, is deemed invaluable for iterative improvement. This ongoing process of recalibration and fine-tuning is essential to ensure the model’s practical utility, facilitate its seamless integration into clinical workflows, and ultimately, solidify user trust.
We believe that such a collaborative human-AI process, characterized by continuous interaction and feedback loops, could not only overcome the adoption barrier but also iteratively enhance model performance, safety, and trustworthiness in clinical practice.
Limitations and future work
This study has several limitations. A primary limitation is its single-centre design, which restricts immediate generalizability owing to the inherent specificities of the patient population, clinical protocols, and documentation practices. This is particularly relevant for the hybrid model, which is sensitive to local data recording styles. Nevertheless, the study was conducted in a major Brazilian tertiary teaching hospital, providing robust findings for local clinical application. While we recognize the limitations of this study's generalizability, we believe that the practical application of predictive models, especially if they are intended to replace established scores, should be phased; this study is therefore a first step, and further local studies are needed to assess local generalizability. Another limitation is the use of median values for laboratory tests and monitoring data. This approach may favor the model's sensitivity in mortality prediction, as more critically ill patients who progress to death are likely to have longer hospital stays and, consequently, more examinations. This limitation is somewhat mitigated by the fact that the model also performed well on discharge data and by the use of the first clinical note, which captures only initial patient data. Further studies should assess whether the use of longitudinal data, both structured and unstructured, increases predictive power.
The model’s calibration curve tends to underestimate risk among the patients with the highest predicted probabilities of death. Additional studies are necessary to verify whether introducing more unstructured information, captured by altering the feature selection models, can improve model calibration. Another limitation is the sample size of 844 hospitalizations. While this cohort was sufficient for a single-institution exploratory analysis, it may affect the statistical stability of the models and their ability to detect subtler associations, and it further constrains the generalizability of the findings. Future research should therefore validate these results in substantially larger and more diverse cohorts, and with newer LLMs, to ensure model robustness and confirm predictive performance.
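The calibration check referenced above can be reproduced with standard tooling; the sketch below uses scikit-learn’s calibration_curve on synthetic placeholder data, not the study’s. Bins where the observed event fraction sits above the mean predicted probability at the high-risk end would indicate the underestimation pattern described above.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)                  # synthetic outcomes (1 = death)
noise = rng.random(500)
y_prob = np.clip(0.15 + 0.55 * y_true + 0.3 * noise, 0, 1)  # synthetic risks

# Observed event fraction vs. mean predicted probability, per bin.
frac_observed, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)
for pred, obs in zip(mean_predicted, frac_observed):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")

# Brier score summarizes overall probabilistic accuracy (lower is better).
print("Brier score:", brier_score_loss(y_true, y_prob))
```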
Finally, one of the authors reviewed the selected extracted terms, which may have introduced bias into the term correction process; even so, the resulting feature matrix remained sparse. Further research is needed to evaluate whether extracting more relevant information with medical knowledge graph models can reduce the time spent processing unstructured extraction results while simultaneously improving model performance. Future work should also explore advanced calibration techniques to investigate whether a wider range of unstructured information, or alternative feature selection strategies, can mitigate the underestimation noted above and improve the reliability of probability predictions.
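To illustrate why the assertion matrix tends to be sparse, the short sketch below (with hypothetical identifiers and features, not the study’s data) one-hot encodes consolidated assertions into a per-admission binary matrix; with a large assertion vocabulary and short notes, most entries are zero.

```python
import pandas as pd

# Hypothetical consolidated assertions, one row per (admission, feature).
assertions = pd.DataFrame({
    "admission_id": [1, 1, 2],
    "feature": ["has_symptom_affirmed_dyspnea",
                "has_symptom_negated_fever",
                "has_symptom_affirmed_dyspnea"],
})

# Binary admission x feature matrix; clip caps repeated mentions at 1.
matrix = pd.crosstab(assertions["admission_id"], assertions["feature"]).clip(upper=1)
print(matrix)
print((matrix == 0).mean().mean())  # fraction of zeros; grows with vocabulary size
```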
Conclusion
This study demonstrated that the use of unstructured data extracted from medical notes does not significantly improve the predictive power for mortality in patients diagnosed with COVID-19. Despite this, the hybrid model showed promising results: a single model consolidated the best set of metrics even when run in a restrictive computational environment, and its explanatory variables aligned with those observed in clinical practice, which represents an opportunity for the use of hybrid models. Implementing a hybrid model based on medical record data will require a reassessment of medical documentation practices to establish standardized methods for capturing and managing medical information. Better models can further improve the extraction of unstructured information.
Supplementary Information
Below is the link to the electronic supplementary material.
Abbreviations
- AIS
Acute ischemic stroke
- AUC ROC
Area Under the Receiver Operating Characteristic Curve
- CA
Clinical assertions
- CI
Confidence interval
- CT
Computed tomography
- EHR
Electronic health records
- ICU
Intensive care unit
- LLM
Large Language Models
- MIMIC-III
Medical Information Mart for Intensive Care, third edition
- NLP
Natural language processing
- NLTK
Natural Language Toolkit
- qSOFA
Quick Sequential Organ Failure Assessment
- RT-PCR
Reverse transcription-polymerase chain reaction test
- SHAP
Shapley Additive Explanations
- SOFA
Sequential Organ Failure Assessment
- XAI
Explainable Artificial Intelligence
Author contributions
All authors contributed equally to this manuscript.
Funding
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. Author APF holds a CNPq Research Productivity Scholarship, Level 2, Brazil (303187/2022-0).
Data availability
Data is provided in the manuscript and supplementary material.
Declarations
Ethical approval
This study was approved by the Hospital das Clínicas’ research ethics committee, Ribeirão Preto Faculty of Medicine (CAAE:75360523.7.0000.5440) and conducted in accordance with the principles of the Declaration of Helsinki.
Consent to participate
The need for individual informed consent was waived by Hospital das Clínicas’ research ethics committee, Ribeirão Preto Faculty of Medicine (CAAE:75360523.7.0000.5440) due to the retrospective nature of the study and the use of de-identified data.
Consent to publish
Not applicable.
Financial interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Both authors approved the final version of the article.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Fraile Navarro D, Coiera E, Hambly TW, et al. Expert evaluation of large language models for clinical dialogue summarization. Sci Rep. 2025;15:1195.
- 2. Hu D, Zhang S, Liu Q, et al. Large language models in summarizing radiology report impressions for lung cancer in Chinese: evaluation study. J Med Internet Res. 2025;27:e65547.
- 3. de Santis E, Martino A, Ronci F, et al. From bag-of-words to transformers: a comparative study for text classification in healthcare discussions in social media. IEEE Trans Emerg Top Comput Intell. 2025;9:1063–77.
- 4. Barlow SH, Chicklore S, He Y, et al. Uncertainty-aware automatic TNM staging classification for 18F fluorodeoxyglucose PET-CT reports for lung cancer utilising transformer-based language models and multi-task learning. BMC Med Inf Decis Mak. 2024;24:396.
- 5. Bardhan J, Roberts K, Wang DZ. Question answering for electronic health records: scoping review of datasets and models. J Med Internet Res. 2024;26:e53636.
- 6. Da Silva RP, Pazin-Filho A. Anonimização de textos médicos com processamento de linguagem natural [Anonymization of medical texts with natural language processing]. J Health Inf. 2025;17:1227.
- 7. Soman K, Rose PW, Morris JH, et al. Biomedical knowledge graph-optimized prompt generation for large language models. Bioinformatics. 2024;40.
- 8. Da Silva RP, Pollettini JT, Pazin Filho A. Unsupervised natural language processing in the identification of patients with suspected COVID-19 infection. Cad Saúde Pública. 2023;39.
- 9. Gan Z, Zhou D, Rush E, et al. ARCH: large-scale knowledge graph via aggregated narrative codified health records analysis. J Biomed Inf. 2025;162:104761.
- 10. Bedi S, Liu Y, Orr-Ewing L, et al. Testing and evaluation of health care applications of large language models: a systematic review. JAMA. 2025;333:319–28.
- 11. Choi A, Kim C, Ryoo J, et al. A pediatric emergency prediction model using natural language process in the pediatric emergency department. Sci Rep. 2025;15:3574.
- 12. Fernandes M, Gallagher K, Turley N, et al. Automated extraction of post-stroke functional outcomes from unstructured electronic health records. Eur Stroke J. 2025:23969873251314340.
- 13. Holmes G, Tang B, Gupta S, et al. Applications of large language models in the field of suicide prevention: scoping review. J Med Internet Res. 2025;27:e63126.
- 14. Omar M, Levkovich I. Exploring the efficacy and potential of large language models for depression: a systematic review. J Affect Disord. 2025;371:234–44.
- 15. Sugimoto K, Wada S, Konishi S, et al. Automated detection of cancer-suspicious findings in Japanese radiology reports with natural language processing: a multicenter study. J Imaging Inf Med. 2025.
- 16. Brann F, Sterling NW, Frisch SO, et al. Sepsis prediction at emergency department triage using natural language processing: retrospective cohort study. JMIR AI. 2024;3:e49784.
- 17. Dencker EE, Bonde A, Troelsen A, et al. Assessing the utility of natural language processing for detecting postoperative complications from free medical text. BJS Open. 2024;8.
- 18. Falter M, Godderis D, Scherrenberg M, et al. Using natural language processing for automated classification of disease and to identify misclassified ICD codes in cardiac disease. Eur Heart J Digit Health. 2024;5:229–34.
- 19. Zhang D, Yin C, Zeng J, et al. Combining structured and unstructured data for predictive models: a deep learning approach. BMC Med Inf Decis Mak. 2020;20:280.
- 20. Preiksaitis C, Ashenburg N, Bunney G, et al. The role of large language models in transforming emergency medicine: scoping review. JMIR Med Inf. 2024;12:e53787.
- 21. Alafari F, Driss M, Cherif A. Advances in natural language processing for healthcare: a comprehensive review of techniques, applications, and future directions. Comput Sci Rev. 2025;56:100725.
- 22. Khan W, Leem S, See KB, et al. A comprehensive survey of foundation models in medicine. IEEE Rev Biomed Eng. 2025.
- 23. Carrasco-Ribelles LA, Cabrera-Bean M, Khalid S, et al. Development of attention-based prediction models for all-cause mortality, home care need, and nursing home admission in ageing adults in Spain using longitudinal electronic health record data. J Med Syst. 2025;49:17.
- 24. Chen Y, Liu X, Yan M, et al. Death risk prediction model for patients with non-traumatic intracerebral hemorrhage. BMC Med Inf Decis Mak. 2025;25:35.
- 25. Bilal M, Hamza A, Malik N. NLP for analyzing electronic health records and clinical notes in cancer research: a review. J Pain Symptom Manage. 2025;69:e374–94.
- 26. He R, Sarwal V, Qiu X, et al. Generative AI models in time-varying biomedical data: scoping review. J Med Internet Res. 2025;27:e59792.
- 27. de Silva K, Mathews N, Teede H, et al. Clinical notes as prognostic markers of mortality associated with diabetes mellitus following critical care: a retrospective cohort analysis using machine learning and unstructured big data. Comput Biol Med. 2021;132:104305.
- 28. Boussen S, Benard-Tertrais M, Ogéa M, et al. Heart rate complexity helps mortality prediction in the intensive care unit: a pilot study using artificial intelligence. Comput Biol Med. 2024;169:107934.
- 29. Yuan Y, Zhang G, Gu Y, et al. Artificial intelligence-assisted machine learning models for predicting lung cancer survival. Asia Pac J Oncol Nurs. 2025;12:100680.
- 30. Simopoulos D, Kosmidis D, Anastassopoulos G, et al. A stacked ensemble deep learning model for predicting the intensive care unit patient mortality. AIH. 2025;2:47.
- 31. Xu L, Li C, Zhang J, et al. Personalized prediction of mortality in patients with acute ischemic stroke using explainable artificial intelligence. Eur J Med Res. 2024;29:341.
- 32. Olaker VR, Fry S, Terebuh P, et al. With big data comes big responsibility: strategies for utilizing aggregated, standardized, de-identified electronic health record data for research. Clin Transl Sci. 2025;18:e70093.
- 33. Struhal W, Almamoori D. A review of the sequelae of post COVID-19 with neurological implications (post-viral syndrome). J Neurol Sci. 2025;474:123532.
- 34. Crook H, Raza S, Nowell J, et al. Long covid-mechanisms, risk factors, and management. BMJ. 2021;374:n1648.
- 35. Fatoke B, Hui AL, Saqib M, et al. Type 2 diabetes mellitus as a predictor of severe outcomes in COVID-19 - a systematic review and meta-analyses. BMC Infect Dis. 2025;25:719.
- 36. Lopez Barrera E, Miljkovic K, Barnor K, et al. Examining COVID-19 mortality inequalities across 169 countries: insights from the COVID-19 mortality inequality curve (CMIC) and Theil index analysis. Health Policy. 2025;157:105345.
- 37. Collins GS, Moons KGM, Dhiman P, et al. TRIPOD + AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024;385:e078378.
- 38. Ning C, Ouyang H, Xiao J, et al. Development and validation of an explainable machine learning model for mortality prediction among patients with infected pancreatic necrosis. EClinicalMedicine. 2025;80:103074.
- 39. Singer M, Deutschman CS, Seymour CW, et al. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA. 2016;315:801–10.
- 40. Altalhan M, Algarni A, Turki-Hadj Alouane M. Imbalanced data problem in machine learning: a review. IEEE Access. 2025;13:13686–99.
- 41. Le Gall JR, Loirat P, Alperovitch A, et al. A simplified acute physiology score for ICU patients. Crit Care Med. 1984;12:975–7.
- 42. Patterson BW, Hekman DJ, Liao FJ, et al. Call me Dr Ishmael: trends in electronic health record notes available at emergency department visits and admissions. JAMIA Open. 2024;7:ooae039.
- 43. Sung S-F, Chen C-H, Pan R-C, et al. Natural language processing enhances prediction of functional outcome after acute ischemic stroke. J Am Heart Assoc. 2021;10:e023486.
- 44. Kuo K-M, Lin Y-L, Chang CS, et al. An ensemble model for predicting dispositions of emergency department patients. BMC Med Inf Decis Mak. 2024;24:105.
- 45. Ho TK. Random decision forests. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, Canada, 14–16 Aug 1995. IEEE Comput Soc Press; 1995. pp. 278–82.
- 46. Dong R, Yao H, Chen T, et al. Machine learning-based prediction of in-hospital mortality in severe COVID-19 patients using hematological markers. Can J Infect Dis Med Microbiol. 2025;2025:6606842.
- 47. Snopkowska Lesniak SW, Maschio D, Henriquez-Camacho C, et al. Biomarkers for SARS-CoV-2 infection: a narrative review. Front Med (Lausanne). 2025;12:1563998.
- 48. Gebremeskel GG, Tadesse DB, Haile TG. Mortality and morbidity in critically ill COVID-19 patients: a systematic review and meta-analysis. J Infect Public Health. 2024;17:102533.