Scientific Reports. 2023 Oct 18;13:17708. doi: 10.1038/s41598-023-43943-9

Machine learning risk estimation and prediction of death in continuing care facilities using administrative data

Faezehsadat Shahidi 1, Elissa Rennert-May 2,3,7,8,9,10, Adam G D’Souza 4,6, Alysha Crocker 5, Peter Faris 2,6, Jenine Leal 2,7,8,10,11,
PMCID: PMC10584843  PMID: 37853045

Abstract

In this study, we aimed to identify the factors that were associated with mortality among continuing care residents in Alberta, during the coronavirus disease 2019 (COVID-19) pandemic. We achieved this by leveraging and linking various administrative datasets together. Then, we examined pre-processing methods in terms of prediction performance. Finally, we developed several machine learning models and compared the results of these models in terms of performance. We conducted a retrospective cohort study of all continuing care residents in Alberta, Canada, from March 1, 2020, to March 31, 2021. We used a univariable and a multivariable logistic regression (LR) model to identify predictive factors of 60-day all-cause mortality by estimating odds ratios (ORs) with a 95% confidence interval. To determine the best sensitivity–specificity cut-off point, the Youden index was employed. We developed several machine learning models to determine the best model regarding performance. In this cohort study, increased age, male sex, symptoms, previous admissions, and some specific comorbidities were associated with increased mortality. Machine learning and pre-processing approaches offer a potentially valuable method for improving risk prediction for mortality, but more work is needed to show improvement beyond standard risk factors.

Subject terms: Risk factors, Engineering, Biomedical engineering, Health care

Introduction

The COVID-19 pandemic had significant scientific, medical, and social effects on the population1–3. However, COVID-19 does not affect all members of society equally4–6. Long-term care (LTC) residents (frail older adults who are not able to live by themselves and require medical care4,5,7–9) and designated supportive living (DSL) residents (older adults who can live independently but with support10) initially became the central point of the COVID-19 pandemic globally, and Canada was no exception5,7: more than half of the deaths related to COVID-19 occurred in LTC facilities7–9,11,12. As LTC and DSL residents are at high risk of mortality from COVID-19, it is important to understand the predictors that are strongly associated with adverse outcomes in this population.

One option for exploring predictors of adverse outcomes is standard logistic regression15. However, some studies have explored predictors using machine learning (ML) models rather than LR because of their superior predictive performance, especially for large administrative healthcare data13–15. LR models have limitations, including handling continuous variables, restricting the number of entered variables, and managing highly correlated variables16. Also, a comprehensive survey study found that the higher the degree of class imbalance in the data, the greater its effect on model performance17. Researchers have explored the likelihood of future outcomes using ML techniques, such as artificial neural networks (ANN)15, support vector machines (SVM)18, extreme gradient boosting (XGBoost)19, or random forests (RF)13, and have argued that ML models had better predictive performance than the standard LR model14.

The objective of our work is to compare LR and ML models to identify predictors of all-cause mortality20 among LTC and DSL residents in Alberta, Canada following a negative or positive COVID-19 test. Additionally, we developed and evaluated ML models, comparing them with LR models using the area under the curve (AUC) based on the Receiver Operating Characteristic (ROC) curve and sensitivity. This analysis aims to provide insights into all-cause mortality determinants and assess ML potential in improving predictive performance and patient outcomes in healthcare settings.

Methods

Study design and data access

This study was a retrospective cohort study using secondary data. The manuscript followed the reporting of studies conducted using observational routinely collected data (RECORD) guideline21. Adherence to the RECORD checklist items is documented in Appendix 1. The study was reviewed and approved by the University of Calgary Research Ethics Board (REB20-0688). Data Disclosure Agreements were developed with Alberta Health Services (AHS).

Setting and years of data

The data were collected to develop a population-based COVID-19 analytics and research database to understand the local epidemiology and resource use implications of COVID-19. The data included every person in Alberta who was tested for COVID-19 from March 1, 2020, to March 31, 2021. However, only the first positive record, or the first negative record for people who never tested positive, was included for each person.

Data sources

The laboratory data, including age, sex, specimen collection information (location and date), and the interpreted result (being symptomatic), were retrieved from the Alberta Health Services Enterprise Data Warehouse (EDW)22,23. The specimen collection location based on date and the residency information were accessed through the Alberta Continuing Care Information System (ACCIS)24,25. Outpatient data, including previous ambulatory visits, previous procedures, and specimen collection locations based on date, were accessed from the National Ambulatory Care Reporting System (NACRS)26,27. Chronic conditions28 came from practitioner claims under the Alberta Health Care Insurance Plan (AHCIP)29 and from the Discharge Abstract Database (DAD). Previous hospitalizations, previous procedures, previous special care unit (SCU) visits, and specimen collection locations based on date were also acquired from the DAD30. Mortality data were obtained from the Alberta Vital Statistics (AVS) database31. All of the administrative data codes32 used in this study are included in Appendix 2.

Data linkage and data preparation

The data were linked using the personal health number (PHN) or the universal lifetime identifier (ULI). Patient identifiers, including PHN, ULI, names, dates of birth, and postal codes, were all removed, and scrambled ULIs were provided instead. Afterward, in accordance with AHS de-identification policies, all dates were removed and replaced with counts of days between the index date (the COVID-19 specimen collection date) and each date of interest. The flowchart depicting the cohort creation, data cleaning, and analyses is included in Appendix 3.
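The replacement of calendar dates by day counts relative to the index date can be illustrated with a minimal sketch. The function names, the example dates, and the choice to treat day 0 through day 60 as inside the outcome window are assumptions made for illustration; the source does not describe its implementation.

```python
from datetime import date

def days_from_index(index_date, event_date):
    """Signed number of days from the index date to an event (None if no event)."""
    if event_date is None:
        return None
    return (event_date - index_date).days

def died_within(days_to_death, window=60):
    """True if death occurred on or after the index date and within `window` days.

    Counting day 0 and day `window` as inside the window is an assumption.
    """
    return days_to_death is not None and 0 <= days_to_death <= window

# Hypothetical record; the dates are invented for illustration only.
index_date = date(2020, 4, 10)   # specimen collection date
death_date = date(2020, 5, 25)
offset = days_from_index(index_date, death_date)
print(offset, died_within(offset))
```

Once every date is stored as an offset like this, the calendar dates themselves can be dropped from the analytic file while the 60-day outcome remains computable.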

Cohort study of LTC/DSL residents

The overall tested individuals within the data set included 1.8 million records. Among all tested populations, our study cohort comprised 25,586 individuals who were residents of LTC or DSL facilities in Alberta and had confirmed first positive or first negative (for those who were never positive) COVID-19 test results between March 1, 2020, and March 31, 2021.

Variables of interest

In alignment with a previous study33, comorbidities were assessed based on historical data from the two years preceding the index date. We also completed a one-year lookback before episodes of infection for the number of admissions, procedures, and special care unit (SCU) visits to establish a baseline of healthcare utilization33.

The primary outcome in our study was all-cause mortality within 60 days of a resident’s first COVID-19 polymerase chain reaction (PCR) positive or negative test result. According to an early study34, there is a time lag of two to eight weeks (up to roughly 60 days) between a COVID-19 case (index date) and death. We used all-cause mortality20 within 60 days of a positive or negative COVID-19 test because this information was available in vital statistics.

The selected covariates were based on both clinical expertise and previous literature that has examined the association between age and the Elixhauser comorbidity index35, as well as prior admissions36 and their connection to severe COVID outcomes. So, the selected covariates encompassed various factors, including age, demographic characteristics (sex, LTC vs DSL resident, and specimen collection location), comorbidities (based on the Elixhauser index), the number of previous procedures (inpatient and outpatient), the number of previous admissions (hospital and ICU), symptomatic status, specimen year-month collection, and the results of the PCR test.

Statistical analysis

Missing values were addressed by imputation from other relevant columns. For example, missing values in the "Symptomatic during collection" feature were imputed using the corresponding test results. We also addressed outliers by removing a small number of records (one record from a 17-year-old individual and five with unidentified sex), which together constituted only 0.02% of all records. The 17-year-old individual was removed so that the data accurately represent the target population, which includes individuals 18 years of age or older. Individuals of unidentified sex were excluded to address potential errors in data collection or recording and to improve the overall quality and accuracy of the dataset. These steps are illustrated in the flowchart in Appendix 3.

Descriptive statistics were used to describe the characteristics of the cohort: frequencies and percentages for categorical variables, means with standard deviations (SD) for normally distributed continuous variables, and medians with interquartile ranges (IQR) for skewed continuous variables.

In this study, we aimed to directly estimate the odds37 of mortality. Given a binary outcome variable, we chose standard logistic regression (LR)38 over modified Poisson regression, as LR is more suitable for estimating odds for binary outcomes. Univariable LR was used to identify individual predictive factors of 60-day mortality by estimating odds ratios (ORs)39 with 95% confidence intervals (CIs). A multivariable LR was applied to examine the joint association of all risk factors with 60-day mortality, yielding adjusted ORs (aORs) and 95% CIs.

To validate the predictive models, the data were randomly split into a training set (90% of the sample) and a test set (10% of the sample)40. Forward selection was used to build the predictive models by iteratively adding features (predictor variables)41. We used sensitivity as the criterion for deciding which predictors to include in the model.

While useful for evaluating standard LR models, the AUC, sensitivity, and specificity values do not explicitly identify the best cut points42. The Youden index (J) was proposed to identify optimal cut points43. Youden’s index is the difference between the true positive rate and the false positive rate (equivalently, sensitivity + specificity − 1), computed across all potential cut-point values; the cut point that maximizes J is selected44,45.
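The selection of a cut point by Youden's index can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the study; the function name and the toy data are assumptions, and ties in J are resolved by keeping the lowest threshold.

```python
def youden_cutoff(y_true, y_prob):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A case is predicted positive when its probability is >= threshold.
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
        j = tp / positives + tn / negatives - 1
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_threshold, best_j = threshold, j
    return best_threshold, best_j

# Toy data (invented): 2 deaths among 10 residents, all with low predicted risk.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.09, 0.20, 0.10, 0.30]
threshold, j = youden_cutoff(y_true, y_prob)
print(threshold, j)
```

In this toy example no predicted probability reaches 0.5, so the default cut-off would flag no deaths at all, whereas the Youden cut point lies far lower. The same mechanism is at work when an imbalanced outcome pushes predicted risks toward zero.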

In this study, the pre-processing techniques included random over-sampling to balance the classes and power transformation (PT) to normalize the data17,46–48. We employed power transformation to address non-normality and skewness in the data distribution49. Applying this transformation improved model performance in our analysis.

Over-sampling techniques have been reported to outperform under-sampling methods because they address data imbalance without discarding valuable information, ultimately improving model performance50. We therefore applied the random over-sampling technique (ROTE) and the synthetic minority over-sampling technique (SMOTE)51,52 separately to the training set. The test set was kept untouched for the final performance report. Four LR models were evaluated: initial data with a 0.5 threshold (model 1), initial data with a 0.083 threshold (model 2), SMOTE with a 0.46 threshold (model 3), and ROTE with a 0.42 threshold (model 4).
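Random over-sampling of the training set can be sketched as below. This is a simplified stand-in written in plain Python, not the study's implementation: the function name, the seed, and the toy data are assumptions. It duplicates randomly chosen minority-class rows until every class has as many rows as the largest class, and it is applied to the training data only.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample with replacement to fill the gap up to the majority-class size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy training set (invented): 9 survivors and 1 death, one feature (age).
train_rows = [[age] for age in (70, 72, 75, 78, 80, 81, 83, 85, 88, 90)]
train_labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
bal_rows, bal_labels = random_oversample(train_rows, train_labels)
print(bal_labels.count(0), bal_labels.count(1))
```

Because the held-out test set is never resampled, its class ratio still reflects the real population, which is what makes the reported sensitivity meaningful.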

Model development

Along with models 1–4 (LR models with pre-processing), ML models including RF, SVM, XGBoost, and ANN were developed and tested to predict 60-day mortality13–15,18. An alpha level of 0.05 (two-tailed) was used to assess statistical significance. We used the Advanced Research Computing (ARC) cluster at the University of Calgary53 to submit several jobs in parallel. Each job had a unique input for hyper-parameter optimization. About 200,000 experiments were performed to find the best hyper-parameters for all the ML models. The best results obtained with these hyper-parameters are reported in Appendix 4. These models were examined using normalized data and balanced classes. Data analysis and model development were performed in Python (version 3.9), and the packages used are listed in Appendix 5. The Bourne-Again Shell (Bash), a popular scripting language for coordinating shell tasks, was used to manage the resources in ARC54.

Several models were examined in this study. The first was the RF model55, an ML algorithm that works by majority vote over several decision trees, each built on randomly selected variables in the dataset56. Second, SVM classifies by using a multidimensional hyperplane (chosen to maximize the margin between the classes) and a nonlinear function (kernel)57. Third, XGBoost was appealing because, compared with the other classifiers, its default configuration had strong average performance58. XGBoost is a decision-tree ensemble algorithm that uses a gradient-boosting framework. Finally, ANN is becoming a common ML model in health care59–64. This model has at least three layers: input (receives the features), hidden (extracts patterns based on weights), and output (presents the result)59. Each layer has nodes, or neurons, that are connected to the units in adjacent layers by a set of adjustable weights62. Like LR, ANN models use activation functions (such as the sigmoid) to produce the output65.
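The relationship between LR and an ANN with sigmoid activations can be made concrete with a small forward-pass sketch. All weights, biases, and feature values below are hypothetical, and the network shape is chosen only for illustration; this is not the architecture reported in Appendix 11.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer followed by a sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(features, layers):
    """Propagate features through a list of (weights, biases) layers."""
    activations = features
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations

# Hypothetical standardized features and weights, for illustration only.
features = [0.5, -1.0]
hidden = ([[0.8, -0.4], [0.3, 0.9]], [0.0, 0.1])   # 2 inputs -> 2 hidden units
output = ([[1.2, -0.7]], [-0.5])                   # 2 hidden units -> 1 output
risk = forward(features, [hidden, output])[0]

# With no hidden layer, the same code is exactly a logistic regression.
lr_risk = forward(features, [([[0.8, -0.4]], [0.0])])[0]
print(risk, lr_risk)
```

Seen this way, adding hidden layers lets the model learn nonlinear combinations of the inputs before the final sigmoid, while the output remains a probability-like score that still needs a cut-off.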

Results

Among all continuing care residents who were tested for COVID-19 from March 1, 2020, to March 31, 2021, six individuals were removed due to missing values in sex or outliers in age. This left 25,580 continuing care residents (16,219 women [63.41%]; median age, 84 years [interquartile range, 80 + years]), of whom 15,066 were LTC residents [58.9%] and 10,514 were DSL residents [41.1%]. Among all residents, 14,470 were tested in LTC [56.57%], 9748 were tested in DSL [38.11%], 743 were tested in the ED [2.9%], and 619 were tested in hospital [2.42%]. Nearly 14% (n = 3489, 13.64%) had missing symptom information, 75.32% were asymptomatic, and 11.04% had reported symptoms at the time of testing. In this cohort, 3945 residents tested positive for COVID-19 [15.42%], and 93.19% (n = 23,838) had at least one comorbidity (Appendix 6).

Overall, 2590 residents (10.12%) died (1454 women [56.14%]; median age, 88 years [interquartile range, 80 + years]), of whom 1893 were LTC residents [73.09%] and 697 were DSL residents [26.91%]. The largest proportion of deaths (18.38%) occurred among continuing-care residents tested for COVID-19 in April 2020. All characteristics are described in Table 1. The horizontal bar charts in Appendix 7 present the characteristics of continuing-care residents in Alberta who died. Among the 2590 residents who died within 60 days, the mean number of previous hospital admissions was 0.7 for those who tested positive for COVID-19 and 0.5 for those who tested negative. The mean number of SCU admissions was 0.02 and 0.01 for those with and without COVID-19, respectively. For those who tested positive for COVID-19, the mean number of previous inpatient and outpatient procedures was 0.4 and 5.6, respectively. Those who tested negative for COVID-19 had a mean of 0.3 previous inpatient procedures and 5.2 previous outpatient procedures.

Table 1.

Characteristics of continuing care residents in Alberta with a first positive or first negative COVID-19 test result between March 1, 2020, and March 31, 2021.

Columns, left to right: Characteristic; 60-day mortality, Yes (n = 2590, 10.12%); 60-day mortality, No (n = 22,990, 89.87%); Total (n = 25,580). Values are n (%).
Age category
 80 + 2022 (78.07) 14,646 (63.71) 16,668 (65.16)
 70–79 386 (14.9) 4555 (19.81) 4941 (19.32)
 60–69 144 (5.56) 2291 (9.97) 2435 (9.52)
 50–59 28 (1.08) 994 (4.32) 1022 (4)
 40–49 4 (0.15) 303 (1.32) 307 (1.2)
 30–39 4 (0.15) 156 (0.68) 160 (0.63)
 18–29 2 (0.08) 45 (0.2) 47 (0.18)
Gender
 Female 1454 (56.14) 14,765 (64.22) 16,219 (63.41)
 Male 1136 (43.86) 8225 (35.78) 9361 (36.59)
Specimen collection location
 LTC 1706 (65.87) 12,764 (55.52) 14,470 (56.57)
 DSL 520 (20.08) 9228 (40.14) 9748 (38.11)
 ED 217 (8.38) 526 (2.29) 743 (2.9)
 H 147 (5.68) 472 (2.05) 619 (2.42)
Resident during collection
 LTC 1893 (73.09) 13,173 (57.3) 15,066 (58.9)
 DSL 697 (26.91) 9817 (42.7) 10,514 (41.1)
Symptomatic during collection
 No 1573 (60.73) 17,695 (76.97) 19,268 (75.32)
 Unknown 532 (20.54) 2957 (12.86) 3489 (13.64)
 Yes 485 (18.73) 2338 (10.17) 2823 (11.04)
Result of the covid test
 Negative 1594 (61.54) 20,041 (87.17) 21,635 (84.58)
 Positive 996 (38.46) 2949 (12.83) 3945 (15.42)
Specimen year-month collection
 2020-3 (original variant) 192 (7.41) 835 (3.63) 1027 (4.01)
 2020-4 (alpha) 476 (18.38) 3226 (14.03) 3702 (14.47)
 2020-5 271 (10.46) 2054 (8.93) 2325 (9.09)
 2020-6 338 (13.05) 9192 (39.98) 9530 (37.26)
 2020-7 117 (4.52) 894 (3.89) 1011 (3.95)
 2020-8 50 (1.93) 549 (2.39) 599 (2.34)
 2020-9 59 (2.28) 514 (2.24) 573 (2.24)
 2020-10 (delta) 198 (7.64) 1226 (5.33) 1424 (5.57)
 2020-11 276 (10.66) 1389 (6.04) 1665 (6.51)
 2020-12 (beta) 373 (14.4) 1827 (7.95) 2200 (8.6)
 2021-1 (gamma) 199 (7.68) 846 (3.68) 1045 (4.09)
 2021-2 (theta) 30 (1.16) 277 (1.2) 307 (1.2)
 2021-3 11 (0.42) 161 (0.7) 172 (0.67)
Comorbidities
 Hypertension, uncomplicated 1307 (50.46) 11,751 (51.11) 13,058 (51.05)
 Depression 1121 (43.28) 10,922 (47.51) 12,043 (47.08)
 Other neurological disorders 1060 (40.93) 9413 (40.94) 10,473 (40.94)
 Diabetes 802 (30.97) 6557 (28.52) 7359 (28.77)

LTC long-term care, DSL designated supportive living, ED emergency department, H hospital.

The horizontal bar charts in Appendix 8 illustrate the multivariable associations between clinical risk factors and 60-day all-cause mortality in continuing-care residents in Alberta. In our univariable and multivariable analyses between resident characteristics and 60-day mortality, the following associations were observed (Table 2).

  1. Sociodemographic characteristics: In both analyses, older age was associated with a higher probability of 60-day mortality. In the multivariable model, the odds of death increased by a factor of 1.05 per one-year increase in age [95% CI 1.04–1.05] (p < 0.01). Concerning other demographic risk factors, men had a higher risk of 60-day mortality than women (aOR 1.57 [95% CI 1.42–1.73]; p < 0.01). The risk of death was lower among DSL residents than among LTC residents (aOR 0.54 [95% CI 0.41–0.71], p < 0.01). Among LTC residents, 39% were male, 18% tested positive for COVID-19, 64% were 80 years old or older, and 92% had at least one comorbidity. Among DSL residents, 33% were male, 13% tested positive, 66% were 80 years old or older, and 95% had at least one comorbidity. The percentages of males and of individuals who tested positive were therefore greater in LTC than in DSL. By specimen collection location, the highest risk of death was among residents tested in the emergency department (aOR 4.54 [95% CI 3.57–5.77], p < 0.01), followed by those tested in hospital (aOR 3.6 [95% CI 2.76–4.68], p < 0.01). There was no evidence that residents tested while still residing in their DSL facility had decreased odds of death (aOR 0.97 [95% CI 0.72–1.30], p = 0.82).

  2. Clinical characteristics: In both analyses, having symptoms was associated with a higher probability of 60-day mortality. Compared with residents with no symptoms, the adjusted odds of death were 2.02 (95% CI 1.77–2.32) times higher for those with symptoms and 1.45 (95% CI 1.27–1.65) times higher for residents with unknown symptom status. Additionally, those who tested positive for COVID-19 had a greater risk of 60-day mortality than those who tested negative (aOR 3.61 [95% CI 3.14–4.15]).

  3. Hospital admission-related variables: In the multivariable model, the number of previous hospital admissions (aOR 1.02 [95% CI 0.97–1.08], p = 0.44), SCU admissions (aOR 1.15 [95% CI 0.83–1.60], p = 0.41), and inpatient procedures (aOR 1.01 [95% CI 0.97–1.06], p = 0.59) had point estimates above 1, but none of these associations with 60-day mortality reached statistical significance.

  4. Comorbidities: Among all residents who died, 1307 (50.46%) had uncomplicated hypertension, 1121 (43.28%) had depression, 1060 (40.93%) had neurological disorders, 809 (31.24%) had fluid and electrolyte disorders, and 802 (30.97%) had diabetes (Table 1). The most prevalent comorbidities are listed in Appendix 6. When considered together with clinical, inpatient, and demographic characteristics, fluid and electrolyte disorders (aOR 1.19 [95% CI 1.07–1.34]) were associated with a higher probability of 60-day mortality among the most prevalent comorbidities, and diabetes showed a borderline association (aOR 1.11 [95% CI 1.00–1.23], p = 0.06). Metastatic cancer (aOR 1.58 [95% CI 1.14–2.19]) and chronic liver disease (aOR 1.51 [95% CI 1.15–1.99]) were the strongest comorbidity risk factors for 60-day mortality (Table 2). The ORs and aORs of the most prevalent comorbidities are listed in Appendix 9.

Table 2.

Associations between clinical risk factors and 60-day all-cause mortality in continuing care residents in Alberta with a first positive or first negative COVID-19 test result between March 1, 2020, and March 31, 2021.

Columns, left to right: Characteristic; Univariable OR (95% CI); Univariable p-value; Multivariable aOR (95% CI); Multivariable p-value.
Age 1.04 (1.03, 1.04)  < 0.01 1.05 (1.04, 1.05)  < 0.01
Gender
 Male 1.40 (1.29, 1.52)  < 0.01 1.57 (1.42, 1.73)  < 0.01
 Female 1 [Reference] 1 [Reference]
Specimen collection location
 LTC 1 [Reference] 1 [Reference]
 DSL 0.42 (0.38, 0.47)  < 0.01 0.97 (0.72, 1.30) 0.82
 ED 3.09 (2.61, 3.64)  < 0.01 4.54 (3.57, 5.77)  < 0.01
 H 2.33 (1.92, 2.82)  < 0.01 3.6 (2.76, 4.68)  < 0.01
Resident during collection
 LTC 1 [Reference] 1 [Reference]
 DSL 0.49 (0.45, 0.54)  < 0.01 0.54 (0.41, 0.71)  < 0.01
Symptomatic during collection
 No 1 [Reference] 1 [Reference]
 Yes 2.33 (2.09, 2.61)  < 0.01 2.02 (1.77, 2.32)  < 0.01
 Unknown 2.02 (1.82, 2.25)  < 0.01 1.45 (1.27, 1.65)  < 0.01
Result of the covid test
 Negative 1 [Reference] 1 [Reference]
 Positive 4.25 (3.89, 4.64)  < 0.01 3.61 (3.14, 4.15)  < 0.01
Specimen year-month collection
 2020-3 (original) 6.25 (5.17, 7.57)  < 0.01 4.62 (3.71, 5.75)  < 0.01
 2020-4 (alpha) 4.01 (3.47, 4.64)  < 0.01 2.96 (2.52, 3.48)  < 0.01
 2020-5 3.59 (3.04, 4.24)  < 0.01 2.08 (1.71, 2.53)  < 0.01
 2020-6 1 [Reference] 1 [Reference]
 2020-7 3.56 (2.85, 4.44)  < 0.01 2.15 (1.68, 2.76)  < 0.01
 2020-8 2.48 (1.82, 3.37)  < 0.01 1.74 (1.24, 2.44)  < 0.01
 2020-9 3.12 (2.33, 4.17)  < 0.01 1.71 (1.24, 2.36)  < 0.01
 2020-10 (delta) 4.39 (3.65, 5.29)  < 0.01 2.15 (1.73, 2.68)  < 0.01
 2020-11 5.40 (4.56, 6.40)  < 0.01 2.36 (1.91, 2.90)  < 0.01
 2020-12 (beta) 5.55 (4.75, 6.49)  < 0.01 2.18 (1.78, 2.68)  < 0.01
 2021-1 (gamma) 6.4 (5.30, 7.73)  < 0.01 2.03 (1.59, 2.58)  < 0.01
 2021-2 (theta) 2.95 (1.99, 4.36)  < 0.01 2.04 (1.32, 3.15)  < 0.01
 2021-3 1.86 (1.00, 3.46) 0.05 1.47 (0.72, 3.00) 0.29
Hospital and outpatient
 #H admits 1 year 1.12 (1.08, 1.16)  < 0.01 1.02 (0.97, 1.08) 0.44
 #SCU admits 1 year 1.21 (0.94, 1.55) 0.13 1.15 (0.83, 1.60) 0.41
 #DAD proc 1 year 1.04 (1.01, 1.07) 0.01 1.01 (0.97, 1.06) 0.59
 #NACRS proc 1 year 1.00 (0.99, 1) 0.47 1.00 (1.00, 1.00) 0.52
Comorbidities
 Num of Elixhauser 1.05 (1.04, 1.07)  < 0.01 1.03 (1.01, 1.05)  < 0.01
 Fluid and electrolyte disorders 1.47 (1.35, 1.61)  < 0.01 1.19 (1.07, 1.34)  < 0.01
 Diabetes 1.12 (1.03, 1.23) 0.01 1.11 (1.00, 1.23) 0.06
 Metastatic cancer 1.92 (1.48, 2.50)  < 0.01 1.58 (1.14, 2.19)  < 0.01
 Liver disease 1.24 (0.98, 1.56) 0.07 1.51 (1.15, 1.99)  < 0.01

LTC long-term care, DSL designated supportive living, ED emergency department, H hospital, Y year, Proc procedures.
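For a single binary predictor, the univariable LR odds ratio equals the cross-product ratio of the corresponding 2×2 table, so the univariable ORs in Table 2 can be cross-checked against the counts in Table 1. The sketch below does this for test result and sex. The Wald (Woolf) interval on the log-odds scale is an assumption about how the confidence intervals were formed, and the function name is illustrative.

```python
import math

def odds_ratio_ci(exposed_events, exposed_nonevents, ref_events, ref_nonevents, z=1.959964):
    """Odds ratio with a Wald (Woolf) confidence interval from 2x2 counts."""
    odds_ratio = (exposed_events * ref_nonevents) / (exposed_nonevents * ref_events)
    se_log_or = math.sqrt(1 / exposed_events + 1 / exposed_nonevents
                          + 1 / ref_events + 1 / ref_nonevents)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Counts from Table 1 (died within 60 days, did not die).
# Positive test: 996 / 2949; negative test (reference): 1594 / 20,041.
positive = odds_ratio_ci(996, 2949, 1594, 20041)
# Male: 1136 / 8225; female (reference): 1454 / 14,765.
male = odds_ratio_ci(1136, 8225, 1454, 14765)

for name, (estimate, lower, upper) in (("positive test", positive), ("male", male)):
    print(f"{name}: OR {estimate:.2f} (95% CI {lower:.2f}, {upper:.2f})")
```

Both results agree to two decimals with the univariable column of Table 2 (4.25 [3.89, 4.64] and 1.40 [1.29, 1.52]). The adjusted ORs cannot be recovered this way, since they depend on all covariates jointly.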

We also investigated the adjusted associations between clinical risk factors and 60-day all-cause mortality stratified by sex (Table 3). Although the results were broadly similar for both sexes, being symptomatic was slightly more strongly associated with mortality in females than in males: females had an aOR of 2.10 (95% CI 1.76, 2.51, p < 0.01), while males had an aOR of 1.77 (95% CI 1.43, 2.19, p < 0.01). However, the aOR for a positive COVID-19 test result and 60-day all-cause mortality was higher in the male population (aOR 4.25, 95% CI 3.40, 5.32, p < 0.01) than in the female population (aOR 3.26, 95% CI 2.72, 3.91, p < 0.01). For diabetes, the male population showed a significant association (aOR 1.22, 95% CI 1.04, 1.43, p = 0.01), while no significant association was found in females (aOR 0.99, 95% CI 0.85, 1.15, p = 0.91).

Table 3.

Associations, stratified by sex, between clinical risk factors and 60-day all-cause mortality in continuing care residents in Alberta with a first positive or first negative COVID-19 test result between March 1, 2020, and March 31, 2021.

Columns, left to right (multivariable models): Characteristic; Male aOR (95% CI); Male p-value; Female aOR (95% CI); Female p-value.
Age 1.05 (1.04, 1.06)  < 0.01 1.05 (1.04, 1.05)  < 0.01
Specimen collection location
 LTC 1 [Reference] 1 [Reference]
 DSL 1.04 (0.65, 1.64) 0.88 0.92 (0.62, 1.36) 0.69
 ED 4.07 (2.82, 5.86)  < 0.01 4.38 (3.15, 6.10)  < 0.01
 H 3.21 (2.15, 4.79)  < 0.01 3.78 (2.64, 5.42)  < 0.01
Resident during collection
 LTC 1 [Reference] 1 [Reference]
 DSL 0.56 (0.37, 0.86)  < 0.01 0.53 (0.37, 0.76)  < 0.01
Symptomatic during collection
 No 1 [Reference] 1 [Reference]
 Yes 1.77 (1.43, 2.19)  < 0.01 2.10 (1.76, 2.51)  < 0.01
 Unknown 1.28 (1.05, 1.57) 0.02 1.47 (1.24, 1.74)  < 0.01
Result of the covid test
 Negative 1 [Reference] 1 [Reference]
 Positive 4.25 (3.40, 5.32)  < 0.01 3.26 (2.72, 3.91)  < 0.01
Specimen year-month collection
 2020-3 (original) 4.85 (3.48, 6.75)  < 0.01 4.50 (3.36, 6.01)  < 0.01
 2020-4 (alpha) 3.16 (2.45, 4.08)  < 0.01 2.80 (2.27, 3.46)  < 0.01
 2020-5 2.59 (1.92, 3.50)  < 0.01 1.92 (1.48, 2.48)  < 0.01
 2020-6 1 [Reference] 1 [Reference]
 2020-7 1.79 (1.16, 2.76)  < 0.01 2.50 (1.84, 3.40)  < 0.01
 2020-8 1.76 (1.04, 2.99) 0.03 1.74 (1.12, 2.72) 0.01
 2020-9 1.25 (0.71, 2.20) 0.43 2.19 (1.48, 3.24)  < 0.01
 2020-10 (delta) 1.97 (1.39, 2.79)  < 0.01 2.29 (1.72, 3.05)  < 0.01
 2020-11 2.61 (1.89, 3.61)  < 0.01 2.11 (1.60, 2.77)  < 0.01
 2020-12 (beta) 2.45 (1.78, 3.38)  < 0.01 2.03 (1.55, 2.66)  < 0.01
 2021-1 (gamma) 1.86 (1.26, 2.73)  < 0.01 2.32 (1.71, 3.16)  < 0.01
 2021-2 (theta) 1.85 (0.93, 3.68) 0.08 2.30 (1.32, 3.99)  < 0.01
 2021-3 0.80 (0.18, 3.49) 0.76 1.89 (0.80, 4.48) 0.15
Hospital and outpatient
 #H admits 1 year 1.00 (0.92, 1.08) 0.92 1.02 (0.94, 1.10) 0.67
 #SCU admits 1 year 1.07 (0.67, 1.72) 0.77 1.02 (0.74, 1.76) 0.55
 #DAD proc 1 year 1.00 (0.95, 1.06) 0.94 1.02 (0.95, 1.09) 0.56
 #NACRS proc 1 year 1.00 (1.00, 1.00) 0.50 1.00 (1.00, 1.00) 0.59
Comorbidities
 Fluid and electrolyte disorders 1.12 (0.94, 1.34) 0.2 1.21 (1.04, 1.41) 0.01
 Diabetes 1.22 (1.04, 1.43) 0.01 0.99 (0.85, 1.15) 0.91
 Metastatic cancer 1.74 (1.09, 2.78) 0.02 1.4 (0.88, 2.24) 0.15
 Liver disease 1.63 (1.09, 2.43) 0.02 1.31 (0.89, 1.93) 0.17

LTC long-term care, DSL designated supportive living, ED emergency department, H hospital, Y year, Proc procedures.

Model performance

The univariable comparison in Appendix 10 shows that the ORs for the individual features were almost the same for the initial data (without balancing) and for the data with balanced classes (using ROTE and SMOTE). Examination of the standard LR models (models 1–4) showed that the choice of cut-off point was the critical weakness of the predictive analysis66. Identifying the optimal sensitivity–specificity cut-off point with Youden’s index improved performance: the sensitivity of the LR model rose from 6 to 77% and the AUC from 53 to 71% (Table 4). Model 4 (LR model with the PT technique67 and ROTE) achieved slightly higher sensitivity than model 2 (LR model without pre-processing techniques): 78% versus 77%, a gain of one percentage point.

Table 4.

Performance metrics of all predictive models.

Model Sensitivity (%) AUC (%) Cut-off point
Model 1 (LR model + initial data) 6 53 0.50
Model 2 (LR model + initial data + J) 77 71 0.083
Model 3 (LR model + PT + SMOTE + J) 75 72 0.46
Model 4 (LR model + PT + ROTE + J) 78 71 0.42
RF (PT + ROTE) 66 71 0.50
SVM (PT + ROTE) 73 72 0.50
XGBoost (PT + ROTE) 74 73 0.50
ANN (1-layer + PT + ROTE) 78 73 0.50
ANN (2-layer + PT + ROTE) 81 73 0.50
ANN (3-layer + PT + ROTE) 82 73 0.50

Significant values are in bold.

LR logistic regression, PT power transformation for normalizing the data, SMOTE synthetic minority over-sampling technique, ROTE random over-sampling technique, J Youden index, RF random forest, SVM support vector machine, ANN artificial neural network.

Using optimal cut-off points and pre-processing methods improved the sensitivity (from 6 to 78%) and AUC (from 53 to 71%) of the LR model (Table 4). We also examined four ML models: RF, SVM, XGBoost, and ANN. Among all models, the ANN with three hidden layers (described in Appendix 11) achieved the best performance, with a sensitivity of 82% and an AUC of 73%. Appendix 12 presents horizontal bar charts of the sensitivity and AUC values of all the models discussed in this study.
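Table 4 reports AUC values that change with the cut-off point, which an AUC computed from continuous predicted probabilities would not do. One reading consistent with this, offered here as an assumption because the source does not state how the AUC was computed, is that the AUC was evaluated on thresholded (0/1) predictions. In that case the AUC reduces to the mean of sensitivity and specificity, as the following sketch with invented toy data illustrates.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity from binary labels and binary predictions."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (invented): 4 deaths, 6 survivors.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.3, 0.2, 0.1, 0.4, 0.15, 0.1, 0.05, 0.05, 0.02]

results = {}
for cutoff in (0.5, 0.1):
    y_pred = [1 if p >= cutoff else 0 for p in y_prob]
    sens, spec = sensitivity_specificity(y_true, y_pred)
    results[cutoff] = (sens, spec, auc(y_true, y_pred))
    print(cutoff, sens, spec, results[cutoff][2])

# The AUC on the continuous scores is a single number, independent of any cut-off.
auc_continuous = auc(y_true, y_prob)
print(auc_continuous)
```

Under this reading, model 1's combination of 6% sensitivity and 53% AUC would correspond to a specificity close to 100% at the 0.5 cut-off, which is plausible for a rare outcome.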

Discussion

The primary outcome variable in this study was mortality within 60 days of a resident's first COVID-19 test. We aimed to identify factors associated with mortality among continuing care residents in Alberta during the COVID-19 pandemic and develop machine learning models to improve risk prediction for mortality. In this study, we employed a diverse range of databases, including EDW, ACCIS, NACRS, Claims, DAD, and AVS.

We identified several characteristics associated with mortality, including advanced age, male sex, metastatic cancer, chronic liver disease, having symptoms, previous SCU admissions, and previous hospital admissions. Combining these factors achieved better risk estimation in this population than considering them individually. Our findings align with previous research: increased age, male sex, previous intensive care unit (ICU) admissions, and chronic conditions such as cancer and chronic liver disease have already been recognized9,68–71 as risk factors for mortality in ill patients with COVID-19.

The outcome variable of a study conducted by Panagiotou et al.9 was death due to any cause within 30 days of a resident's first positive COVID-19 polymerase chain reaction test result. The study leveraged unique electronic medical record data and other clinical data from a large multistate sample of US nursing homes. The study found that increased age, male sex, impaired cognitive and physical function, diabetes, chronic kidney disease, fever, shortness of breath, tachycardia, and hypoxia were independently associated with mortality in US nursing home residents with COVID-19. Understanding these risk factors can aid in the development of clinical prediction models of mortality in this population. Both our study and Panagiotou et al.9 identified increased age and male sex as risk factors for mortality in COVID-19 patients. Both studies also highlighted previous ICU admissions and chronic conditions (e.g., cancer and chronic liver disease) as risk factors. However, Panagiotou et al.9 focused on US nursing home residents with a positive test, while our study covered LTC and DSL residents in Alberta with either a positive or a negative test.

The primary outcome variable in the study by Gupta et al.68 was 28-day in-hospital mortality. Other outcome variables included discharge from the hospital and remaining hospitalized at the end of the study follow-up. The study assessed 2215 adults with laboratory-confirmed COVID-19 who were admitted to intensive care units (ICUs) at 65 hospitals across the US from March 4 to April 4, 2020. The study identified demographic, clinical, and hospital-level risk factors that may be associated with death in critically ill patients with COVID-19. Factors independently associated with death included older age, male sex, higher body mass index, coronary artery disease, active cancer, and the presence of hypoxemia, liver dysfunction, and kidney dysfunction at ICU admission. Patients admitted to hospitals with fewer ICU beds had a higher risk of death. Hospitals varied considerably in the risk-adjusted proportion of patients who died and in the percentage of patients who received hydroxychloroquine, tocilizumab, and other treatments and supportive therapies. Both our study and Gupta et al.68 found older age and male sex to be risk factors for mortality in COVID-19 patients. However, Gupta et al.68 focused on ICU-admitted patients and identified additional risk factors, including higher body mass index, coronary artery disease, active cancer, and specific organ dysfunctions.

The outcome variables of the study performed by Kuderer et al.70 were severe COVID-19 illness, hospitalization, admission to the ICU, mechanical ventilation, and death. The data sources were electronic medical records and patient-reported outcomes. The key findings were that patients with cancer who contracted COVID-19 had a higher risk of severe illness, hospitalization, admission to the ICU, mechanical ventilation, and death compared to the general population. Additionally, patients with active cancer and those receiving cancer treatment had a higher risk of these outcomes compared to those with a history of cancer or those not receiving treatment. Our study, similar to Kuderer et al.70, revealed that cancer patients with COVID-19 faced an elevated risk of death. However, Kuderer et al.70 concentrated on COVID-19 outcomes in cancer patients, whereas our study encompassed a more diverse patient population.

Williamson et al.71 examined factors associated with COVID-19-related death. The study analyzed primary care records of 17,278,392 adults linked to 10,926 COVID-19-related deaths. Key findings showed associations between COVID-19-related death and male gender, greater age, deprivation, diabetes, severe asthma, and other medical conditions. Black and South Asian individuals had a higher risk, even after adjusting for other factors. The study provided valuable insights from one of the largest cohort studies on this topic. Our study, like Williamson et al.71, observed associations between death and male gender, greater age, and specific medical conditions. However, while Williamson et al.71 analyzed primary care records, our study used linked administrative databases.

Grasselli et al.69 evaluated independent risk factors associated with mortality of COVID-19 patients treated in ICUs in Lombardy, Italy. The study included 3988 critically ill patients with laboratory-confirmed COVID-19 referred for ICU admission from February 20 to April 22, 2020. Key findings revealed that older age, male sex, high fraction of inspired oxygen, high positive end-expiratory pressure or low PaO2:FiO2 ratio on ICU admission, and history of chronic obstructive pulmonary disease, hypercholesterolemia, and type 2 diabetes were independently associated with mortality. Our study, along with Grasselli et al.69, identified older age and male sex as risk factors for mortality in COVID-19 patients. However, Grasselli et al.69 focused on patients treated in ICUs, whereas our cohort comprised continuing care residents tested for COVID-19.

We also conducted gender-specific analyses to investigate the adjusted associations between clinical risk factors and 60-day all-cause mortality. While the overall results were similar for both genders, the association with being symptomatic was slightly stronger in females than in males. Notably, the aORs relating a positive COVID-19 test result to 60-day all-cause mortality were higher in males than in females. Additionally, while diabetes was significantly associated with mortality in males, no significant association was found in females. These findings emphasize the importance of gender-specific analyses to better understand the impact of these clinical risk factors on mortality outcomes.

We found that mortality rates were higher among LTC residents with COVID-19 than among DSL residents, even though DSL and LTC residents are almost the same in terms of characteristics such as age, sex, and comorbidities. Our study is the first in our jurisdiction to compare mortality rates between LTC and DSL using administrative data. We also found that selecting the cut-off point for predictive modeling with the Youden index improved sensitivity and AUC, which is aligned with previous studies72. In investigating pre-processing methods with administrative data and the LR model (to identify individuals at risk of 60-day mortality), we found that normalization (power transformation)73 and class balancing (over-sampling techniques)74 improved sensitivity and AUC, which is aligned with previous studies17,47,51,52,75. According to these studies, a large degree of class imbalance (in our dataset, the “survivor” class outnumbered the “death” class 9 to 1) lowers sensitivity, and therefore strategies to improve the model should be considered17. Most of these studies proposed over-sampling techniques to address the class imbalance.
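As a minimal sketch of the two ideas discussed above, the following standard-library Python shows random over-sampling of the minority class and selection of a cut-off by maximizing the Youden index, J = sensitivity + specificity - 1. It is an illustration on toy values only, not the implementation used in this study; the function names `random_oversample` and `youden_cutoff` and all data are hypothetical.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes
    have the same size. Assumes binary labels (0/1) with both classes present."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Sample minority indices with replacement to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A case is classified as positive when its score is >= cutoff.
    Ties are resolved in favor of the lowest cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Toy data: nine "survivor" rows for each "death" row (a 9-to-1 imbalance).
toy_rows = [[i] for i in range(10)]
toy_labels = [0] * 9 + [1]
bal_rows, bal_labels = random_oversample(toy_rows, toy_labels)

# Toy predicted risks and outcomes, used only to show cut-off selection.
toy_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
toy_outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, j = youden_cutoff(toy_scores, toy_outcomes)
```

In practice, over-sampling is typically applied to the training split only, so that the evaluation data keep the original class ratio.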

In this study, the ANN model achieved the best performance in terms of sensitivity and AUC, which aligns with previous studies. Sanderson et al.15 reported that an ANN model could outperform LR in both sensitivity (by 4%) and AUC (by 2%). The authors noted that ANN models can learn non-linear relationships between predictors and outcomes and scale well to large datasets. The ANN model has been advocated by other studies as well (refs. 59–64). In our study, the ANN model outperformed the standard LR (model 4) in sensitivity and AUC by 4% and 2%, respectively. These gains are modest, so ANN may not offer a significant improvement over LR. In many cases, LR's clearer interpretability combined with similar performance makes it the preferred choice. However, in scenarios where enhanced sensitivity is needed, such as with large datasets or specific organizational requirements, ANN may be a more suitable option.

We had several limitations that should be acknowledged. First, the administrative data used was not collected for this specific purpose. However, by using existing administrative data, we were able to obtain results that would have otherwise been restricted by the cost of primary data collection76. By using secondary data sources, certain variables were restricted from use or pre-defined as part of the Alberta COVID-19 Analytics and Research Database. For example, some variables were already coded, and data linkage had been done already. The residual and unmeasured confounding in terms of facility-level and patient-level characteristics limited us from further exploration into the difference in the risk of death between LTC and DSL residents.

Second, the cohort was defined based on continuing care residents being tested for COVID-19, which means that those not tested for COVID-19 were not included, potentially introducing a selection bias. However, we believe this was somewhat mitigated, as we captured most residents in continuing care based on previous estimates of the number of residents in continuing care77. Also, we had to rely on all-cause mortality data, as we did not have access to the specific data required for calculating cause-specific mortality78. The vital statistics data lacked cause-of-death details, preventing us from categorizing deaths as COVID-related or not. Although we acknowledge that cause-specific mortality would provide a more comprehensive understanding, we believe that all-cause mortality can still be a suitable proxy for deaths related to or indirectly caused by COVID-19.

Third, based on COVID-19 variants in Canada79, the Delta and Beta variants of COVID-19 emerged one month before and one month after November 2020, respectively. When considering the year and month of specimen collection in our cohort, the odds of death were highest in November 2020 and December 2020. However, we did not have enough data to know in which month the patients died.

Lastly, the Youden index, which was used in this study, summarizes the performance of a model at a given cut-off and is not directly related to calibration. We opted to use the cut-off point rather than calibration due to its simplicity and ease of interpretation. Calibration, on the other hand, concerns how closely the predicted probabilities match the observed outcomes, and assessing or improving it can be more complex and computationally demanding. Investigating calibration in different risk strata would provide additional information about the performance of the model. In terms of ML algorithms, the interplay between overfitting and underfitting was a challenge, as the success of these algorithms depends on selecting the parameters according to the number of observations and features80.
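To illustrate what investigating calibration in different risk strata could look like, the sketch below groups cases into equal-sized strata by predicted risk and compares the mean predicted risk with the observed event rate in each stratum. This analysis was not performed in the present study; the function name `calibration_by_strata`, the number of strata, and the toy values are assumptions made for illustration.

```python
def calibration_by_strata(probs, outcomes, n_strata=4):
    """Sort cases by predicted risk, split them into n_strata groups of
    (nearly) equal size, and return one tuple per non-empty group:
    (group size, mean predicted risk, observed event rate)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    table = []
    for k in range(n_strata):
        lo = k * len(order) // n_strata
        hi = (k + 1) * len(order) // n_strata
        idx = order[lo:hi]
        if not idx:
            continue  # fewer cases than strata
        mean_pred = sum(probs[i] for i in idx) / len(idx)
        obs_rate = sum(outcomes[i] for i in idx) / len(idx)
        table.append((len(idx), mean_pred, obs_rate))
    return table


# Toy predicted risks and outcomes (hypothetical values).
toy_probs = [0.1, 0.1, 0.2, 0.2, 0.6, 0.6, 0.8, 0.8]
toy_deaths = [0, 0, 0, 1, 1, 0, 1, 1]
strata = calibration_by_strata(toy_probs, toy_deaths, n_strata=2)
for size, mean_pred, obs_rate in strata:
    print(f"n={size}  mean predicted={mean_pred:.2f}  observed={obs_rate:.2f}")
```

A well-calibrated model shows mean predicted risks close to the observed rates in every stratum; large gaps in a particular stratum indicate where the predicted probabilities are least reliable.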

In conclusion, in this cohort study of Alberta continuing care residents tested for COVID-19, the all-cause 60-day mortality rate was 10.12%. COVID-19 test results and other characteristics, including metastatic cancer, chronic liver disease, advancing age, and male sex, were associated with an increased risk of death. LTC residents were also at higher risk of death than DSL residents. We examined pre-processing methods and ML models for predicting mortality and found that the combination of normalization, random oversampling, and three-layer neural network classification provided better prediction than the other ML models. Our findings may help inform treatment decisions in healthcare. Further exploration is needed to harness the potential of ANN models for improved outcomes. ANNs can offer data-driven insights that may support more precise diagnoses, optimized treatment plans, and better patient outcomes. Rigorous validation and collaboration between healthcare professionals and AI experts are crucial for safe and effective integration in clinical practice.

Acknowledgements

The authors gratefully acknowledge Dr. Geoffrey Messier and Dr. Ethan MacDonald for their invaluable advice and assistance, and the University of Calgary's high-performance computer team for their support.

Author contributions

F.S.H. developed the methodology, conducted the experiment(s), analyzed the results, and wrote the original manuscript. J.L. conceptualized and supervised the study, provided project administration, and acquired funding. E.R.M. conceptualized and supervised the study, provided project administration, and acquired funding. P.F. informed the methodology, and supervised analysis. A.S. acquired and curated the data for the study. A.C. supported the acquisition and curation of the study data. All authors reviewed the manuscript.

Funding

This study was funded by academic start-up funds to JL and ERM from the O’Brien Institute for Public Health, the Department of Medicine, and the Centre for Health Informatics at the University of Calgary, Calgary, Alberta, Canada.

Data availability

The data for this study were provided by AHS. Due to the sensitivity of the information, AHS has restricted the use of the data, and they are not publicly available. To obtain the data, researchers will need approval from a certified research ethics board in Alberta as well as a Data Sharing Agreement from AHS through ethics boards. The websites for ethics boards and data sharing agreements are located at https://hreba.ca/ and https://www.albertahealthservices.ca/research/Page16074.aspx, respectively.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-023-43943-9.

References

  • 1. Ponti G, Maccaferri M, Ruini C, Tomasi A, Ozben T. Biomarkers associated with COVID-19 disease progression. Crit. Rev. Clin. Lab. Sci. 2020;57(6):389–399. doi: 10.1080/10408363.2020.1770685.
  • 2. Lauring AS, Hodcroft EB. Genetic variants of SARS-CoV-2: What do they mean? JAMA. 2021;325(6):529–531. doi: 10.1001/jama.2020.27124.
  • 3. Lam S, Lombardi A, Ouanounou A. COVID-19: A review of the proposed pharmacological treatments. Eur. J. Pharmacol. 2020;886:173451. doi: 10.1016/j.ejphar.2020.173451.
  • 4. Liu M, Maxwell CJ, Armstrong P, Schwandt M, Moser A, McGregor MJ, et al. COVID-19 in long-term care homes in Ontario and British Columbia. CMAJ. 2020;192(47):E1540–E1546. doi: 10.1503/cmaj.201860.
  • 5. Ballin M, Bergman J, Kivipelto M, Nordström A, Nordström P. Excess mortality after COVID-19 in Swedish long-term care facilities. J. Am. Med. Direct. Assoc. 2021;22(8):1574–1580. doi: 10.1016/j.jamda.2021.06.010.
  • 6. Jin JM, Bai P, He W, Wu F, Liu XF, Han DM, et al. Gender differences in patients with COVID-19: Focus on severity and mortality. Front. Public Health. 2020 doi: 10.3389/fpubh.2020.00152.
  • 7. Stall NM, Jones A, Brown KA, Rochon PA, Costa AP. For-profit long-term care homes and the risk of COVID-19 outbreaks and resident deaths. CMAJ. 2020;192(33):E946. doi: 10.1503/cmaj.201197.
  • 8. Fisman DN, Bogoch I, Lapointe-Shaw L, McCready J, Tuite AR. Risk factors associated with mortality among residents with coronavirus disease 2019 (COVID-19) in long-term care facilities in Ontario, Canada. JAMA Netw. Open. 2020;3(7):e2015957. doi: 10.1001/jamanetworkopen.2020.15957.
  • 9. Panagiotou OA, Kosar CM, White EM, Bantis LE, Yang X, Santostefano CM, et al. Risk factors associated with all-cause 30-day mortality in nursing home residents with COVID-19. JAMA Int. Med. 2021;181(4):439–448. doi: 10.1001/jamainternmed.2020.7968.
  • 10. Slaughter S, Jones C, Eliasziw M, Ickert C, Estabrooks C, Wagg A. The changing landscape of continuing care in Alberta: Staff and resident characteristics in supportive living and long-term care. Healthc. Policy. 2018;14(1):44. doi: 10.12927/hcpol.2018.25549.
  • 11. Canadian Institutes of Health Information. Pandemic Experience in the Long-Term Care Sector: How Does Canada Compare with Other Countries? (CIHI, 2020). https://www.cihi.ca/sites/default/files/document/covid-19-rapid-response-long-term-care-snapshot-en.pdf.
  • 12. Thompson DC, Barbu MG, Beiu C, Popa LG, Mihai MM, Berteanu M, et al. The impact of COVID-19 pandemic on long-term care facilities worldwide: An overview on international issues. BioMed. Res. Int. 2020 doi: 10.1155/2020/8870249.
  • 13. King C, Strumpf E. Applying random forest in a health administrative data context: A conceptual guide. Health Serv. Outcomes Res. Methodol. 2022;22(1):96–117. doi: 10.1007/s10742-021-00255-7.
  • 14. Tiwari P, Colborn KL, Smith DE, Xing F, Ghosh D, Rosenberg MA. Assessment of a machine learning model applied to harmonized electronic health record data for the prediction of incident atrial fibrillation. JAMA Netw. Open. 2020;3(1):e1919396. doi: 10.1001/jamanetworkopen.2019.19396.
  • 15. Sanderson M, Bulloch AGM, Wang J, Williamson T, Patten SB. Predicting death by suicide using administrative health care system data: Can feedforward neural network models improve upon logistic regression models? J. Affect. Disord. 2019;257:741–747. doi: 10.1016/j.jad.2019.07.063.
  • 16. Ranganathan P, Pramesh CS, Aggarwal R. Common pitfalls in statistical analysis: Logistic regression. Perspect. Clin. Res. 2017;8(3):148. doi: 10.4103/picr.PICR_87_17.
  • 17. Japkowicz N, Stephen S. The class imbalance problem: A systematic study. Intell. Data Anal. 2002;6(5):429–449. doi: 10.3233/IDA-2002-6504.
  • 18. Ramírez J, Monasterio V, Mincholé A, Llamedo M, Lenis G, Cygankiewicz I, et al. Automatic SVM classification of sudden cardiac death and pump failure death from autonomic and repolarization ECG markers. J. Electrocardiol. 2015;48(4):551–557. doi: 10.1016/j.jelectrocard.2015.04.002.
  • 19. Chen, T. & Guestrin, C. Xgboost: A scalable tree boosting system. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016). 10.1145/2939672.2939785.
  • 20. Uusküla A, Jürgenson T, Pisarev H, Kolde R, Meister T, Tisler A, Suija K, Kalda R, Piirsoo M, Fischer K. Long-term mortality following SARS-CoV-2 infection: A national cohort study from Estonia. Lancet Reg. Health Eur. 2022 doi: 10.1016/j.lanepe.2022.100394.
  • 21. RECORD Reporting Guidelines. https://www.record-statement.org/. Accessed 15 Feb 2022.
  • 22. Health System Access for Research. https://www.albertahealthservices.ca/research/page8579.aspx. Accessed 13 April 2022.
  • 23. Provincial Health System Access—Home. https://extranet.ahsnet.ca/teams/AHSRA/SitePages/Home.aspx. Accessed 16 Feb 2022.
  • 24. Tate K, Hoben M, Grabusic C, Bailey S, Cummings GG. The association of service use and other client factors with the time to transition from home care to facility-based care. J. Am. Med. Direct. Assoc. 2022;23(1):133–140. doi: 10.1016/j.jamda.2021.06.027.
  • 25. Alberta Continuing Care Information System Data Standard. Version 1.0: Open Government. https://open.alberta.ca/publications/alberta-continuing-care-information-system-data-standard-version-1-0. Accessed 13 April 2022.
  • 26. National Ambulatory Care Reporting System metadata (NACRS)|CIHI. https://www.cihi.ca/en/national-ambulatory-care-reporting-system-metadata-nacrs. Accessed 31 Jan 2022.
  • 27. Canadian Institute for Health Information. NACRS Data Elements, 2021–2022. (CIHI, 2021). https://www.cihi.ca/sites/default/files/rot/nacrs-data-elements-2021-2022-en.pdf. Accessed 13 April 2022.
  • 28. Van Walraven C, Austin PC, Jennings A, Quan H, Forster AJ. A modification of the Elixhauser comorbidity measures into a point system for hospital death using administrative data. Med. Care. 2009;1:626–633. doi: 10.1097/MLR.0b013e31819432e5.
  • 29. Physician’s Resource Guide: Open Government. https://open.alberta.ca/publications/physician-s-resource-guide. Accessed 31 Jan 2022.
  • 30. Discharge Abstract Database metadata (DAD)|CIHI. https://www.cihi.ca/en/discharge-abstract-database-metadata-dad. Accessed 31 Jan 2022.
  • 31. Vital Statistics Forms. https://www.alberta.ca/vital-statistics-forms.aspx. Accessed 16 Jan 2022.
  • 32. Quan H, Sundararajan V, Halfon P, Fong A, Burnand B, Luthi JC, et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med. Care. 2005;43(11):1130–1139. doi: 10.1097/01.mlr.0000182534.19832.83.
  • 33. Chen G, Lix L, Tu K, Hemmelgarn BR, Campbell NR, McAlister FA, Quan H. Hypertension outcome and surveillance team: Influence of using different databases and ‘look back’ intervals to define comorbidity profiles for patients with newly diagnosed hypertension: Implications for health services researchers. PLoS ONE. 2016;11(9):e0162074. doi: 10.1371/journal.pone.0162074.
  • 34. Testa, C. C., Krieger, N., Chen, J. T. & Hanage, W. P. Visualizing the lagged connection between COVID-19 cases and deaths in the United States: An animation using per capita state-level data (January 22, 2020–July 8, 2020). HCPDS Work. Pap. 19, 4 (2020).
  • 35. Zhou W, Qin X, Hu X, Lu Y, Pan J. Prognosis models for severe and critical COVID-19 based on the Charlson and Elixhauser comorbidity indices. Int. J. Med. Sci. 2020;17(15):2257–2263. doi: 10.7150/ijms.50007.
  • 36. Amagasa S, Kashiura M, Yasuda H, Hayakawa M, Yamakawa K, Endo A, Ogura T, Hirayama A, Yasunaga H, Tagami T. Relationship between institutional intensive care volume prior to the COVID-19 pandemic and in-hospital death in ventilated patients with severe COVID-19. Sci. Rep. 2022;12(1):22318. doi: 10.1038/s41598-022-26893-6.
  • 37. Harrell FE. Binary logistic regression. In: Frank HE, editor. Regression Modeling Strategies: With Applications to Linear Models Logistic and Ordinal Regression and Survival Analysis. Springer; 2015. pp. 219–274.
  • 38. Diaz-Quijano FA. A simple method for estimating relative risk using logistic regression. BMC Med. Res. Methodol. 2012;12:1–6. doi: 10.1186/1471-2288-12-14.
  • 39. Szumilas M. Explaining odds ratios. J. Can. Acad. Child Adolesc. Psychiatry. 2010;19(3):227–229.
  • 40. Agarwal A, Saxena A. Malignant tumor detection using machine learning through scikit-learn. Int. J. Pure Appl. Math. 2018;119(15):2863–2874.
  • 41. Marneni D, Vemula S. Analysis of Covid-19 using machine learning techniques. In: Goswami T, Sinha GR, editors. Statistical Modeling in Machine Learning. Academic Press; 2023. pp. 37–53.
  • 42. Unal I. Defining an optimal cut-point value in ROC analysis: An alternative approach. Comput. Math. Methods Med. 2017 doi: 10.1155/2017/3762651.
  • 43. Youden WJ. Index for rating diagnostic tests. Cancer. 1950;3(1):32–35. doi: 10.1002/1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3.
  • 44. Fluss R, Faraggi D, Reiser B. Estimation of the Youden Index and its associated cutoff point. Biometr. J. 2005;47(4):458–472. doi: 10.1002/bimj.200410135.
  • 45. Perkins NJ, Schisterman EF. The Youden Index and the optimal cut-point corrected for measurement error. Biometr. J. 2005;47(4):428–441. doi: 10.1002/bimj.200410133.
  • 46. Krittanawong C, Virk HUH, Kumar A, Aydar M, Wang Z, Stewart MP, et al. Machine learning and deep learning to predict mortality in patients with spontaneous coronary artery dissection. Sci. Rep. 2021;11(1):1–10. doi: 10.1038/s41598-021-88172-0.
  • 47. Mahmoudi E, Kamdar N, Kim N, Gonzales G, Singh K, Waljee AK. Use of electronic medical records in development and validation of risk prediction models of hospital readmission: Systematic review. BMJ. 2020;369:958. doi: 10.1136/bmj.m958.
  • 48. Singh D, Singh B. Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 2020;97:105524. doi: 10.1016/j.asoc.2019.105524.
  • 49. Marmolejo-Ramos F, Cousineau D, Benites L, Maehara R. On the efficacy of procedures to normalize Ex-Gaussian distributions. Front. Psychol. 2015;5:1548. doi: 10.3389/fpsyg.2014.01548.
  • 50. García V, Sánchez JS, Marqués AI, Florencia R, Rivera G. Understanding the apparent superiority of over-sampling through an analysis of local information for class-imbalanced data. Expert Syst. Appl. 2020;158:113026. doi: 10.1016/j.eswa.2019.113026.
  • 51. Garcia-Carretero R, Roncal-Gomez J, Rodriguez-Manzano P, Vazquez-Gomez O. Identification and predictive value of risk factors for mortality due to listeria monocytogenes infection: Use of machine learning with a nationwide administrative data set. Bacteria. 2022;1(1):12–32. doi: 10.3390/bacteria1010003.
  • 52. Alsinglawi B, Alshari O, Alorjani M, Mubin O, Alnajjar F, Novoa M, et al. An explainable machine learning framework for lung cancer hospital length of stay prediction. Sci. Rep. 2022;12(1):607. doi: 10.1038/s41598-021-04608-7.
  • 53. ARC Cluster Guide: RCSWiki. https://rcs.ucalgary.ca/ARC_Cluster_Guide. Accessed 30 Jan 2023.
  • 54. Li, Z. An empirical study on bash language usage in Github. Master Thesis. (University of Waterloo, 2021). https://uwspace.uwaterloo.ca/handle/10012/17036.
  • 55. Breiman L. Random forests. Mach. Learn. 2001;45(1):5–32. doi: 10.1023/A:1010933404324.
  • 56. Ooka T, Johno H, Nakamoto K, Yoda Y, Yokomichi H, Yamagata Z. Random forest approach for determining risk prediction and predictive factors of type 2 diabetes: Large-scale health check-up data in Japan. BMJ Nutr. Prev. Health. 2021;4(1):140. doi: 10.1136/bmjnph-2020-000200.
  • 57. Yu W, Liu T, Valdez R, Gwinn M, Khoury MJ. Application of support vector machine modeling for prediction of common diseases: The case of diabetes and pre-diabetes. BMC Med. Inform. Decis. Making. 2010;10(1):1–7. doi: 10.1186/1472-6947-10-16.
  • 58. Ogunleye A, Wang QG. XGBoost model for chronic kidney disease diagnosis. IEEE/ACM Trans. Comput. Biol. Bioinform. 2020;17(6):2131–2140. doi: 10.1109/TCBB.2019.2911071.
  • 59. Shahid N, Rappon T, Berta W. Applications of artificial neural networks in health care organizational decision-making: A scoping review. PLoS ONE. 2019;14(2):e0212356. doi: 10.1371/journal.pone.0212356.
  • 60. Lee CW, Park JA. Assessment of HIV/AIDS-related health performance using an artificial neural network. Inf. Manag. 2001;38(4):231–238. doi: 10.1016/S0378-7206(00)00068-9.
  • 61. Jiang F, Jiang Y, Zhi H, Dong Y, Li H, Ma S, et al. Artificial intelligence in healthcare: Past, present and future. Stroke Vasc. Neurol. 2017;2:4. doi: 10.1136/svn-2017-000101.
  • 62. Bartosch-Härlid A, Andersson B, Aho U, Nilsson J, Andersson R. Artificial neural networks in pancreatic disease. Br. J. Surg. 2008;95(7):817–826. doi: 10.1002/bjs.6239.
  • 63. Goss EP, Vozikis GS. Improving health care organizational management through neural network learning. Health Care Manag. Sci. 2002;5(3):221–227. doi: 10.1023/A:1019760901191.
  • 64. Nolting J. Developing a neural network model for health care. Proc. AMIA Annu. Symp. 2006;2006:1049.
  • 65. Agatonovic-Kustrin S, Beresford R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J. Pharm. Biomed. Anal. 2000;22(5):717–727. doi: 10.1016/S0731-7085(99)00272-1.
  • 66. Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Bossuyt P, et al. Calibration: The Achilles heel of predictive analytics. BMC Med. 2019;17(1):1–7. doi: 10.1186/s12916-019-1466-7.
  • 67. Weisberg, S. Yeo-Johnson Power Transformations. (Department of Applied Statistics, University of Minnesota, 2001).
  • 68. Gupta S, Hayek SS, Wang W, Chan L, Mathews KS, Melamed ML, et al. Factors associated with death in critically ill patients with coronavirus disease 2019 in the US. JAMA Intern. Med. 2020;180(11):1436–1447. doi: 10.1001/jamainternmed.2020.3596.
  • 69. Grasselli G, Greco M, Zanella A, Albano G, Antonelli M, Bellani G, et al. Risk factors associated with mortality among patients with COVID-19 in intensive care units in Lombardy, Italy. JAMA Intern. Med. 2020;180(10):1345–1355. doi: 10.1001/jamainternmed.2020.3539.
  • 70. Kuderer NM, Choueiri TK, Shah DP, Shyr Y, Rubinstein SM, Rivera DR, et al. Clinical impact of COVID-19 on patients with cancer (CCC19): A cohort study. Lancet. 2020;395:10241. doi: 10.1016/S0140-6736(20)31187-9.
  • 71. Williamson EJ, Walker AJ, Bhaskaran K, Bacon S, Bates C, Morton CE, et al. Factors associated with COVID-19-related death using OpenSAFELY. Nature. 2020;584(7821):430–436. doi: 10.1038/s41586-020-2521-4.
  • 72. Unnikrishnan VK, Choudhari KS, Kulkarni SD, Nayak R, Kartha VB, Santhosh C. Analytical predictive capabilities of laser induced breakdown spectroscopy (LIBS) with principal component analysis (PCA) for plastic classification. RSC Adv. 2013;3(48):25872–25880. doi: 10.1039/C3RA44946G.
  • 73. Dairi A, Harrou F, Zeroual A, Hittawe MM, Sun Y. Comparative study of machine learning methods for COVID-19 transmission forecasting. J. Biomed. Inform. 2021;118:103791. doi: 10.1016/j.jbi.2021.103791.
  • 74. Mufti HN, Hirsch GM, Abidi SR, Abidi SSR. Exploiting machine learning algorithms and methods for the prediction of agitated delirium after cardiac surgery: Models development and validation study. JMIR Med. Inform. 2019;7(4):e14993. doi: 10.2196/14993.
  • 75. Bragg WH. On the absorption of α rays, and on the classification of the α rays from radium. Philos. Mag. J. Sci. 1994;8(48):719–725. doi: 10.1080/14786440409463245.
  • 76. Use of Administrative Data. https://www150.statcan.gc.ca/n1/pub/12-539-x/2009001/administrative-administratives-eng.htm. Accessed 21 Aug 2022.
  • 77. Alberta Long-Term Care Resident Profile: Alberta Long-Term Care Resident Profile 2016/2017: Open Government. https://open.alberta.ca/dataset/90c128a6-3a8e-4c6e-8591-58e88fe6b6f9/resource/894a3a9c-8999-4487-b7e5-2850b3bb1a2e/download/cc-ltc-resident-profile-2017.pdf. Accessed 21 Aug 2022.
  • 78. Arnold S, Glushko V. Cause-specific mortality rates: Common trends and differences. Insur. Math. Econ. 2021;99:294–308. doi: 10.1016/j.insmatheco.2021.03.027.
  • 79. Canada PHA of COVID-19 Daily Epidemiology Update. https://health-infobase.canada.ca/covid-19/epidemiological-summary-covid-19-cases.html. Accessed 15 April 2022.
  • 80. Shameer K, Johnson KW, Glicksberg BS, Dudley JT, Sengupta PP. Machine learning in cardiovascular medicine: Are we there yet? Heart. 2018;104(14):1156–1164. doi: 10.1136/heartjnl-2017-311198.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
