Abstract
Vaccination for the COVID-19 pandemic has raised serious concerns among the public, and various rumours are spread regarding the resulting illness, adverse reactions, and death. Such rumours can damage the campaign against COVID-19 and should be dealt with accordingly. One prospective solution is to use machine learning-based models to predict the death risk for vaccinated people by utilizing the available data. This study focuses on the prognosis of three significant events, 'not survived', 'recovered', and 'not recovered', based on the adverse events following the second dose of the COVID-19 vaccine. Extensive experiments are performed to analyse the efficacy of the proposed Extreme Regression-Voting Classifier model in comparison with machine learning models using Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, and Global Vectors features, and deep learning models such as Convolutional Neural Network, Long Short Term Memory, and Bidirectional Long Short Term Memory. Experiments are carried out on the original, as well as a balanced dataset obtained using the Synthetic Minority Oversampling Technique (SMOTE). Results reveal that the proposed voting classifier in combination with TF-IDF outperforms the other models with a 0.85 accuracy score on the SMOTE-balanced dataset. In line with this, the validation of the proposed voting classifier on binary classification shows state-of-the-art results with a 0.98 accuracy.
Keywords: COVID-19, post-vaccination symptoms, adverse reactions, machine learning
Introduction
The last two decades have witnessed many pandemics like SARS (Severe Acute Respiratory Syndrome), MERS (Middle East Respiratory Syndrome), COVID-19 (coronavirus disease 2019), etc. Recently, COVID-19 infected approximately 308 million people in 223 countries, leading to 5.492 million deaths as of 12 January 2022 1 . The ongoing COVID-19 pandemic impacted the individual, as well as the public, life of human beings on a global scale, and containing it seems to be very difficult in the near future. Although it may eventually be confined like other coronaviruses, such as HKU1, NL63, 229E, and OC43, the substantial human and financial loss remains the main concern 2 . Precautionary measures against COVID-19, such as sanitation procedures, physical distancing, personal hygiene, mask usage, disinfection of surfaces, and frequent hand washing, are essential to reduce its spread. However, the case fatality ratio (CFR), a measure of mortality among infected cases, continues to increase 3 . Facilitating a safe return to normal life while minimizing the risk of a COVID-19 resurgence requires immunity against COVID-19 4 , which several developed vaccines, such as Moderna, Pfizer (BioNTech), and Johnson & Johnson, aim to provide 5 . As of December 2020, several vaccines have been administered with different efficiency and immunity against COVID-19, as shown in Figure 1.
Similar to vaccines for other diseases, several side effects have been reported for COVID-19 vaccines. Reports of adverse side effects following the doses of COVID-19 vaccination are submitted to VAERS (Vaccine Adverse Event Reporting System). From 1 January 2021 to 19 March 2021, a total of 5351 adverse events were reported to VAERS. The adverse side effects range from mild to severe, such as fever, pain, diarrhoea, fatigue, blood pressure, chills, muscle pain, headache, and pain at the injection site, and are shown in Figure 2(a). Further reported effects include dizziness and severe allergic reactions. Similarly, several COVID-19 positive cases are reported after vaccination. Blood clotting, cardiac problems, and resulting deaths are also reported following adverse events such as cardiac arrest, abdominal pain, etc., as shown in Figure 2(b). There is also a theoretical risk that vaccination could make infection severe by enhancing the respiratory disease 6 . Such adverse reaction and death reports make it significantly important to analyse the data regarding the adverse effects of COVID-19 vaccines and to identify reactions with a higher probability of fatality, so as to assist healthcare professionals in prioritizing the cases with adverse effects and providing timely medical treatment.
ML (machine learning) refers to the self-regulated discovery of potentially valid and useful knowledge and novel hidden patterns from datasets 7 . ML models operate by revealing relationships and patterns among the data instances in single or multiple datasets. ML has been widely applied in the healthcare sector for its applications in simulating health outcomes, forecasting patient outcomes, and evaluating medicines 8 . In recent years, ML has also been extensively used in the diagnosis and prognosis of many diseases like COVID-19, as immense data is being generated regarding COVID-19 on an everyday basis, which can be analysed to predict COVID-19 cases and devise corresponding policies to contain the pandemic. In the same vein, data associated with adverse event reports following COVID-19 vaccination, gathered by VAERS, was made public on 27 January 2021, which motivated the current research.
This study demonstrates an enhanced ML-based prediction system to analyse the adverse events associated with the COVID-19 vaccine and predict individuals with symptoms that might cause fatality so that healthcare professionals can treat the individuals beforehand. It helps medical experts critically monitor vaccinated individuals with death risks. This study makes the following major contributions:
This study advocates a systematic approach to investigate the adverse events following the COVID-19 vaccine for symptoms that may lead to death. The prognosis of three significant events, 'not survived', 'recovered', and 'not recovered', is made in this regard.
A novel vote-based ER-VC (Extreme Regression-Voting Classifier) is devised which combines ET and LR under a soft voting criterion to increase the prediction accuracy. Extensive experiments are carried out for performance analysis with respect to many machine learning models, namely RF (Random Forest), LR (Logistic Regression), MLP (Multilayer Perceptron), GBM (Gradient Boosting Machine), AB (AdaBoost), kNN (k Nearest Neighbours), and ET (Extra Tree Classifier). In addition, LSTM (Long Short Term Memory), CNN (Convolutional Neural Network), and BiLSTM (Bidirectional LSTM) are also implemented for appraising the performance of the proposed approach.
To analyse the influence of data balancing, the performance of ML models is analysed and compared by integrating SMOTE (Synthetic Minority Oversampling Technique) for predicting the survival of vaccinated individuals.
The structure of this research is organized into five sections. Section 'Related work' presents the previous works related to this study. The proposed approach, ML models, and dataset description are provided in Section 'Material and methods'. Section 'Results and discussion' provides the analysis and discussion of the results. In the end, the study is concluded in Section 'Conclusion'.
Related work
The COVID-19 pandemic inflicted substantial economic and human losses worldwide. With unusual symptoms, the disease is difficult to treat based on previous methods used for treatment. However, the strong infrastructure of electronic health records and advanced technologies in recent times have helped in conducting several research studies and exploring its treatment. The data repositories of COVID-19 patients' symptoms and track records are maintained efficiently by medical and government institutions to explore health risks. Laboratory tests, radiological reports, and patients' symptoms have been analysed using ML models by many researchers. Early studies mostly focused on disease diagnosis and predicting the death rate of COVID-19 patients based on statistical models 9 . Later, hospital records of patients were used to identify potential risks 10 .
The exacerbated outbreak of the COVID-19 pandemic and its potential risk to human lives necessitated different medical research laboratories and pharma industries to start developing the COVID-19 vaccine at a fast pace. For providing herd immunity to people, there was a need for a safe and effective vaccine in a short time 11 . At the end of 2020, 48 vaccines were available at the clinical trial phase, and three vaccines including Pfizer, Moderna, and AstraZeneca completed this phase in the US 12 . During the first phase, millions of health professionals were vaccinated; then populations at higher risk, such as people older than 65 years, were covered 13 .
Severe outcomes leading to the death risk of COVID-19 patients are associated with different pre-existing medical conditions and comorbidities 14,15 . More than 40% of patients hospitalized with COVID-19 had at least one comorbidity 16 . In a similar study, the authors analysed comorbidities between survivor and non-survivor patients 17 . Common diseases included diabetes mellitus, cardiovascular disease, chronic obstructive pulmonary disease, hypertension, and kidney-related diseases. Various other biomarkers, such as C-reactive protein, high levels of ferritin, white lymphocyte count, blood cell count, procalcitonin, and d-dimer, are related to health risks and increase the mortality rate of COVID-19 patients 18 . These biomarkers and other symptoms could offer advantages in predicting death risks.
Various types of deep learning architectures have also been employed for different tasks. For example, a bidirectional neural network is proposed by Onan 19 that uses a group-wise enhancement mechanism for feature extraction. By dividing features into multiple groups, important features from each group can be obtained to increase performance. Similarly, a bidirectional LSTM model is presented by Onan and Korukoğlu 21 that combines term weighting using the inverse gravity moment with trigrams. Ensemble models are also reported to produce better results for sentiment analysis tasks 22,23 . Such models utilize different ensemble schemes, clustering, and feature extraction approaches for increased performance. For example, Onan 24 devises a feature extraction approach for sentiment analysis while Onan et al. 25 follow a hybrid ensemble model using the concept of consensus clustering. Similarly, Onan 26,27 adopts ensemble models for sentiment analysis and opinion mining 28 . Along the same lines, topic modelling is investigated using ensemble models by Onan 29,30 . The topic of sarcasm detection is covered by Onan 31 following a hybrid model approach, while Sadiq et al. 32 investigate aggression detection. Many ML-based techniques have also been explored using patients' symptoms and laboratory reports during hospitalization 33 . Researchers are diligent in defeating COVID-19 by exploring ways of COVID-19 detection 34 and devising frameworks to control the spread of the disease 35 . Researchers have also applied an ML model to electronic health records to predict the mortality rate of COVID-19 patients 36 . Meanwhile, the non-infected population benefits from vaccination. Because of heterogeneity among the population due to demographic categories, risk patterns regarding the COVID-19 disease and vaccine are difficult to predict. Different factors are involved in predicting death risks, such as unique health history, obesity, cancer history, hereditary diseases, and different immunity levels. Medical professionals are striving to allocate resources and provide help to maximize the survival probability.
This study makes a significant contribution toward maximizing the survival rate of vaccinated people by predicting the probability of fatal outcomes beforehand by analysing the post-vaccination symptoms. We leveraged growing electronic records and advanced predictive analytical methods to predict the risk associated with the side effects of COVID-19 vaccines.
Material and methods
This study works on the highly accurate prognosis of patients at risk of death, in addition to recovered and not recovered cases, concerning the adverse events reported after the second dose of the COVID-19 vaccine. Experiments in this research can be categorized into two stages, where Stage I deals with the multiclass classification of adverse events as 'not survived', 'recovered', and 'not recovered', while Stage II, or the validation stage, is concerned with the binary classification of the adverse reactions into 'survived' and 'not survived'. This section contains a brief description of the dataset utilized in this study, as well as the proposed methodology adopted for the classification tasks.
Dataset description
This study utilizes the COVID-19 VAERS dataset acquired from Kaggle, which is an open repository for benchmark datasets 37 . The dataset contains the adverse events reported by individuals after the COVID-19 vaccine along with details related to the particular individuals 38 . It consists of a total of 5351 records and 35 variables, details of which are given in Table 1. The study is concerned with investigating the death risk of vaccinated individuals by analysing the adverse events. On that account, we utilized only three variables, namely 'RECOVD', 'DIED', and 'SYMPTOM_TEXT', for multiclass classification and two variables, 'DIED' and 'SYMPTOM_TEXT', for binary classification. The variable 'DIED' comprises two classes, 'survived' and 'not survived', corresponding to 4541 and 810 records, respectively. The variable 'RECOVD' comprises three categories, 'recovered', 'not recovered', and 'recovery status unknown', corresponding to 1143, 2398, and 1810 records, respectively. Some of the 'DIED' cases are regarded as 'not recovered' while some belong to the 'recovery status unknown' category, as shown in Figure 3(a). The correspondence between the 'DIED' and 'RECOVD' features shows that a portion of the cases which did not recover from COVID-19 did not survive after being vaccinated. Figure 3(b) reveals that adverse events leading to the death of the vaccinated individuals comprise 15% of the dataset, which shows an unequal class distribution in both the binary and multiclass settings. For an effective analysis, we disregarded the records corresponding to 'recovery status unknown', except for the ones belonging to the 'not survived' category, in the multiclass classification.
Table 1.
Variable | Description |
---|---|
VAERS_ID | Identification number for each vaccinated case |
RECVDATE | Receiving date of adverse reactions report |
STATE | Region of the country from which report was received |
AGE_YRS | Age of vaccinated individual |
CAGE_YR | Age calculation of individual in years |
CAGE_MO | Age calculation of vaccinated individual in months |
SEX | Gender of vaccinated individual |
RPT_DATE | Date on which report form was completed |
SYMPTOM_TEXT | Reported symptoms |
DIED | Survival status |
DATEDIED | Date of death of vaccinated individual |
L_THREAT | Severe illness |
ER_VISIT | Visited doctor or emergency room |
HOSPITAL | Is hospitalized or not |
HOSPDAYS | Number of days individual was hospitalized |
X_STAY | Elongation of hospitalized days |
DISABLE | Disability status of vaccinated individual |
RECOVD | Recovery status of vaccinated individual |
VAX_DATE | Date on which individual was vaccinated |
ONSET_DATE | Onset date of adverse event |
NUMDAYS | Number of days between vaccination and onset (ONSET_DATE − VAX_DATE) |
LAB_DATA | Laboratory reports |
V_ADMINBY | Vaccine administration facility |
V_FUNDBY | Funds used by administration to buy vaccine |
OTHER_MEDS | Other medicines in use by vaccinated individual |
CUR_ILL | Information regarding illness of individual at the time of getting vaccinated |
HISTORY | Long-standing or chronic health-related conditions |
PRIOR_VAX | Information regarding prior vaccination |
SPLTTYPE | Manufacturer Report Number |
FORM_VERS | Version 1 or 2 of VAERS form |
TODAYS_DATE | Form completion date |
BIRTH_DEFECT | Birth defect |
OFC_VISIT | Clinic visit |
ER_ED_VISIT | Emergency room visit |
ALLERGIES | Allergies to any product |
Problem statement
Consider an individual who has received the second dose of the COVID-19 vaccine. Although the side effects of the COVID-19 vaccine are minor in most cases, in some cases they can cause death. To maximize the survival rate and to notify healthcare professionals beforehand, our study mines the adverse reactions of the COVID-19 vaccine reported to VAERS for the prognosis of death risks. This research looks at two scenarios: multiclass classification for recovery and survival rate analysis, and binary classification for survival rate analysis. The multiclass categorization is intended to help healthcare professionals evaluate the recovery status of vaccinated individuals as well as their fatality status, and it deals with three classes. In contrast, we devised the binary classification for emergency circumstances so that patients with a significantly higher risk can be treated ahead of time. It provides the models' accuracy regarding the prediction of survival chances of vaccinated people and helps health professionals treat the people at risk accordingly.
Proposed methodology
In this study, ML-based techniques are utilized for the analysis of adverse events caused by the COVID-19 vaccine. Figure 4 shows the architecture of the methodology adopted for the experiments, which each prediction model follows.
This study mainly follows multiclass classification, which involves classifying adverse reactions as 'not survived: vaccinated individuals that died due to adverse reactions', 'recovered: vaccinated individuals that recovered from COVID-19', and 'not recovered: individuals that tested positive for COVID-19 after vaccination'. In line with this, we integrated two data attributes, 'RECOVD' and 'DIED', as the target class, and one attribute, 'SYMPTOM_TEXT', as the feature set in our experiments. The 'RECOVD' data attribute has three values corresponding to 'recovered', 'not recovered', and 'recovery status unknown'. We utilized only the 'recovered' and 'not recovered' values from 'RECOVD' and the positive values from 'DIED' for Stage I experiments (a minimal sketch of this label construction is given after Table 2). This resulted in a total of 4351 instances, out of which 810 instances correspond to the 'not survived' target variable, 2398 to 'not recovered', and 1143 are labelled as 'recovered'. This uneven distribution of target variables can substantially degrade the performance of classifiers. To overcome this problem, we oversampled the minority target variables using SMOTE. Oversampling by SMOTE for binary and multiclass classification is summed up in Table 2.
Table 2.
Multiclass classification | Binary classification | |||
---|---|---|---|---|
Target variables | Original | SMOTE | Original | SMOTE |
Survived | – | – | 4541 | 4541 |
Not survived | 810 | 1712 | 810 | 4541 |
Recovered | 1142 | 1712 | – | – |
Not recovered | 1712 | 1712 | – | – |
Total records | 3664 | 5136 | 5351 | 9028 |
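For illustration, the Stage I label construction described above can be expressed as a short pandas sketch. The VAERS-style codes used below ('Y'/'N'/'U' in 'RECOVD' and 'Y' in 'DIED') are assumptions made for the sake of the example and may differ from the exact encoding of the downloaded file.

```python
# Minimal sketch of Stage I label construction (assumed VAERS-style codes).
import pandas as pd

df = pd.DataFrame({
    "SYMPTOM_TEXT": ["cardiac arrest", "fever, recovered fully", "ongoing fatigue"],
    "DIED":   ["Y", None, None],     # assumed: 'Y' marks death, blank otherwise
    "RECOVD": ["U", "Y", "N"],       # assumed: 'Y' recovered, 'N' not, 'U' unknown
})

def stage1_label(row):
    if row["DIED"] == "Y":
        return "not survived"
    if row["RECOVD"] == "Y":
        return "recovered"
    if row["RECOVD"] == "N":
        return "not recovered"
    return None  # 'recovery status unknown' cases are dropped unless they died

df["label"] = df.apply(stage1_label, axis=1)
df = df.dropna(subset=["label"])
```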
To reduce the training time and generalize the learning patterns for the classifiers, we integrated three feature extraction techniques, namely BoW (Bag of Words), TF-IDF (Term Frequency-Inverse Document Frequency), and GloVe (Global Vectors). Afterwards, data is split into train and test sets with a ratio of 0.8–0.2 (a minimal splitting sketch is given after Table 3). The number of train and test records corresponding to multiclass and binary classification is given in Table 3. Furthermore, ML classifiers, such as LR, ET, RF, GBM, AB, kNN, MLP, and the proposed voting classifier, learn the patterns regarding the target variable from the train set. Trained models are then tested on the unseen test data and evaluated under the criteria of accuracy, precision, recall, and F1 score.
Table 3.
Multiclass classification | Binary classification | |||
---|---|---|---|---|
Split set | Original | SMOTE | Original | SMOTE |
Train set | 2931 | 4108 | 4281 | 7211 |
Test set | 733 | 1028 | 1070 | 1817 |
SMOTE: synthetic minority oversampling approach.
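A minimal sketch of the 0.8–0.2 split, assuming `texts` and `labels` hold the cleaned 'SYMPTOM_TEXT' records and their Stage I labels (the `random_state` value is only illustrative):

```python
from sklearn.model_selection import train_test_split

texts = ["fever chills", "cardiac arrest", "headache recovered", "fatigue ongoing"]
labels = ["recovered", "not survived", "recovered", "not recovered"]

# 80% of the records are used for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=50)
```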
Data preprocessing
Data preprocessing aims at enhancing the quality of the raw input data so that meaningful information can be extracted from it. It involves the preparation of the input data, including cleaning and organization of the raw data, to effectively build and train the ML-based classifiers. In the current study, various steps are taken to clean, normalize, and transform the 'SYMPTOM_TEXT' field. We removed irrelevant data, including punctuation, numeric characters, and null values, from the input data. Since ML classifiers are sensitive to letter case, we normalized the text by converting it into lowercase for efficient training. Afterwards, we performed stemming using the PorterStemmer() function of NLTK (Natural Language Tool Kit) to convert words into their root forms. As the last step of preprocessing, we removed stop words, which are the most frequent words in the text and are not significant for classification.
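These preprocessing steps can be sketched as follows (a minimal example, assuming the NLTK stopword list has been downloaded; the small DataFrame is only for illustration):

```python
import re
import pandas as pd
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean_symptom_text(text: str) -> str:
    text = text.lower()                    # case normalization
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and numerics
    tokens = [t for t in text.split() if t not in stop_words]  # stop word removal
    return " ".join(stemmer.stem(t) for t in tokens)           # stemming

df = pd.DataFrame({"SYMPTOM_TEXT": ["Fever and chills after the 2nd dose.", None]})
df = df.dropna(subset=["SYMPTOM_TEXT"])    # drop null records
df["clean_text"] = df["SYMPTOM_TEXT"].apply(clean_symptom_text)
```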
Feature extraction
Feature extraction is a technique that involves the extraction of significant and effective features from the preprocessed data for improved performance of predictive models on unseen data. It follows the procedure of transforming arbitrary data and finding features that are correlated with the target variable. ML classifiers guided by a feature extraction technique tend to produce more accurate results 39 . Three feature extraction techniques, namely BoW, TF-IDF, and GloVe, are utilized in this study.
BoW is the vectorization of text data into numeric features. It represents word occurrences within the text regardless of information concerning their structure or position in the text, and each distinct word is considered a feature 40 . Unlike TF-IDF, it does not account for how informative a term is across the corpus; a term's occurrence in a record is the only factor that affects its weight.
TF-IDF quantifies a word in a document by computing the weight of each word, which in turn shows the significance of that word in the text 41 . The weight is determined by combining two metrics: TF (Term Frequency), a measure of the frequency of a word in a document, and IDF (Inverse Document Frequency), a measure of how rare the word is across the entire set of documents. Here, a document corresponds to one 'SYMPTOM_TEXT' record in the dataset. The TF-IDF weight of a word $t$ in document $d$ can be computed as follows:

$$\text{TF-IDF}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \times \log\frac{N}{n_t} \tag{1}$$

where $f_{t,d}$ is the frequency of word $t$ in 'SYMPTOM_TEXT' record $d$, $\sum_{t' \in d} f_{t',d}$ is the total number of words occurring in $d$, $N$ is the total number of 'SYMPTOM_TEXT' records, and $n_t$ is the number of 'SYMPTOM_TEXT' records in which word $t$ is present.
GloVe generates word embeddings of the given 'SYMPTOM_TEXT' by mapping the relationships between words. This is mainly done by aggregating global co-occurrence statistics, which provide information regarding the frequency of word pairs occurring together. Similar words are thus mapped close together in the embedding space based on the co-occurrence matrix of a corpus. Rather than training on the entire sparse matrix or on individual context windows in a large corpus, the GloVe model takes advantage of statistical information by training only on the nonzero elements of a word-word co-occurrence matrix 42 .
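The three representations can be produced as in the following minimal sketch; the GloVe part assumes a locally available pre-trained vector file, and averaging word vectors into a document vector is an illustrative choice rather than the exact procedure of this study:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["fever and chills", "cardiac arrest reported", "mild headache only"]

bow_features   = CountVectorizer().fit_transform(docs)    # BoW term counts
tfidf_features = TfidfVectorizer().fit_transform(docs)    # TF-IDF weights

def load_glove(path):
    """Read pre-trained GloVe vectors from a text file into a dict."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

def doc_vector(doc, emb, dim=100):
    """Average the vectors of in-vocabulary words to embed a document."""
    vecs = [emb[w] for w in doc.split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# glove = load_glove("glove.6B.100d.txt")   # hypothetical local path
# glove_features = np.vstack([doc_vector(d, glove) for d in docs])
```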
Data sampling
When a target variable is distributed unevenly in a dataset, it leads to misleading performance by the ML models. The reason for this is that ML models learn the decision boundary for the majority class with more efficacy than for the minority class; the resulting poor performance in predicting the minority class yields ambiguous and misleading results. Hence, changing the composition of an imbalanced dataset is one of the most well-known solutions to the problem of classifying an imbalanced dataset 43 . It can be done in two ways: undersampling or oversampling. Undersampling randomly reduces the majority class size and is mostly utilized when there is an ample amount of data instances, whereas oversampling arbitrarily duplicates the minority class and is effective when implemented on a small dataset. Since we have a limited number of records in our dataset, oversampling is the best fit for the proposed framework. One of the oversampling techniques is SMOTE 44 , which is utilized in the current study.
SMOTE selects data samples that are relatively close in the feature vector space and draws a line between those data samples 45 . It then generates synthetic data samples along these lines by considering the k nearest neighbours of each minority-class sample. This results in simulated data samples that are at a comparatively close distance in the feature space from the original minority-class samples.
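A minimal SMOTE sketch using the imbalanced-learn package is shown below; the synthetic three-class data stands in for the real TF-IDF feature matrix, and k = 5 neighbours is the library default rather than a value reported in this study:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced three-class data as a stand-in for the real features
X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           weights=[0.2, 0.3, 0.5], random_state=50)
print(Counter(y))                                     # imbalanced class counts

X_res, y_res = SMOTE(k_neighbors=5, random_state=50).fit_resample(X, y)
print(Counter(y_res))                                 # classes now equal in size
```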
ML classifiers
Supervised ML classifiers are utilized in this study for the prediction of target variables from the data. Implementation of the ML classifiers is done in Python using the scikit-learn module. ML classifiers are trained on data samples from the training set and tested using a test set that is unknown to the classifiers. The ML classifiers integrated in this study are briefly discussed here, and their corresponding hyperparameter settings are given in Table 4 (a minimal instantiation sketch follows Table 4).
Table 4.
Model | Hyperparameter settings |
---|---|
RF | n_estimators=100, random_state=50, max_depth=300 |
AB | n_estimators=100, random_state=50 |
LR | random_state=50, solver=‘saga’,multi_class=‘ovr’,C=3.0 |
MLP | random_state=50, max_iter=200 |
GBM | n_estimators=100, learning_rate=1, random_state=50 |
ET | n_estimators=100, random_state=50, max_depth=300 |
KNN | n_neighbors=5 |
RF: Random Forest; LR: Logistic Regression; MLP: Multilayer Perceptron; GBM: Gradient Boosting Machine; AB: AdaBoost, kNN: k Nearest Neighbours; ET: Extra Tree Classifier.
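The settings in Table 4 translate directly into scikit-learn estimators; the sketch below is only illustrative, and any parameter not listed in Table 4 keeps the library default:

```python
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

models = {
    "RF":  RandomForestClassifier(n_estimators=100, random_state=50, max_depth=300),
    "AB":  AdaBoostClassifier(n_estimators=100, random_state=50),
    "LR":  LogisticRegression(random_state=50, solver="saga",
                              multi_class="ovr", C=3.0),
    "MLP": MLPClassifier(random_state=50, max_iter=200),
    "GBM": GradientBoostingClassifier(n_estimators=100, learning_rate=1,
                                      random_state=50),
    "ET":  ExtraTreesClassifier(n_estimators=100, random_state=50, max_depth=300),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
```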
Random Forest is a tree-based ML classifier that integrates aggregated results obtained by fitting many decision trees on randomly selected training samples. Each decision tree in RF is generated based on selection indicators such as Gini Index, Gain Ratio, and Information Gain to select an attribute. It is a meta-estimator that can be used both for regression and classification tasks 46 .
AdaBoost, also referred to as adaptive boosting, is an iterative ensemble technique and a good choice for constructing ensemble classifiers. Combining numerous weak learners into a strong learner, it generates robust results. It is trained on weighted examples and provides optimized output by minimizing the error rate at each iteration 47 . AdaBoost adjusts the sample weights with respect to the classification results at each iteration: the weights of misclassified training samples are increased, while the weights of correctly classified samples are decreased. AdaBoost performs better due to its diversity of expansion, that is, it contains diverse classifiers.
Extra Tree Classifier is a collection of several de-correlated decision trees built from random sets of features extracted from training data. Each tree selects the best feature by computing its Gini Importance. ET incorporates averaging to control overfitting and enhance predictive accuracy 48 .
Logistic Regression is a statistical ML classifier that maps a given set of input features to a discrete set of target variables by approximating the probability using a sigmoid function. The sigmoid function is an S-shaped curve that restricts the output to a probability between 0 and 1, as defined in equation (3). It works efficiently for classification tasks 49 .

$$z = w^{\top}x + b \tag{2}$$

$$\sigma(z) = \frac{1}{1 + e^{-z}} \tag{3}$$

where $\sigma(z)$ shows the output in the range of 0 and 1, $z$ is the input to the sigmoid (the weighted sum of the input features $x$ with weights $w$ and bias $b$), and $e$ is the base of the natural log.
Multilayer Perceptron is an extensive feed-forward neural network that consists of three layers: an input layer, a hidden layer, and an output layer. MLP works by receiving input signals, which are processed at the input layer, and performing predictions at the output layer. The hidden layer is the significant computational mechanism of MLP and is situated between the input layer and the output layer. MLP is designed to map a nonlinear relationship between an input and its corresponding output vector 50 .
Gradient Boosting Machine is a boosting classifier that builds an ensemble of weak learners in an additive manner, which proves to be useful in enhancing the accuracy and efficiency of the learning model. It employs the gradient of a loss function to identify the errors of the preceding weak learner, and each weak learner in GBM attempts to minimize the error rate of its predecessor. It does so by integrating the loss function with its gradients. It efficiently handles missing values in the data 51 .
K-nearest neighbours is a straightforward ML classifier that maps the distance between a dependent variable and a target variable by adopting a particular number of k samples adjacent to the target variable. For classification, kNN predicts by considering the majority votes of the neighbouring data points for the prevalent target variable 52 .
Proposed extreme regression-voting classifier
ER-VC is a voting classifier that aggregates the output predictions of ET and LR to generate a final output. LR determines the significance of each feature of the trained samples along with providing the direction of its association with low time consumption, which makes LR a good fit for our proposed voting classifier. ET has been selected due to its randomizing property, which restrains the model from overfitting. The foundation of the proposed classifier is building an individual strong model instead of discrete models with low accuracy. It uses the same hyperparameter settings of the respective classifiers as described in Table 4. ER-VC is supported with the soft voting criterion, such that it generates a final prediction by averaging the probabilities assigned to each target class. The framework of the proposed ER-VC model is illustrated in Figure 5, and a minimal implementation sketch is given after Algorithm 1.
The working of the proposed ER-VC classifier is illustrated in Algorithm 1. The target class can be computed for the weights $w_1$ and $w_2$ assigned to the predictions made by classifier LR and classifier ET, respectively, as

$$P(L) = \{l_1, l_2, \ldots, l_n\} \tag{4}$$

$$P(E) = \{e_1, e_2, \ldots, e_n\} \tag{5}$$

$$\hat{y} = \operatorname*{argmax}_{c}\,\frac{1}{2}\bigl(w_1\, l_c + w_2\, e_c\bigr) \tag{6}$$

where $l_c$ and $e_c$ are the class-$c$ predictions (probabilities) made by LR and ET, respectively.
Algorithm 1.
Input: SYMPTOM_TEXT
Output: Vaccinated individual ← not survived or recovered or not recovered
Procedure: Data Splitting
Training_Samples = (SYMPTOM_TEXT, Labels)
Testing_Samples = (SYMPTOM_TEXT)
Training_Samples, Testing_Samples
Procedure: Voting Classifier
voting_criterion = "soft"
L = Logistic_Regression(Training_Samples)
E = Extra_Tree(Training_Samples)
Procedure: Predictions made by L
P(L) ← Testing_Samples
P(L) = l1, l2, l3, …, ln
Procedure: Predictions made by E
P(E) ← Testing_Samples
P(E) = e1, e2, e3, …, en
Final_Prediction ← argmax((P(L) + P(E)) / 2)
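A minimal sketch of the soft-voting ensemble described in Algorithm 1, expressed with scikit-learn's VotingClassifier (the base-learner settings follow Table 4; the commented calls assume TF-IDF feature matrices named X_train and X_test):

```python
from sklearn.ensemble import ExtraTreesClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

er_vc = VotingClassifier(
    estimators=[
        ("LR", LogisticRegression(random_state=50, solver="saga",
                                  multi_class="ovr", C=3.0)),
        ("ET", ExtraTreesClassifier(n_estimators=100, random_state=50,
                                    max_depth=300)),
    ],
    voting="soft",                 # average the predicted class probabilities
)

# er_vc.fit(X_train, y_train)      # train on the (SMOTE-balanced) train set
# y_pred = er_vc.predict(X_test)   # argmax over the averaged probabilities
```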
Evaluation criteria
When a model is proposed, it is crucial to evaluate its performance. Four outcomes are produced by ML models when tested with a test set: TP (True Positive), TN (True Negative), FP (False Positive), and FN (False Negative). TP shows correctly predicted positive instances, TN shows correctly predicted negative instances, FP shows wrongly predicted positive instances, and FN shows wrongly predicted negative instances. Using these outcomes, we evaluated the efficacy of our proposed framework in terms of accuracy, precision, recall, and F1 score, where accuracy is the measure of the correctness of the model, precision is the proportion of predicted positive instances that are correct, recall is the proportion of actual positive instances that are correctly identified, and F1 score is the harmonic mean of precision and recall. Mathematical formulas of the aforementioned evaluation parameters are given below:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{8}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{9}$$

$$\text{F1 score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{10}$$
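The four measures correspond directly to scikit-learn metric functions; in the multiclass setting an averaging scheme must be chosen, and the weighted average used below is an assumption for illustration:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = ["recovered", "not survived", "not recovered", "recovered"]
y_pred = ["recovered", "not survived", "recovered",     "recovered"]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="weighted", zero_division=0))
print("Recall   :", recall_score(y_true, y_pred, average="weighted", zero_division=0))
print("F1 score :", f1_score(y_true, y_pred, average="weighted", zero_division=0))
```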
Results and discussion
Extensive experiments have been performed using different scenarios for the prediction of three significant events in COVID-19 vaccinated people. In each scenario, ML models are trained utilizing three feature representation methods on an imbalanced and a SMOTE-balanced dataset. The feature representation methods TF-IDF, BoW, and GloVe have been chosen as they show remarkable results in text classification. Accordingly, we selected the most relevant ML models to classify symptoms: RF, LR, MLP, GBM, AB, kNN, ET, and the proposed ER-VC. Experiments are performed to identify the most effective combination of feature extraction methods with ML models to classify symptoms into 'recovered', 'not recovered', or 'not survived'.
Results for scenario 1
At first, experiments have been performed on an imbalanced dataset using TF-IDF, BoW, and GloVe. Results of the proposed voting classifier are compared with the other baseline classifiers in terms of multiclass classification. Results presented in Table 5 show that LR achieves the highest results with a 0.73 accuracy score using TF-IDF on the imbalanced dataset. However, ER-VC achieved a 0.72 accuracy score, which is the second-highest among all classifiers. It can be noticed that RF, ET, and MLP achieve a 0.71 accuracy value. Moreover, AB shows the worst result with a 0.64 accuracy value using TF-IDF on the imbalanced dataset. AB often cannot generalize well in the case of an imbalanced dataset.
Table 5.
Models | Accuracy | Precision | Recall | F1 score |
---|---|---|---|---|
RF | 0.71 | 0.70 | 0.71 | 0.70 |
AB | 0.64 | 0.65 | 0.64 | 0.64 |
ET | 0.71 | 0.70 | 0.71 | 0.70 |
LR | 0.73 | 0.73 | 0.73 | 0.72 |
MLP | 0.71 | 0.71 | 0.71 | 0.71 |
GBM | 0.70 | 0.70 | 0.70 | 0.70 |
kNN | 0.66 | 0.65 | 0.66 | 0.65 |
ER-VC | 0.72 | 0.72 | 0.72 | 0.71 |
RF: Random Forest; LR: Logistic Regression; MLP: Multilayer Perceptron; GBM: Gradient Boosting Machine; AB: AdaBoost, kNN: k Nearest Neighbours; ET: Extra Tree Classifier; TF-IDF: Term Frequency-Inverse Document Frequency; SMOTE: Synthetic Minority Oversampling Approach; ER-VC: Extreme Regression-Voting Classifier.
Results presented in Table 6 indicate that using BoW as a feature representation method improves the results of most of the classifiers on the imbalanced dataset. From Table 6, it can be observed that BoW does not improve the performance of MLP and kNN. The proposed voting classifier, ER-VC achieves a 0.74 accuracy score using BoW which is 2% higher than what is achieved by TF-IDF using an imbalanced dataset.
Table 6.
Models | Accuracy | Precision | Recall | F1 score |
---|---|---|---|---|
RF | 0.71 | 0.71 | 0.71 | 0.70 |
AB | 0.68 | 0.69 | 0.68 | 0.68 |
ET | 0.73 | 0.73 | 0.73 | 0.72 |
LR | 0.72 | 0.72 | 0.72 | 0.71 |
MLP | 0.71 | 0.71 | 0.71 | 0.71 |
GBM | 0.73 | 0.73 | 0.73 | 0.72 |
kNN | 0.52 | 0.55 | 0.52 | 0.50 |
ER-VC | 0.74 | 0.74 | 0.74 | 0.74 |
RF: Random Forest; LR: Logistic Regression; MLP: Multilayer Perceptron; GBM: Gradient Boosting Machine; AB: AdaBoost, kNN: k Nearest Neighbours; ET: Extra Tree Classifier; BoW: Bag of Words; SMOTE: Synthetic Minority Oversampling Approach; ER-VC: Extreme Regression-Voting Classifier.
Table 7 shows the results of ML models when combined with GloVe features for the classification of an imbalanced dataset. A significant drop in the performance of ML classifiers can be observed. However, MLP yields the highest accuracy score of 0.65 whereas, the proposed ER-VC model does not perform well and acquired a 0.60 accuracy with GloVe features.
Table 7.
Models | Accuracy | Precision | Recall | F1 score |
---|---|---|---|---|
RF | 0.60 | 0.59 | 0.59 | 0.59 |
AB | 0.57 | 0.55 | 0.55 | 0.54 |
LR | 0.59 | 0.58 | 0.59 | 0.55 |
MLP | 0.65 | 0.63 | 0.65 | 0.63 |
ET | 0.61 | 0.59 | 0.59 | 0.58 |
GBM | 0.57 | 0.57 | 0.57 | 0.57 |
kNN | 0.55 | 0.54 | 0.55 | 0.54 |
ER-VC | 0.60 | 0.59 | 0.60 | 0.57 |
RF: Random Forest; LR: Logistic Regression; MLP: Multilayer Perceptron; GBM: Gradient Boosting Machine; AB: AdaBoost, kNN: k Nearest Neighbours; ET: Extra Tree Classifier; GloVe: Global Vectors; SMOTE: Synthetic Minority Oversampling Approach; ER-VC: Extreme Regression-Voting Classifier.
Results for scenario 2
The second scenario deals with the problem of imbalanced class distribution by the implementation of SMOTE. Data instances of the minority class are increased by oversampling to make a balanced dataset. Afterwards, ML models have been trained using TF-IDF, BoW, and GloVe on SMOTE-balanced datasets. The results of ML models using TF-IDF are presented in Table 8. It can be seen that SMOTE significantly improves the performance of ML models. As revealed by the results, SMOTE contributes to improving the models’ classification results, and six out of eight models achieved higher than 80% results. SMOTE increases data instances of minority class by considering their distance to the nearest neighbours of the minority class. In this way, the size of the minority class is increased by adding new data samples and making them appropriate for the training of the models. Hence the proposed voting classifier, ER-VC, which combines LR and ET outperforms other models and carries out prediction tasks with 0.85 accuracy, 0.85 precision, 0.85 recall, and 0.84 F1 scores.
Table 8.
Models | Accuracy | Precision | Recall | F1 score |
---|---|---|---|---|
RF | 0.81 | 0.82 | 0.81 | 0.81 |
AB | 0.71 | 0.72 | 0.71 | 0.71 |
ET | 0.82 | 0.83 | 0.82 | 0.82 |
LR | 0.82 | 0.82 | 0.82 | 0.82 |
MLP | 0.81 | 0.81 | 0.81 | 0.81 |
GBM | 0.80 | 0.81 | 0.80 | 0.80 |
kNN | 0.64 | 0.73 | 0.64 | 0.55 |
ER-VC | 0.85 | 0.85 | 0.85 | 0.84 |
RF: Random Forest; LR: Logistic Regression; MLP: Multilayer Perceptron; GBM: Gradient Boosting Machine; AB: AdaBoost, kNN: k Nearest Neighbours; ET: Extra Tree Classifier; TF-IDF: Term Frequency-Inverse Document Frequency; SMOTE: Synthetic Minority Oversampling Approach; ER-VC: Extreme Regression-Voting Classifier.
Furthermore, the ML models are trained on the BoW feature representation technique. The performance of the models is compared in terms of classification results. Results shown in Table 9 prove that ML models using BoW do not achieve as robust results as achieved using TF-IDF on the SMOTE-balanced dataset.
Table 9.
Models | Accuracy | Precision | Recall | F1 score |
---|---|---|---|---|
RF | 0.78 | 0.79 | 0.78 | 0.78 |
AB | 0.73 | 0.75 | 0.73 | 0.74 |
ET | 0.78 | 0.78 | 0.78 | 0.78 |
LR | 0.79 | 0.79 | 0.79 | 0.79 |
MLP | 0.75 | 0.75 | 0.75 | 0.75 |
GBM | 0.77 | 0.78 | 0.77 | 0.77 |
kNN | 0.60 | 0.70 | 0.60 | 0.55 |
ER-VC | 0.81 | 0.81 | 0.81 | 0.81 |
RF: Random Forest; LR: Logistic Regression; MLP: Multilayer Perceptron; GBM: Gradient Boosting Machine; AB: AdaBoost, kNN: k Nearest Neighbours; ET: Extra Tree Classifier; BoW: Bag of Words; SMOTE: Synthetic Minority Oversampling Approach; ER-VC: Extreme Regression-Voting Classifier.
Finally, ML models are combined with GloVe features for the classification of adverse reactions. The results reveal an overall decrease in the performance of ML models as shown in Table 10. However, a significant improvement in the results can be observed on the SMOTE-balanced dataset as compared to the performance of ML models when integrated with GloVe features on imbalanced data. Consequently, it proves that the BoW and GloVe feature representation techniques are not very effective in improving the performance of the models on the SMOTE-balanced dataset. However, SMOTE significantly improves the performance of ML models in classifying adverse events as ‘not-survived’, ‘recovered’, and ‘not recovered’.
Table 10.
Models | Accuracy | Precision | Recall | F1 score |
---|---|---|---|---|
RF | 0.73 | 0.73 | 0.73 | 0.73 |
AB | 0.58 | 0.58 | 0.58 | 0.58 |
ET | 0.75 | 0.75 | 0.75 | 0.75 |
LR | 0.60 | 0.59 | 0.60 | 0.59 |
MLP | 0.65 | 0.67 | 0.65 | 0.69 |
GBM | 0.63 | 0.63 | 0.63 | 0.63 |
kNN | 0.64 | 0.64 | 0.64 | 0.63 |
ER-VC | 0.73 | 0.73 | 0.73 | 0.73 |
RF: Random Forest; LR: Logistic Regression; MLP: Multilayer Perceptron; GBM: Gradient Boosting Machine; AB: AdaBoost, kNN: k Nearest Neighbours; ET: Extra Tree Classifier; GloVe: Global Vectors; SMOTE: Synthetic Minority Oversampling Approach; ER-VC: Extreme Regression-Voting Classifier.
Performance analysis of ML models using different features
Figure 6(a) presents the accuracy comparison of ML models using BoW, TF-IDF, and GloVe without SMOTE while Figure 6(b) shows the performance comparison of ML models using BoW, TF-IDF, and GloVe using the SMOTE-balanced data. It can be observed that a substantial improvement in the accuracy of ML models occurred when they are trained using the SMOTE data.
Figure 7(a) presents the accuracy comparison of ML models using TF-IDF with and without SMOTE, Figure 7(b) presents the accuracy comparison of ML models using BoW with and without SMOTE while Figure 7(c) shows the accuracy comparison of ML models using GloVe with and without SMOTE. It shows that the results obtained by using BoW on the SMOTE-balanced dataset are better than the results achieved by using BoW on the imbalanced dataset. On the other side, the results of the models using BoW on the SMOTE-balanced dataset are 4% lower than the results obtained by using TF-IDF on the SMOTE-balanced dataset.
Performance comparison with deep neural networks
To substantiate the performance of the proposed voting classifier, it is also compared with deep learning models. Four deep learning models are used for these experiments, namely LSTM 53 , CNN 54 , CNN-LSTM 56 , and BiLSTM 55 . Layered architecture and hyperparameter values are presented in Figure 8. The architecture of these models is based on the best results and optimized hyperparameters.
The same training and test split ratios are used for deep learning models. The deep learning models are used for experiments considering both the original and the SMOTE-balanced datasets. The training and testing accuracy curve of the used deep learning models is shown in Figure 9.
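For reference, a generic Keras sketch of one of the compared sequence models (an LSTM text classifier) is given below; the layer sizes, vocabulary size, and training settings are illustrative assumptions and do not reproduce the exact architecture of Figure 8:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

vocab_size, num_classes = 5000, 3                       # assumed tokenizer setting

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=100),    # word embedding layer
    LSTM(64),                                           # recurrent feature extractor
    Dropout(0.2),
    Dense(num_classes, activation="softmax"),           # three-class output
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train_seq, y_train, validation_split=0.1, epochs=10, batch_size=32)
```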
Classification results of deep learning models with and without SMOTE are presented in Table 11. It can be observed that LSTM achieves the highest result with a 0.70 value of accuracy, precision, recall, and F1 score without using SMOTE. CNN has shown the lowest result on the imbalanced dataset. Given the small size of training data available for the deep neural networks, the performance is not good. However, using the SMOTE-balanced dataset, CNN-LSTM has achieved the highest accuracy score of 0.82 followed by similar precision, recall, and F1 score. LSTM and CNN have yielded 0.81 accuracy, precision, recall, and F1 scores. However, these values are lower than the proposed model, namely ER-VC. Despite that, results for deep learning models confirm that SMOTE has significantly improved the performance of CNN-LSTM, LSTM, and CNN models while BiLSTM has achieved similar results with and without SMOTE.
Table 11.
Dataset | Models | Acc. | Prec. | Rec. | F1 |
---|---|---|---|---|---|
No SMOTE | LSTM | 0.70 | 0.70 | 0.70 | 0.70 |
CNN | 0.64 | 0.65 | 0.65 | 0.64 | |
CNN-LSTM | 0.67 | 0.67 | 0.67 | 0.67 | |
BiLSTM | 0.69 | 0.69 | 0.69 | 0.69 | |
SMOTE | LSTM | 0.81 | 0.81 | 0.81 | 0.81 |
CNN | 0.81 | 0.81 | 0.81 | 0.81 | |
CNN-LSTM | 0.82 | 0.82 | 0.82 | 0.82 | |
BiLSTM | 0.69 | 0.69 | 0.69 | 0.69 |
LSTM: Long Short Term Memory; CNN: Convolutional Neural Network; BiLSTM: Bidirectional LSTM; SMOTE: Synthetic Minority Oversampling Approach.
Results with data splitting prior to SMOTE
To show the significance of the proposed model, this study also deployed another approach where the SMOTE technique is applied to the training set only. The data is split into training and test subsets and SMOTE is applied only to the training set to balance the samples of different classes. The results of machine learning models given in Table 12 reveal a drop in the performance of the learning models; however, the proposed ER-VC model still shows better results with this approach, outperforming all other models with a 0.75 accuracy score.
Table 12.
Model | Accuracy | Precision | Recall | F1 score |
---|---|---|---|---|
TF-IDF | ||||
RF | 0.71 | 0.71 | 0.71 | 0.71 |
AB | 0.67 | 0.69 | 0.67 | 0.68 |
ET | 0.72 | 0.72 | 0.72 | 0.72 |
LR | 0.74 | 0.74 | 0.74 | 0.74 |
MLP | 0.70 | 0.69 | 0.70 | 0.70 |
GBM | 0.71 | 0.72 | 0.71 | 0.71 |
KNN | 0.52 | 0.70 | 0.52 | 0.42 |
ER-VC | 0.75 | 0.75 | 0.75 | 0.75 |
GloVe | ||||
RF | 0.61 | 0.61 | 0.61 | 0.61 |
AB | 0.54 | 0.55 | 0.54 | 0.54 |
ET | 0.58 | 0.57 | 0.58 | 0.58 |
LR | 0.60 | 0.60 | 0.60 | 0.59 |
MLP | 0.62 | 0.62 | 0.62 | 0.62 |
GBM | 0.54 | 0.54 | 0.54 | 0.54 |
KNN | 0.55 | 0.57 | 0.55 | 0.55 |
ER-VC | 0.63 | 0.62 | 0.63 | 0.62 |
BoW | ||||
RF | 0.69 | 0.71 | 0.69 | 0.70 |
AB | 0.67 | 0.69 | 0.67 | 0.68 |
ET | 0.58 | 0.57 | 0.58 | 0.58 |
LR | 0.73 | 0.73 | 0.73 | 0.73 |
MLP | 0.62 | 0.62 | 0.62 | 0.62 |
GBM | 0.72 | 0.73 | 0.72 | 0.72 |
KNN | 0.55 | 0.57 | 0.55 | 0.55 |
ER-VC | 0.73 | 0.74 | 0.73 | 0.74 |
RF: Random Forest; LR: Logistic Regression; MLP: Multilayer Perceptron; GBM: Gradient Boosting Machine; AB: AdaBoost, kNN: k Nearest Neighbours; ET: Extra Tree Classifier; GloVe: Global Vectors; SMOTE: Synthetic Minority Oversampling Approach; ER-VC: Extreme Regression-Voting Classifier; TF-IDF: Term Frequency-Inverse Document Frequency; BoW: Bag of Words.
Table 13 presents the results of deep learning models when trained with SMOTE-balanced data and tested with original data. A notable decline in the performance of models is discerned. However, in this case, as well, the performance of deep learning models did not exceed the performance of our proposed ER-VC classifier.
Table 13.
Model | Accuracy | Precision | Recall | F1 score |
---|---|---|---|---|
LSTM | 0.61 | 0.62 | 0.61 | 0.61 |
CNN | 0.61 | 0.61 | 0.61 | 0.61 |
CNN-LSTM | 0.63 | 0.63 | 0.63 | 0.63 |
BiLSTM | 0.61 | 0.61 | 0.61 | 0.61 |
LSTM: Long Short Term Memory; CNN: Convolutional Neural Network; BiLSTM: Bidirectional LSTM; SMOTE: Synthetic Minority Oversampling Approach.
Validation of proposed approach for binary classification
The current study validates the proposed ER-VC model by predicting the survival status of the vaccinated individuals. In accordance with this, we integrated 'SYMPTOM_TEXT' as the feature and 'DIED' as the target class. It involves a total of 5351 data instances, among which 810 are labelled as 'not survived' and the remaining 4541 are labelled as 'survived'. The proposed ER-VC model is trained on the 80% train data, which is preprocessed and balanced using SMOTE. Experimental results after testing ER-VC on binary classification are shown in Table 14. Empirical results show that the proposed ER-VC model manifests state-of-the-art performance in the prognosis of death risks by analysing the adverse events reported to VAERS. Concerning the feature set, TF-IDF leads with a 0.98 accuracy score owing to its ability to extract features with more predictive information regarding the target variables, as compared to BoW, which only provides a feature set of terms irrespective of their importance in the document, and GloVe, which is inefficient when it comes to unknown words.
Table 14.
Feature | Acc. | Class | Prec. | Rec. | F1 |
---|---|---|---|---|---|
BoW | 0.96 | Survived | 0.97 | 0.96 | 0.97 |
Not-survived | 0.96 | 0.97 | 0.96 | ||
Weighted avg | 0.96 | 0.96 | 0.96 | ||
TF-IDF | 0.98 | Survived | 0.98 | 0.98 | 0.98 |
Not-survived | 0.98 | 0.98 | 0.98 | ||
Weighted avg | 0.98 | 0.98 | 0.98 | ||
GloVe | 0.91 | Survived | 0.93 | 0.89 | 0.91 |
Not-survived | 0.89 | 0.93 | 0.91 | ||
Weighted avg | 0.91 | 0.91 | 0.91 |
GloVe: Global Vectors; ER-VC: Extreme Regression-Voting Classifier; TF-IDF: Term Frequency-Inverse Document Frequency; BoW: Bag of Words.
Figure 10 demonstrates the number of instances predicted correctly for each target variable. It can be observed that ER-VC wrongly predicted only 30 instances out of a total of 1817 instances when integrated with TF-IDF features, as shown in Figure 10(a). Contrarily, Figure 10(b) shows that ER-VC in combination with BoW features made 64 wrong predictions out of 1817 instances. In the case of GloVe features, the wrong predictions total 162, which shows its poor performance in binary classification, as presented in Figure 10(c). BoW generates features irrespective of their importance concerning the target class, whereas TF-IDF, with its ability to extract features that are significant relative to the analysis, excels in its performance. This resulted in an effective and robust prognosis of death risks following the COVID-19 vaccine using the proposed ER-VC model combined with TF-IDF features.
To further show the significance of the validation, we also conducted experiments by applying SMOTE on the training set only for the binary classification. Table 15 shows that the performance of the model follows a similar trend as shown in Table 14.
Table 15.
Feature | Acc. | Class | Prec. | Rec. | F1 |
---|---|---|---|---|---|
BoW | 0.94 | Survived | 0.97 | 0.96 | 0.96 |
Not-survived | 0.79 | 0.83 | 0.81 | ||
Weighted avg | 0.94 | 0.94 | 0.94 | ||
TF-IDF | 0.96 | Survived | 0.97 | 0.98 | 0.98 |
Not-survived | 0.90 | 0.85 | 0.88 | ||
Weighted avg | 0.96 | 0.96 | 0.96 | ||
GloVe | 0.86 | Survived | 0.95 | 0.88 | 0.92 |
Not-survived | 0.53 | 0.74 | 0.62 | ||
Weighted avg | 0.89 | 0.86 | 0.87 |
GloVe: Global Vectors; ER-VC: Extreme Regression-Voting Classifier; TF-IDF: Term Frequency-Inverse Document Frequency; BoW: Bag of Words; SMOTE: Synthetic Minority Oversampling Approach.
Conclusion
The COVID-19 vaccine has caused different symptoms and adverse reactions in different individuals, ranging from mild to severe, and many deaths have also been reported post-COVID-19 vaccination. Analysing post-vaccination symptoms can play an important role in understanding the relation between different symptoms and fatality, thereby helping health professionals escalate serious cases and take timely precautionary measures. This study proposes a framework to analyse the adverse events caused by the COVID-19 vaccine leading to death so that health professionals are alerted beforehand. The proposed model predicted three significant events, 'not survived', 'recovered', and 'not recovered', based on the adverse events following the second dose of the COVID-19 vaccine. Keeping in view the data imbalance, experiments are performed using the original dataset, as well as the SMOTE-balanced dataset. The efficacy of the proposed voting classifier ER-VC is investigated in comparison with many well-known machine learning models using TF-IDF, BoW, and GloVe features, and with deep learning models. After extensive experiments, it is concluded that BoW and GloVe are not effective for the classification of COVID-19 vaccine symptoms. TF-IDF, on the other hand, has shown significant improvement in the classification of vaccine symptoms when it is applied to the SMOTE-balanced dataset. Experimental results prove that the proposed voting classifier surpasses other models with a 0.85 accuracy score using TF-IDF on the SMOTE-balanced dataset. Moreover, the comparison with benchmark state-of-the-art deep neural networks confirms that the performance of ER-VC is significantly better than that of the deep learning models. Furthermore, the effectiveness of the proposed model has been proved by experiments on binary classification where the model shows robust results with a 0.98 accuracy score. Machine learning models and deep neural networks tend to perform better given a larger dataset; therefore, in the future, we plan to incorporate a larger dataset for more accurate results.
Footnotes
Contributorship: Eysha Saad (ES) and Saima Sadiq (SS) conceived the idea and performed analysis. Ramish Jamil (RJ) and Furqan Rustam (FR) performed the data curation and formal analysis. Arif Mehmood (AM) and Gyu Sang Choi (GSC) provided the resources and software. Imran Ashraf (IA) supervised the work. SS, RJ, and FR conducted experiments. ES wrote the initial manuscript, IA did the write-review & editing. All authors reviewed the manuscript.
Declaration of Conflicting Interests: The authors declare that there is no conflict of interest.
Ethical approval: Not Applicable.
Funding: This work was supported in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education under Grant NRF-2019R1A2C1006159 and Grant NRF-2021R1A6A1A03039493.
Guarantor: Not applicable.
ORCID iD: Imran Ashraf https://orcid.org/0000-0002-8271-6496
References
- 1.WHO. Coronavirus disease (Covid-19), 2021. https://www.who.int/emergencies/diseases/novel-coronavirus-2019.
- 2.Lv H, Wu NC, Mok CK. Covid-19 vaccines: Knowing the unknown. Eur J Immunol 2020; 507: 939–943.
- 3.WHO. Estimating mortality from covid-19. Scientific brief, 4 August 2020. Technical Report, World Health Organization, 2020.
- 4.Shah A, Marks PW, Hahn SM. Unwavering regulatory safeguards for covid-19 vaccines. Jama 2020; 32410: 931–932.
- 5.Voysey M, Clemens SAC, Madhi SA, et al. Safety and efficacy of the chadox1 ncov-19 vaccine (azd1222) against sars-cov-2: an interim analysis of four randomised controlled trials in Brazil, South Africa, and the UK. Lancet 2021; 39710269: 99–111.
- 6.Corey L, Mascola JR, Fauci AS, et al. A strategic approach to covid-19 vaccine r&d. Science 2020; 3686494: 948–950.
- 7.Hussain S, Muhammad L, Ishaq F, et al. Performance evaluation of various data mining algorithms on road traffic accident dataset. In Information and Communication Technology for Intelligent Systems. Springer, 2019. pp. 67–78.
- 8.Jagadeesh K, Rajendran A. Machine learning approaches for analysis in healthcare informatics 2021; pp. 105–122.
- 9.Ruan Q, Yang K, Wang W, et al. Clinical predictors of mortality due to Covid-19 based on an analysis of data of 150 patients from Wuhan, China. Intensive Care Med 2020; 465: 846–848.
- 10.Wu C, Chen X, Cai Y, et al. Risk factors associated with acute respiratory distress syndrome and death in patients with coronavirus disease 2019 pneumonia in Wuhan, China. JAMA Intern Med 2020; 1807: 934–943.
- 11.Charumilind S, Craven M, Lamb J, et al. When will the Covid-19 pandemic end? An update, 2021.
- 12.Costanzo M, De Giglio MAR, Roviello GN. Anti-coronavirus vaccines: Past investigations on sars-cov-1 and mers-cov, the approved vaccines from biontech/pfizer, moderna, oxford/astrazeneca and others under development against sars-cov-2 infection. Curr Med Chem 2021; 29: 4–18.
- 13.Matrajt L, Eaton J, Leung T, et al. Vaccine optimization for covid-19: Who to vaccinate first? Sci Adv 2020; 76: eabf1374.
- 14.Umer M, Ashraf I, Ullah S, et al. Covinet: a convolutional neural network approach for predicting Covid-19 from chest x-ray images. J Ambient Intell Humaniz Comput 2021; 13: 535–547.
- 15.Ashraf I, Alnumay WS, Ali R, et al. Prediction models for Covid-19 integrating age groups, gender, and underlying conditions. Comput, Materials Continua 2021; 673: 3009–3044.
- 16.Cheng Y, Luo R, Wang K, et al. Kidney disease is associated with in-hospital death of patients with Covid-19. Kidney Int 2020; 975: 829–838.
- 17.Zhou F, Yu T, Du R, et al. Clinical course and risk factors for mortality of adult inpatients with Covid-19 in Wuhan, China: a retrospective cohort study. Lancet 2020; 39510229: 1054–1062.
- 18.Jiang M, Li C, Zheng L, et al. A biomarker-based age, biomarkers, clinical history, sex (abcs)-mortality risk score for patients with coronavirus disease 2019. Ann Transl Med 2021; 93: 1–5.
- 19.Onan A. Bidirectional convolutional recurrent neural network architecture with group-wise enhancement mechanism for text sentiment classification. J King Saud Univ-Comput Inform Sci 2022; 34: 2098–2117.
- 20.Onan A, Toçoğlu MA. A term weighted neural language model and stacked bidirectional LSTM based framework for sarcasm identification. IEEE Access 2021; 9: 7701–7722.
- 21.Onan A, Korukoğlu S, Bulut H. Ensemble of keyword extraction methods and classifiers in text classification. Expert Syst Appl 2016; 37: 232–247.
- 22.Onan A. An ensemble scheme based on language function analysis and feature engineering for text genre classification. J Inf Sci 2018; 44: 28–47.
- 23.Onan A. A feature selection model based on genetic rank aggregation for text sentiment classification. J Inf Sci 2017; 43: 25–38.
- 24.Onan A, Korukoğlu S, Bulut H. A hybrid ensemble pruning approach based on consensus clustering and multi-objective evolutionary algorithm for sentiment classification. Inf Process Manag 2017; 53: 814–833.
- 25.Onan A. Sentiment analysis on product reviews based on weighted word embeddings and deep neural networks. Concurr Comp: Pract Exp 2021; 33: e5909.
- 26.Onan A. Sentiment analysis on massive open online course evaluations: a text mining and deep learning approach. Comput Appl Eng Educ 2021; 29: 572–589.
- 27.Onan A. Mining opinions from instructor evaluation reviews: a deep learning approach. Comput Appl Eng Educ 2020; 28: 117–138.
- 28.Onan A. Two-stage topic extraction model for bibliometric data analysis based on word embeddings and clustering. IEEE Access 2019; 7: 145614.
- 29.Onan A. Biomedical text categorization based on ensemble pruning and optimized topic modelling. Comput Math Methods Med 2018; 2018: 2497471.
- 30.Onan A. Topic-enriched word embeddings for sarcasm identification. Comput Sci On-line Conference 2019; 984: 293–304.
- 31.Sadiq S, Mehmood A, Ullah S. Aggression detection through deep neural model on twitter. Future Gener Comput Syst 2021; 114: 120–129.
- 32.Sadiq S, Khalid MU, Ullah S, et al. Classification of β-thalassemia carriers from red blood cell indices using ensemble classifier. IEEE Access 2021; 9: 45528–45538.
- 33.Castiglione A, Vijayakumar P, Nappi M, et al. Covid-19: automatic detection of the novel coronavirus disease from ct images using an optimized convolutional neural network. IEEE Trans Ind Inf 2021; 17: 6480–6488.
- 34.Castiglione A, Umer M, Sadiq S, et al. The role of internet of things to control the outbreak of Covid-19 pandemic. IEEE Int Things J 2021; 8: 16072–16082.
- 35.Estiri H, Strasser ZH, Klann JG, et al. Predicting Covid-19 mortality with electronic medical records. NPJ Digit Med 2021; 41: 1–10.
- 36.Garg A. Covid-19 world vaccine adverse reactions, 2021. https://www.kaggle.com/ayushggarg/covid19-vaccine-adverse-reactions?select=2021VAERSDATA.csv.
- 37.VAERS. COVID-19 World Vaccine Adverse Reactions. https://www.kaggle.com/ayushggarg/covid19-vaccine-adverse-reactions?select=2021VAERSDATA.csv, 2021. [Online; accessed September 06, 2021].
- 38.Zhang P, Gao D, An K, et al. A programmable polymer library that enables the construction of stimuli-responsive nanocarriers containing logic gates. Nat Chem 2020; 124: 381–390.
- 39.Giveki D. Scale-space multi-view bag of words for scene categorization. Multimed Tools Appl 2021; 801: 1223–1245.
- 40.Meijer H, Truong J, Karimi R. Document embedding for scientific articles: Efficacy of word embeddings vs tfidf. arXiv preprint arXiv:2107.05151, 2021.
- 41.Obayes HK, Al-Turaihi FS, Alhussayni KH. Sentiment classification of user's reviews on drugs based on global vectors for word representation and bidirectional long short-term memory recurrent neural network. Indones J Elect Eng Comput Sci 2021; 231: 345–353.
- 42.Batista GE, Prati RC, Monard MC. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl 2004; 61: 20–29.
- 43.Chawla NV, Bowyer KW, Hall LO, et al. Smote: synthetic minority over-sampling technique. J Artif Intell Res 2002; 16: 321–357.
- 44.Ishaq A, Sadiq S, Umer M, et al. Improving the prediction of heart failure patients' survival using smote and effective data mining techniques. IEEE Access 2021; 9: 39707–39716.
- 45.Biau G, Scornet E. A random forest guided tour. Test 2016; 252: 197–227.
- 46.Akhter I, Jalal A, Kim K. Pose estimation and detection for event recognition using sense-aware features and adaboost classifier. In 2021 International Bhurban Conference on Applied Sciences and Technologies (IBCAST). IEEE, pp. 500–505.
- 47.Sharaff A, Gupta H. Extra-tree classifier with metaheuristics approach for email classification. In Advances in Computer Communication and Computational Sciences. Springer, 2019. pp. 189–197.
- 48.Saad E, Din S, Jamil R, et al. Determining the efficiency of drugs under special conditions from users' reviews on healthcare web forums. IEEE Access 2021; 9: 85721–85737.
- 49.Gardner MW, Dorling S. Artificial neural networks (the multilayer perceptron): a review of applications in the atmospheric sciences. Atmos Environ 1998; 3214-15: 2627–2636.
- 50.Natekin A, Knoll A. Gradient boosting machines, a tutorial. Front Neurorobot 2013; 7: 21.
- 51.Sharma KK, Seal A. Spectral embedded generalized mean based k-nearest neighbors clustering with s-distance. Expert Syst Appl 2021; 169: 114326.
- 52.Landi F, Baraldi L, Cornia M, et al. Working memory connections for lstm. Neural Netw 2021; 144: 334–341.
- 53.Jamil R, Ashraf I, Rustam F, et al. Detecting sarcasm in multi-domain datasets using convolutional neural networks and long short term memory network model. PeerJ Comput Sci 2021; 7: e645.
- 54.Liu B, Song C, Wang Q, et al. Forecasting of china's solar pv industry installed capacity and analyzing of employment effect: based on gra-bilstm model. Environ Sci Poll Res 2021; 29: 1–17.
- 55.Wang J, Yu L, Lai K, et al. Dimensional sentiment analysis using a regional CNN-LSTM model. Proceedings of the 54th annual meeting of the association for computational linguistics, August 7-12, Berlin, Germany; 2016; pp. 225–230.