Interpretable Machine Learning for COVID-19: An Empirical Study on Severity Prediction Task

Han Wu; Wenjie Ruan; Jiangtao Wang; Dingchang Zheng; Bei Liu; Yayuan Geng; Xiangfei Chai; Jian Chen; Kunwei Li; Shaolin Li; Sumi Helal

doi:10.1109/TAI.2021.3092698

. 2021 Jun 25;4(4):764–777. doi: 10.1109/TAI.2021.3092698

Interpretable Machine Learning for COVID-19: An Empirical Study on Severity Prediction Task

Han Wu ¹, Wenjie Ruan ^1,^✉, Jiangtao Wang ², Dingchang Zheng ², Bei Liu ³, Yayuan Geng ⁴, Xiangfei Chai ⁴, Jian Chen ⁵, Kunwei Li ⁵, Shaolin Li ⁶, Sumi Helal ⁷

PMCID: PMC10620962 PMID: 37954545

Abstract

The black-box nature of machine learning models hinders the deployment of some high-accuracy medical diagnosis algorithms. It is risky to put one’s life in the hands of models that medical researchers do not fully understand or trust. However, through model interpretation, black-box models can promptly reveal significant biomarkers that medical practitioners may have overlooked due to the surge of infected patients in the COVID-19 pandemic. This research leverages a database of 92 patients with confirmed SARS-CoV-2 laboratory tests between 18th January 2020 and 5th March 2020, in Zhuhai, China, to identify biomarkers indicative of infection severity prediction. Through the interpretation of four machine learning models, decision tree, random forests, gradient boosted trees, and neural networks using permutation feature importance, partial dependence plot, individual conditional expectation, accumulated local effects, local interpretable model-agnostic explanations, and Shapley additive explanation, we identify an increase in N-terminal pro-brain natriuretic peptide, C-reaction protein, and lactic dehydrogenase, a decrease in lymphocyte is associated with severe infection and an increased risk of death, which is consistent with recent medical research on COVID-19 and other research using dedicated models. We further validate our methods on a large open dataset with 5644 confirmed patients from the Hospital Israelita Albert Einstein, at São Paulo, Brazil from Kaggle, and unveil leukocytes, eosinophils, and platelets as three indicative biomarkers for COVID-19.

Keywords: Artificial intelligence in health, artificial intelligence in medicine, interpretable machine learning

I. Introduction

The sudden outbreak of COVID-19 has caused an unprecedented disruption and impact worldwide. With more than 100 million confirmed cases as of February 2021, the pandemic is still accelerating globally. The disease is transmitted by inhalation or contact with infected droplets with an incubation period ranging from 2 to 14 days [1], making it highly infectious and difficult to contain and mitigate.

With the rapid transmission of COVID-19, the demand for medical supplies goes beyond hospitals’ capacity in many countries. Various diagnostic and predictive models are employed to release the pressure on healthcare workers. For instance, a deep learning model that detects abnormalities and extract key features of the altered lung parenchyma using chest CT images is proposed [2]. On the other hand, Rich Caruana et al. [3] exploit intelligible models that use generalized additive models with pairwise interactions to predict the probability of readmission. To maintain both interpretability and complexity, DeepCOVIDNet is present to achieve predictive surveillance that identifies the most influential features for the prediction of the growth of the pandemic [4] through the combination of two modules. The embedding module takes various heterogeneous feature groups as input and outputs an equidimensional embedding corresponding to each feature group. The DeepFM [5] module computes second and higher order interactions between them.

Models that achieves high accuracy provide fewer interpretations due to the tradeoff between accuracy and interpretability [6]. To be adopted in healthcare systems that require both interpretability and robustness [7], the multitree XGBoost algorithm is employed to identify the most significant indicators in COVID-19 diagnosis [8]. This method exploits the recursive tree-based decision system of the model to achieve high interpretability. On the other hand, a more complex convolutional neural network (CNN) model can discriminate COVID-19 from non-COVID-19 using chest CT image [9]. It achieves interpretability through gradient-weighted class activation mapping to produce a heat map that visually verifies where the CNN model is focusing.

Besides, several model-agnostic methods have been proposed to peek into black-box models, such as partial dependence plot (PDP) [10], individual conditional expectation (ICE) [11], accumulated local effects (ALE) [12], permutation feature importance [13], local interpretable model-agnostic explanations (LIME) [14], Shapley additive explanation (SHAP) [15], and anchors [16]. Most of these model-agnostic methods are reasoned qualitatively through illustrative figures and human experiences. To quantitatively measure their interpretability, metrics such as faithfulness [17] and monotonicity [18] are proposed.

In this article, instead of targeting a high-accuracy model, we interpret several models to help medical practitioners promptly discover the most significant biomarkers in the pandemic, as illustrated in Fig. 1.

Fig. 1. — Difference between the usual workflow of machine learning, and our approach.

Overall, this article makes the following contributions.

1)
Evaluation: A systematic evaluation of the interpretability of machine learning models that predict the severity level of COVID-19 patients. We experiment with six interpretation methods and two evaluation metrics on our dataset and receive the same result as research that uses a dedicated model. We further validate our approach on a dataset from Kaggle.
2)
Implication: Through the interpretation of models trained on our dataset, we reveal N-terminal probrain natriuretic peptide (NTproBNP), C-reaction protein (CRP), lactic dehydrogenase (LDH), and lymphocyte (LYM) as the most indicative biomarkers in identifying patients’ severity level. Applying the same approach on the Kaggle dataset, we further unveil three significant features, leukocytes, eosinophils, and platelets.
3)
Implementation: We design a system that healthcare professionals can interact with its AI Models to incorporate model insights with medical knowledge. We release our implementation, models for future research and validation.¹

II. Preliminary of AI Interpretability

In this section, six interpretation methods, partial dependence plot, individual conditional expectation, accumulated local effects, local interpretable model-agnostic explanations, and Shapley additive explanation are summarized. We also summarize two evaluation metrics, faithfulness, and monotonicity.

A. Model-Agnostic Methods

In healthcare, restrictions to using only interpretable models bring many limitations in adoption while separating explanations from the model can afford several beneficial flexibilities [19]. As a result, model-agnostic methods have been devised to provide interpretations without knowing model details.

1)
Partial Dependence Plot: PDPs reveal the dependence between the target function and several target features. The partial function is estimated by calculating averages in the training data, also known as the Monte Carlo method. After setting up a grid for the features we are interested in (target features), we set all target features in our training set to be the value of grid points, then make predictions and average them all at each grid. The drawback of PDP is that one target feature produces 2D plots and two produce 3D plots while it can be pretty hard for a human to understand plots in higher dimensions
2)
Individual Conditional Expectation: ICE is similar to PDP. The difference is that PDP calculates the average over the marginal distribution while ICE keeps them all. Each line in the ICE plot represents predictions for each individual. Without averaging on all instances, ICE unveils heterogeneous relationships but is limited to only one target feature since two features result in overlay surfaces that cannot be identified by human eyes [20].
3)
Accumulated Local Effects: ALE averages the changes in the predictions and accumulate them over the local grid. The difference with PDP is that the value at each point of the ALE curve is the difference to the mean prediction calculated in a small window rather than all of the grid. Thus ALE eliminates the effect of correlated features [20] which makes it more suitable in healthcare because it is usually irrational to assume young people having similar physical conditions with the elderly.
4)
Permutation Feature Importance: The idea behind permutation feature importance is intuitive. A feature is significant for the model if there is a noticeable increase in the model’s prediction error after permutation. On the other hand, the feature is less important if the prediction error remains nearly unchanged after shuffling.
5)
Local Interpretable Model-Agnostic Explanations: LIME uses interpretable models to approximate the predictions of the original black-box model in specific regions. LIME works for tabular data, text, and images, but the explanations may not be stable enough for medical applications.
6)
Shapley Additive Explanation: SHAP borrows the idea of Shapley value from game theory [21], which represents contributions of each player in a game. Calculating Shapley values is computationally expensive when there are hundreds of features, thus Lundberg and Lee [15] proposed a fast implementation for tree-based models to boost the calculation process. SHAP has a solid theoretical foundation but is still computationally slow for a lot of instances.

To summarize, PDP, ICE, and ALE only use graphs to visualize the impact of different features while permutation feature importance, LIME, and SHAP provide numerical feature importance that quantitatively ranks the importance of each feature.

B. Metrics for Interpretability Evaluation

Different interpretation methods try to find out the most important features to provide explanations for the output. But as Doshi-Velez and Kim [6] questioned, “Are all models in all defined-to-be-interpretable model classes equally interpretable?” And how can we measure the quality of different interpretation methods?

Faithfulness: Faithfulness incrementally removes each of the attributes deemed important by the interpretability metric, and evaluate the effect on the performance. Then it calculates the correlation between the weights (importance) of the attributes and corresponding model performance and returns correlation between attribute importance weights and the corresponding effect on classifier [17].

Monotonicity: Monotonicity incrementally adds each attribute in order of increasing importance. As each feature is added, the performance of the model should correspondingly increase, thereby resulting in monotonically increasing model performance, and it returns true of false [18].

In our experiment, both faithfulness and monotonicity are employed to evaluate the interpretation of different machine learning models.

III. Empirical Study on COVID

In this section, features in our raw dataset and procedures of data preprocessing are introduced. After preprocessing, four different models: Decision tree, random forest, gradient boosted trees, and neural networks are trained on the dataset. Model interpretation is then employed to understand how different models make predictions, and patients that models make false diagnoses are investigated respectively.

A. Dataset and Perprocessing

The raw dataset consists of patients with confirmed SARS-CoV-2 laboratory tests between 18th January 2020 and 5th March 2020, in Zhuhai, China. Our Research Ethics Committee waived written informed consent for this retrospective study that evaluated deidentified data and involved no potential risk to patients. All the data of patients have been anonymized before analysis.

Tables in the Appendix list all 74 features in the raw dataset consisting of body mass index (BMI), complete blood count (CBC), blood biochemical examination, inflammatory markers, symptoms, anamneses, among others. Whether or not health care professionals will order a test for patients is based on various factors such as medical history, physical examination, and etc. Thus, there is no standard set of tests that are compulsory for every individual which introduces data sparsity. For instance, left ventricular ejection fraction (LVEF) are mostly empty because most patients are not required to take the color doppler ultrasound test.

After pruning out irrelevant features, such as patients’ medical numbers that provide no medical information, and features that have no patients’ records (no patient took this test), 86 patients’ records with 55 features are selected for further investigation. Among those, 77 records are used for training, cross-validation, and 9 reserved for testing. The feature for classification is Severity01 which indicates normal with 0, and severe with 1. More detailed descriptions about features in our dataset are listed in the Appendix.

Feature engineering is applied before training and interpreting our models, as some features may not provide valuable information or provide redundant information.

First, constant and quasi-constant features were removed. For instance, the two features, PCT2 and Stomachache, have the same value for all patients providing no valuable information in distinguishing normal and severe patients.

Second, correlated features were removed because they provide redundant information. Table I lists all correlated features using Pearson’s correlation coefficient.

TABLE I. Feature Correlation.

Feature 1	Feature 2	Correlation
cTnICKMBOrdinal1	cTnICKMBOrdinal2	0.853 741
LDH	HBDH	0.911 419
NEU2	WBC2	0.911 419
LYM2	LYM1	0.842 688
NTproBNP	N2L2	0.808 767
BMI	Weight	0.842 409
NEU1	WBC1	0.90 352

Layer type	Output shape	Param
Dense	(None, 10)	370
Dropout	(None, 10)	0
Dense	(None, 15)	165
Dropout	(None, 15)	0
Dense	(None, 5)	80
Dropout	(None, 5)	0
Dense	(None, 1)	6

Statistical methods	Removed features
Mutual information	Height, CK, HiCKMB, Cr, WBC1, hemoptysis
Univariate	Weight, AST, CKMB, PCT1, WBC2

Model	Most important features
Decision tree	NTproBNP, CRP2, ALB2, temp, symptom
Random forest	CRP2, NTproBNP, cTnI, LYM1, ALB2
Gradient boosted trees	CRP2, cTnITimes, LYM1, NTproBNP, phlegm
Neural networks	NTproBNP, CRP2, CRP1, LDH, age

Classifier	CV	Test set			95% confidence interval
	F1	Precision	Recall	F1
Decision tree	0.55	0.67	0.50	0.57	0.31
Random forest	0.62	0.67	0.50	0.57	0.31
Gradient boosted trees	0.67	0.78	1.00	0.80	0.27
Neural networks	0.58	0.78	1.00	0.80	0.27

No	Class	Probability of severe	Prediction	Type
2	Normal	0.53	Severe	False positive
5	Severe	0.24	Normal	False negative

Feature	Patient 5 (Severe)	Patient 2 (Normal)
Sex	1.00	1.00
Age	63.00	42.00
AgeG1	1.00	0.00
Temp	36.40	37.50
cTnITimes	7.00	8.00
cTnI	0.01	0.01
cTnICKMBOrdinal1	0.00	0.00
LDH	220.00	263.00
NTproBNP	433.00	475.00
LYM1	1.53	1.08
N2L1	3.13	2.16
CRP1	22.69	36.49
ALB1	39.20	37.60
CRP2	22.69	78.76
ALB2	36.50	37.60
Symptoms	None	Fever
NDisease	Hypertention	Hypertention, DM, hyperlipedia

Model	LIME
Decision tree	NTproBNP, CRP2, NauseaNVomit
Random forest	NTproBNP, CRP2, CRP1
Gradient boosted trees	NTproBNP, CRP2, LYM1
Neural networks	NTproBNP, CRP2, PoorAppetite
Model	SHAP
Decision tree	CRP2, NTproBNP, ALB2
Random forests	CRP2, CRP1, LDH
Gradient boosted trees	CRP2, NTproBNP, LDH
Neural networks	CRP2, NTproBNP, CRP1

Feature	Comments
Severity03	Severe (3) - normal (0)
Severity01	Severe (1), normal (0)

Feature	Comments
MedNum	Medical number
No	Patient no.
Sex	Man (1), woman(0)
Age	–
AgeG1
Height	–
Weight	–
BMI	Body mass index

Feature	Comments
WBC1	White blood cell (first time)
NEU1	Neutrophil count (first time)
LYM1	Lymphocyte count (first time)
N2L1	–
WBC2	White blood cell (second time)
NEU2	Neutrophil count (second time)
LYM2	Lymphocyte count (second time)
N2L2	–

Severity level	Average age
0	36.83
1	47.45
2	54.31
3	69.40

Feature	Comments
AST	Aspartate aminotransferase
LDH	Lactate Dehydrogenase
CK	Creatine Kinase
CKMB	Amount of an isoenzyme of creatine kinase (CK)
HBDH	Alpha-hydroxybutyrate dehydrogenase
HiCKMB	Highest CKMB
Cr	Serum creatinine
ALB1	Albumin count (first time)
ALB2	Albumin count (second time)

Feature	Comments
Temp	Temperature
LVEF	Left ventricular ejection fraction
Onset2Admi	Time from onset to admission
Onset2CT1	Time from onset to CT test
Onset2CTPositive1	Time from onset to CT test positive
Onset2CTPeak	Time from onset to CT peak
cTnITimes	When was cTnI tested
cTnI	Cardiac troponin I
cTnlCKMBOrdinal1	The value when hospitalized
cTnlCKMBOrdinal2	The maximum value when hospitalized
CTScore	Peak CT score
AIVolumneP	Peak volume
SO2	Empty
PO2	Empty
YHZS	Empty
RUL	Empty
RML	Empty
RLL	Empty
LUL	Empty
LLL	Empty

Feature	Value
SARS-Cov-2 test result	1
Patient age quantile	14.00
Hematocrit	0.92
Platelets	1.26
Mean platelet volume	0.79
Mean corpuscular hemoglobin concentration (MCHC)	0.65
Leukocytes	1.47
Basophils	1.14
Eosinophils	0.83
Monocytes	0.96
Proteina C reativa mg/dL	0.236

Feature	Comments
PCT1	Procalcitonin (first time)
CRP1	C-reactive protein (first time)
PCT2	Procalcitonin (second time)
CRP2	C-reactive protein (second time)

Feature	Comments
Symptom	–
Fever	–
Cough	–
Phlegm	–
Hemoptysis	–
SoreThroat	–
Catarrh	–
Headache	–
ChestPain	–
Fatigue	–
SoreMuscle	–
Stomachache	–
Diarrhea	–
PoorAppetie	–
NauseaNVomit	–
Hypertention	–
Hyperlipedia	–
DM	Diabetic mellitus
Lung	Lunge disease
CAD	Coronary heart disease
Arrythmia	–
Cancer	–

Model	Most important features
Decision tree	Leukocytes, eosinophils, patient age quantile
Random forest	Leukocytes, eosinophils, platelets
Gradient boosted trees	Patient age quantile, hematocrit, platelets
Neural networks	Leukocytes, platelets, monocytes

Models	LIME	SHAP
Random forests	0.37	0.59
Gradient boosted trees	0.46	0.49
Neural networks	0.45	0.33

Models	LIME	SHAP
Random forests	False	False
Gradient boosted trees	22% True	22% True
Neural networks	False	False

Models	LIME	SHAP
Random forests	0.71	0.82
Gradient boosted trees	0.61	0.72
Neural networks	0.25	0.42

PERMALINK

Interpretable Machine Learning for COVID-19: An Empirical Study on Severity Prediction Task

Han Wu

Wenjie Ruan

Jiangtao Wang

Dingchang Zheng

Bei Liu

Yayuan Geng

Xiangfei Chai

Jian Chen

Kunwei Li

Shaolin Li

Sumi Helal

Abstract

I. Introduction

Fig. 1.

II. Preliminary of AI Interpretability

A. Model-Agnostic Methods

B. Metrics for Interpretability Evaluation

III. Empirical Study on COVID

A. Dataset and Perprocessing

TABLE I. Feature Correlation.

B. Training Models

TABLE III. Structure of Neural Networks.

TABLE II. Features With Mutual Information.

C. Interpretation (Permutation Feature Importance)

TABLE V. Five Most Important Features.

TABLE IV. Classification Results on Our Dataset.

D. Interpretation (PDP, ICE, ALE)

Fig. 2.

Fig. 3.

Fig. 4.

E. Misclassified Patients

F. Interpretation (False Negative)

1). Wrong Diagnoses

Fig. 5.

Fig. 6.

2). Correct Diagnoses

G. Interpretation (False Positive)

TABLE VI. Misclassified Patients.

TABLE VIII. Record of the False Positive Patient 2.

TABLE VII. Average Age in Different Severity Levels.

1). Doctors’ Diagnoses

2). Models’ Diagnoses

TABLE IX. Most Important Features From LIME and SHAP.

Fig. 7.

Fig. 8.

Fig. 9.

Fig. 10.

TABLE XVII. Diagnoses.

TABLE XVIII. Personal Info.

TABLE XIX. Complete Blood Count.

TABLE XX. Inflammatory Markers.

TABLE XXI. Biochemical Examination.

TABLE XXII. Symptoms and Anamneses.

TABLE XXIII. Other Test Results.

H. Evaluating Interpretation

I. Summary

IV. Validation on Other Datasets

A. Validation on 485 Infected Patients in China

B. Validation on 5644 Infected Patients in Brazil

TABLE XIV. Five Most Important Features (Kaggle).

TABLE X. Failthfulness Evaluation.

TABLE XI. Monotonicity Evaluation.

TABLE XII. Patient No. 0 in the Kaggle Dataset.

TABLE XIII. Classification Results (Kaggle).

TABLE XV. Failthfulness Evaluation (Kaggle).

TABLE XVI. Monotonicity Evaluation (Kaggle).

V. Conclusion

Appendix.

Funding Statement

Footnotes

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases