COFE-Net: An ensemble strategy for Computer-Aided Detection for COVID-19

Avinandan Banerjee; Rajdeep Bhattacharya; Vikrant Bhateja; Pawan Kumar Singh; Aime’ Lay-Ekuakille; Ram Sarkar

doi:10.1016/j.measurement.2021.110289

2021 Oct 14;187:110289. doi: 10.1016/j.measurement.2021.110289

PMCID: PMC8516129 PMID: 34663998

Abstract

Biomedical images contain a large volume of sensor measurements, which can reveal descriptors of the disease under investigation. Computer-based analysis of such measurements helps detect the disease and thereby helps medical professionals choose adequate therapy swiftly. In this paper, we propose a robust deep learning ensemble framework known as the COVID Fuzzy Ensemble Network, or COFE-Net. This strategy is proposed for COVID-19 screening from chest X-rays (CXRs) and CT scans, as part of Computer-Aided Detection (CADe) for medical practitioners. We leverage Transfer Learning for Convolutional Neural Networks (CNNs), which is widely adopted in recent literature, and further propose an efficient ensemble network for combining them. The principles of fuzzy logic are used to combine the decision scores generated by three state-of-the-art CNNs (Inception V3, Inception ResNet V2 and DenseNet 201) through the Choquet fuzzy integral. Experimental results support the efficacy of our approach over empirical ensembling: the fuzzy ensembling strategy dynamically refactors the classifier ensemble weights on the fly, based upon the confidence scores for coalitions of inputs. This is the chief advantage of our strategy, as the other methods keep their weights fixed and do not adjust to the generated measurements dynamically. Results on multiple datasets demonstrate the effectiveness of the proposed method. The source code of our proposed method is available at: https://github.com/theavicaster/covid-cade-ensemble.

Keywords: COVID-19 detection, COFE-Net, Deep learning, Fuzzy integral, Ensemble, Classifier fusion, Chest X-Ray, CT Scan, Biomedical measurement

1. Introduction

The rapid spread of the Novel Coronavirus disease (COVID-19) has been a cause for great concern ever since it first emerged in Wuhan, China in 2019. It has resulted in a global pandemic and served as a catalyst for the disruption of normal life worldwide. COVID-19, the disease caused by the SARS-CoV-2 virus, is a severe acute respiratory syndrome, the typical symptoms of which include breathlessness, fever, weakness, cough and cold, and loss of smell and taste. The virus had infected over 176 million people worldwide as of the 15th of June 2021, with over 3.8 million of them succumbing to the disease. An SIR model-based investigation of the propagation of the disease has been carried out by Saxena et al. in [1].

The primary problem with COVID-19 is the long incubation period of the virus, ranging from a few days to multiple weeks; in some cases, patients are asymptomatic as well. During this period, the person acts as an active carrier of the disease, unknowingly spreading it to other people in their vicinity. The applications of technology proposed in the works [2], [3], [4], [5] for monitoring, biomedical imaging and early detection of disease have had a positive impact on the medical field. Applications of research such as the work [6] have helped in proper social distancing measures. Though conventional detection methods like Reverse Transcription Polymerase Chain Reaction (RT-PCR) from a nasopharyngeal swab have proved to be highly effective [7], such methods are time-consuming and there are quite a few false positives, as in the results of the work [8]. Hence, Computer-Aided Detection (CADe) has been looked into as an alternative and viable solution.

CADe is a sub-field of the Biomedical Image Analysis domain, one of the rapidly growing interdisciplinary research fields, which spans biology, engineering and medicine. It is concerned with measurements of the human body on macroscopic and microscopic scales. The core of this research field is the application of image processing methodologies to solve various medical problems of the human body. As biomedical images contain important information about the anatomical structure of the affected body parts, they are extremely useful for proper detection and thus assist medical experts in treating patients better.

Generally, medical experts analyse such images manually and apply their experience to understand the severity of the disease. However, such manual analysis is limited by, among other factors, differences in interpretation between individual professionals, which make it a subjective matter. In contrast, computer-based automated investigation of biomedical images favours objective analysis, thus leading to better diagnosis of patients. Such systems can make diagnosis more economical and less time-consuming, which is one of the basic needs of developing nations.

It is notable that in recent years, CADe has proved to be very successful for biomedical purposes. It has been used for detection of pulmonary disorders, coronary artery disease, Alzheimer’s disease and other such diseases. For COVID-19, CADe based methods focus on analysing the Chest X-ray (CXR) or chest Computed Tomography (CT) Scan images for detecting the presence of COVID-19. The sample CXR images for COVID-19, pneumonia and normal patients are shown in Fig. 1. More recently, alternative modalities such as Scattergram images [9] have also found success in COVID-19 CADe.

Fig. 1 — Sample Images of chest X-rays for all three classes in the COVID-X dataset.

Deep learning has shown rapid improvement in CADe based treatment in various fields, the latest being COVID-19. Quite a few attempts have been made to develop a robust system capable of efficiently detecting COVID-19 in a person, such as in the works [10], [11], [12], [13], [14], [15]. Most of these have utilized deep learning due to its high efficiency in recent years. Specifically, Convolutional Neural Networks (CNNs) have been used in most cases because they have obtained great success in recent years in classifying radiological images. Further, deep CNNs do not need to be fed handcrafted features obtained through feature engineering, which is why they are preferred over conventional machine learning classifiers. They have also proved to be more effective in image classification in general than most other methods, so most researchers resort to them for classifying any category of images.

1.1. Motivation

In this paper, we propose a CADe framework which benefits from the combined prediction abilities of CNN models. The steps of the proposed work are summarized in Fig. 2.

Fig. 2 — Schematic diagram of our proposed methodology which consists of: (I) Preprocessing input biomedical images to conform to expected input of standard CNN architectures, (II) Classification using three CNNs leveraging Transfer Learning, and (III) Ensemble of classifiers using Choquet fuzzy integral to yield prediction, available for medical practitioners.

Initially, we process the acquired images to be of uniform shape. This is necessary to harness the standard CNN architectures as feature extractors. A large body of methods have investigated the use of transfer learning for CNN classifiers. We employ three such state-of-the-art CNN architectures to generate decision scores based on the processed inputs.

Owing to the stochastic learning process of deep learning models, the decision scores generated by CNNs contain a degree of uncertainty. Each of the constituent models converges to a particular local minimum of the loss function, as a result of the gradient descent algorithm used for training. The imperfectly converged models, as well as noise in the sampled observations upon which they are trained, lead to uncertainty in the predictions. To counter this, we rely on the principles of fuzzy logic to harness this uncertainty and use it effectively to generate our final predictions. This is done by an efficient ensembling strategy which uses fuzzy logic principles to combine the results of the individual classifiers, weighing them in accordance with their scores. Fuzzy logic performs exceptionally well in situations wherein decisions are made upon imprecise information. We investigate the Choquet integral for the classifier ensemble through fuzzy integrals, which works as a generalization of previously explored empirical schemes. It additionally supports conditioning the weightage of each classifier at inference based upon the decision scores of prior classifiers in the ensemble process. Specifically, it caters to the fact that a biomedical image may contain crucial information that one CNN in the ensemble fails to detect, even though that CNN captures other important details.

1.2. Contributions

We have chosen a unique combination of three CNN based classifiers such that their outputs complement each other appropriately while generating decision scores upon CXRs or CT scans. Transfer learning is used to reduce training time as well as increase the efficiency of the networks. An ensemble method is employed using an efficient strategy based on the Choquet fuzzy integral, and the performance obtained is compared with empirical ensemble strategies. We have achieved impressive performance on multiple COVID-19 image datasets through the ensemble framework, wherein the result achieved is beyond the reach of the individual classifiers. Appreciable performance on multiple datasets in varied fields of medical imaging, measured using a variety of metrics, along with a detailed ablation study, K-fold cross validation results, and a comparative study with other methods, demonstrates the robustness of the network. Overall, the method is a unique combination of new and existing research topics which generates desirable results and mostly outperforms its predecessors.

2. Related work

Among the recently proposed CADe methods for COVID-19, there are two sources of medical images: CXRs and CT scans. The authors of [16] utilized an ensembling approach via majority vote on classical machine learning models, using texture features extracted from the X-ray images. A hierarchical classification methodology was used for the multi-class problem.

CNN based classifiers have been a popular choice for CADe in recent literature. In the work [17], the authors adapted the Darknet-19 CNN architecture from YOLO object detection to work on X-ray scans, with an expert radiologist evaluating the activation maps generated in the training process. In the work [18], the authors leveraged a CNN based architecture whose design was explored through generative synthesis, a machine-driven exploration strategy.

The principle of transfer learning has been extensively explored when utilizing CNNs. It is useful for the application of deep learning in various domains, as in the work [19]. The authors of [20] utilized a transfer learning based approach by harnessing the architecture and saved weights of state-of-the-art CNN classifiers on the ImageNet benchmark. In the study [21], the authors investigated a large number of state-of-the-art CNN models with transfer learning, along with image augmentation to enhance the limited number of X-ray samples. In the work [22], the ResNet-50 architecture was fine-tuned with progressive resizing, and data augmentation techniques were utilized. In [23], a transfer learning method based on the Xception CNN architecture was utilized. In [24], a hierarchical classification methodology was utilized for a multi-class approach; this multi-stage cascaded disease classification also leveraged transfer learning using CNNs. In [25], transfer learning was applied via the SqueezeNet CNN architecture, along with Bayesian optimization and data augmentation.

In the work [26], VGG architecture based CNNs with transfer learning were ensembled with the empirical late fusion strategy of stacking. In the work [27], a capsule-based network was used; this approach is similar to CNNs but includes a “routing by agreement” component that combines different capsules and identifies spatial relations. Transfer learning was utilized on an X-ray based dataset. The authors of [28] proposed an ensemble approach exploring several empirical fusion schemes upon transfer learning based CNNs which were pruned for optimal hyperparameters.

Fuzzy logic is a natural choice for the ensemble of classifiers, given the uncertainty of decision scores from each of the learners. The principles of fuzzy measures and fuzzy integrals were first introduced in the work [29]. Building on those ideas, the authors of [30] introduced $λ$-measures. These state that the sum of all interactions of sources is 1, allowing the efficient calculation of $λ$. In the study [31], the authors introduced the concept of the Choquet integral as a non-linear aggregation that generalizes the product and addition rules, two empirical rules for classifier ensembles.

The concept of fuzzy integrals has been used to solve a variety of pattern recognition problems across various domains including human action recognition [32].

2.1. Research gaps

The existing supervised classification algorithms using machine learning, such as used in the work [16], are unable to harness the data-rich image modalities as effectively as the deep learning strategies such as the CNN. This is because in conventional machine learning, the features mostly need to be handcrafted and fed to the model, whereas in deep learning, the features are automatically extracted by the network such that its purpose is best suited. Thus, the handcrafted features are mostly not as efficient as those extracted by the deep learning architectures, which are delineated by the ability to learn complex representations from the image data, without any feature engineering by the researchers. The ensemble strategy used in the classical models, however, can also be enhanced by the proposed fuzzy ensemble framework in such cases where class probabilities are generated by the models.

While multiple works [20], [21], [23], [24], [25], [26], [27], [28], [33] have utilized transfer learning on standard CNN architectures for CADe, most of them do not harness complementary constituent base classifiers for the ensemble. Moreover, the ensemble strategies used do not support the dynamic refactoring of weights at inference time and are mostly static, which affects performance to an extent. The dynamic refactoring using the principles of fuzzy logic, as introduced in the work [29], has been utilized as part of the proposed framework.

3. Proposed method

We approach the problem of COVID-19 CADe from biomedical images as a multi-class classification setting. The classifiers used in our approach are state-of-the-art CNNs, which are further supported by leveraging transfer learning to utilize the knowledge of existing pre-trained models. Besides, the principles of fuzzy logic have been used as a classifier combination technique, specifically based upon the Choquet fuzzy integral.

3.1. CNN classifiers

In this paper, the pre-trained convolutional blocks and the weights of some standard CNN architectures are utilized, followed by a deep learning based classifier which is trained end-to-end. The training phase involves fine-tuning the convolutional feature extractors, and training the classifier which is accelerated by the saved weights of the convolutional layers. This strategy has been followed to leverage the effective convolutional feature extractors with knowledge mined from the ImageNet dataset, as well as the reduced computational complexity leading to faster training.

We utilize the strategy of transfer learning to fine-tune pre-trained CNN models, which were originally trained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset, to work on COVID-19 detection from CXRs. Owing to the large size of the ImageNet dataset, the classifier is already well-trained to recognize certain low-level features in the biomedical images even before being trained on them. This is also helpful when the dataset of the target domain is of limited size, as the knowledge from other domains on which the networks are already trained can be transferred to overcome the limited data size.

The CNN models are originally trained on the ImageNet dataset. However, the CXRs used in this study are of varying dimensions, while ImageNet input images are of the dimension 224 × 224. Consequently, the images must be resized to a compatible dimension. Black borders are added at the boundaries of the images so that they conform to a square input.
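The border-padding step can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: the function name `pad_to_square` is hypothetical, and centering the image between equal borders is an assumption, since the text only states that black borders are added. The later resize to the network's input size is not shown.

```python
import numpy as np


def pad_to_square(image: np.ndarray) -> np.ndarray:
    """Pad an H x W (or H x W x C) image with black borders to a square."""
    height, width = image.shape[:2]
    side = max(height, width)
    pad_h = side - height
    pad_w = side - width
    # Assumption: split the padding as evenly as possible between both sides.
    pad_spec = [(pad_h // 2, pad_h - pad_h // 2),
                (pad_w // 2, pad_w - pad_w // 2)]
    # Leave any channel axis untouched.
    pad_spec += [(0, 0)] * (image.ndim - 2)
    return np.pad(image, pad_spec, mode="constant", constant_values=0)
```

Padding before resizing keeps the aspect ratio of the anatomy intact, which a direct resize of a non-square scan would distort.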

The same architecture of the original model has been adopted, except the layers following the convolutional layers. The weights of the convolutional layers are frozen, and additional layers are added after the feature extractors. The block diagram of the layers following the feature extractors is shown in Fig. 3.

Fig. 3 — Proposed classification network following state-of-the-art CNN architectures.

The last layer of each of the CNN models applies the Softmax activation function, which is defined by the following formula:

q_{c} = \frac{e^{z_{c}}}{\sum_{k = 1}^{C} e^{z_{k}}}

(1)

Here, $z_{c}$ is the pre-activation output of the last layer for class $c$ and $C$ is the number of classes. The output of this layer represents a probability distribution over the predicted output classes, which we refer to as the confidence score generated by the classifier.
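As an illustration of Eq. (1), the confidence scores can be computed from the last-layer outputs as below. This is a minimal NumPy sketch, not the code used in the experiments. Subtracting the maximum before exponentiating is a standard numerical-stability step added here as an assumption; it does not change the result.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Softmax over the last axis, matching Eq. (1)."""
    # Shifting by the maximum avoids overflow and leaves the ratios unchanged.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)
```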

An overview of each of the classifiers used in the present work has been provided in the upcoming subsections.

3.1.1. Inception V3

The inception module was proposed by the authors of [34] in 2014. This first version involves using multiple filter sizes on the same convolutional level, hence increasing the width of the network. The idea behind this was to negate the effect of the size of the object in question and improve the efficiency of the localization of information. It also includes an extra 1 × 1 convolution for dimensionality reduction. The GoogLeNet architecture proposed in the same paper includes nine such inception modules.

Inception V2 and V3 were proposed in the work [35]. Inception V2 includes certain enhancements to improve the computational speed of the network, such as factorization of layers and expansion of the filter banks to make the network wider instead of deeper. Inception V3 additionally uses the RMSProp optimizer, batch normalization in the auxiliary classifiers, and label smoothing to improve the performance of the network. For our purpose, we use this version of Inception (i.e., Inception V3), but with the SGD optimizer. The architecture of our Inception classifier consists of 11 separate inception modules stacked linearly, each of which consists of four separate networks at the first level. Each network consists of a series of convolutional, batch normalization and pooling layers of varying sizes. As mentioned before, the filters of these layers are appropriately factorized and 1 × 1 convolutions are added. Factorization, such as breaking up an $n \times n$ filter into a $1 \times n$ and an $n \times 1$ filter, reduces the time taken by the network and improves its performance considerably. However, by altering the dimensions of a network drastically, crucial information may be lost. To handle this, the filter banks in the network are expanded so that the representational bottleneck is removed and the network becomes wider rather than deeper. The outputs of the four separate networks are then concatenated into a single output, which is fed into the next module. The output from the final module gives the result.
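As a rough check on the saving from factorization, the weight counts of the two layouts can be compared for a single input/output channel pair. This back-of-the-envelope count ignores biases and assumes the intermediate feature map keeps the same number of channels; neither detail is specified in the text above.

```latex
\underbrace{n \times n}_{\text{single filter}} = n^{2}
\quad \text{vs.} \quad
\underbrace{(1 \times n) + (n \times 1)}_{\text{factorized pair}} = 2n,
\qquad
\frac{2n}{n^{2}} = \frac{2}{n}.
% Example: n = 7 gives 49 weights vs. 14, i.e. about 71% fewer.
```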

3.1.2. Inception ResNet V2

Inspired by the performance of the ResNet, researchers came up with a hybrid model that took into account both the Inception and the ResNet models. Thus, Inception-ResNet was proposed in the work [36]. Inception-ResNet involves residual connections that feed the output of the convolutional layer to the input. It also includes explicit reduction blocks that are used to change the height and width of the grid.

In Inception ResNet V2, inception modules are used and we add residual connections that combine the convolution output of the inception module to the input. For this to work, the two of these must have the same dimensions. Hence 1 × 1 convolutions are used after the original convolutions to match the depths of the two. Pooling is replaced by residual connections, and the pooling is performed by the residual blocks as and when required. The residual activations are scaled appropriately to ensure that the network does not die out. The stem and hyperparameter settings are in line with Inception ResNet V2 as mentioned in the original paper.

3.1.3. DenseNet 201

DenseNet was proposed in the work [37] as an advancement over traditional CNNs. In conventional models, each layer is connected only to its preceding layer in a feedforward fashion. DenseNet, however, uses the outputs of all previous layers as input to generate the output of the current layer. This eliminates the vanishing gradient problem to a large extent while facilitating easier flow of information among the layers of the network; hence, the network needs fewer parameters to train.

For our purpose, we use the version DenseNet 201 where 201 indicates the number of layers in the network. The architecture starts off with conventional convolution and pooling layers followed by three sequences of dense blocks and transition layers. This is followed by a dense block and classification layer which gives us the output. As mentioned previously, in a dense block, each layer receives inputs from all previous layers. The difference with its counterparts lies in the number of convolutional layers in the third and fourth dense blocks. Overall, there are 201 layers in the network.

3.2. Fuzzy integral based classifier fusion

The fuzzy integral has obtained considerable success in combining classifier outputs in various pattern recognition problems [32]. Fuzzy integrals exploit the decision scores obtained from individual classifiers to produce a final output. Their effectiveness results from the output being a set of confidence scores instead of singleton values. These scores are subsequently combined with a measure for each classifier, with the measures assigned beforehand according to prior results. The combination process also allows a dynamic refactoring of weights for each classifier, dependent upon the scores.

As per the work reported in [29], the fuzzy measure concerned in our case lies in the range $[0, 1]$. Formally, a fuzzy measure is a real-valued set function.

Each of the constituent CNN classifiers generates a distinct confidence score. If the confidence scores are given by $C = \{c_{1}, c_{2}, \dots, c_{N - 1}, c_{N}\}$, with $N$ denoting the total number of scores, and $g \subseteq C$, the fuzzy measure is a function $f : 2^{C} \to [0, 1]$ defined on the subsets of $C$, with $f (ϕ) = 0$ and $f (C) = 1$. Moreover, $f$ is monotonic:

g_{i} \subset g_{j} \Rightarrow f (g_{i}) \leq f (g_{j}) .

(2)

The identification of $2^{N}$ fuzzy measures as per the classic approach is a learning problem that scales exponentially with respect to the parameter $N$ , the number of information sources.

A specific type of measure is presented in the work [30]. It is known as the Sugeno fuzzy $λ$-measure, and has the additional property that there exists $λ > - 1$ such that, whenever $g_{i} \cap g_{j} = ϕ$,

f (g_{i} \cup g_{j}) = f (g_{i}) + f (g_{j}) + λ f (g_{i}) f (g_{j}) .

(3)

From the previous definitions, we can find the value of $λ$ by solving the following equation —

λ + 1 = \prod_{n = 1}^{N} (λ f (g_{n}) + 1),

(4)

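where $N = 3$ in our case, as each of the three models generates a set of scores, and $f (g_{n})$ denotes the fuzzy measure of the $n$-th classifier alone. Expanding Eq. (4) for $N = 3$ gives a cubic in $λ$ with the trivial root $λ = 0$; dividing it out leaves a quadratic equation, and $λ$ is its real root that is $> - 1$.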

Hence, only $N$ fuzzy measures need to be identified instead of $2^{N}$, as $λ$ can be used to generate the fuzzy measures for all coalitions of the inputs through Eq. (3). The reduction in the search space offered by Sugeno fuzzy $λ$-measures makes experimental determination of the measures a computationally feasible strategy.
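The computation of $λ$ from Eq. (4) and of coalition measures from Eq. (3) can be sketched as below. This is an illustrative NumPy sketch and not the released implementation. The function names `solve_sugeno_lambda` and `coalition_measure` are hypothetical, and the polynomial root-finding approach is one possible choice among several.

```python
import numpy as np


def solve_sugeno_lambda(densities, tol=1e-9):
    """Return the root lambda > -1 of Eq. (4) for the given singleton measures."""
    g = np.asarray(densities, dtype=float)
    if abs(g.sum() - 1.0) < tol:
        return 0.0  # measures already add up to 1, so the measure is additive
    # Polynomial prod_n (1 + g_n * lam) - (1 + lam), in the variable lam.
    poly = np.poly1d([1.0])
    for g_n in g:
        poly = poly * np.poly1d([g_n, 1.0])
    poly = poly - np.poly1d([1.0, 1.0])
    # Keep real roots above -1 and discard the trivial root lam = 0.
    roots = [r.real for r in poly.roots
             if abs(r.imag) < tol and r.real > -1.0 and abs(r.real) > tol]
    if not roots:
        raise ValueError("no admissible lambda for these measures")
    return roots[0]


def coalition_measure(densities, lam):
    """Measure of the coalition of all given sources, built up with Eq. (3)."""
    measure = 0.0
    for g_n in densities:
        measure = measure + g_n + lam * measure * g_n
    return measure
```

With a valid $λ$, applying `coalition_measure` to all $N$ sources returns 1, which is a convenient sanity check on the solved root.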

The Choquet integral described in the study [31] can implement linear algebraic combinations such as sum and product, and thus generalizes empirical ensemble strategies such as average and multiplication. It is a form of non-linear aggregation operation, and its performance depends on the choice of fuzzy measures. From the definition of the integration operator, it can be expanded as

I_{f} (C) = \sum_{i = 1}^{N} c_{π_{i}} [f (g_{π_{i}}) - f (g_{π_{i - 1}})],

(5)

where, the set of scores $C$ is permuted to $C_{π}$ such that

c_{π_{1}} \geq c_{π_{2}} \geq \dots \geq c_{π_{N - 1}} \geq c_{π_{N}},

(6)

and $g_{π_{i}}$ is the subset of the $i$ highest scores in $C_{π}$, with its measure $f (g_{π_{i}})$ given by the Sugeno fuzzy $λ$-measure and $f (g_{π_{0}}) = 0$ for the empty subset. $I_{f} (C)$ is used to generate $fzpred$ in Algorithm 1.

The Choquet integral uses both the fuzzy weight assigned to a classifier score and the confidence of the score itself. It can be inferred that $f (g_{π_{i}})$ depends upon $f (g_{π_{i - 1}})$. Algorithm 1 makes many decisions based on the different confidence scores, leading to a sensitive and exhaustive decision-making process based on the coalitions of input scores, which proved more effective in our experiments than relying on the plain softmax probabilities. The time complexity of the process is $O (M \times N \log N)$, with $N$ representing the number of classifiers and $M$ representing the number of classes.
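The aggregation in Eqs. (5) and (6) can be sketched as follows. This is an illustrative sketch and is not a transcription of Algorithm 1. The names `choquet_integral` and `fuse_predictions` are hypothetical, and $λ$ is assumed to have been solved from Eq. (4) beforehand. The coalition measure is grown one classifier at a time with Eq. (3), and the per-class sort accounts for the $N \log N$ factor.

```python
import numpy as np


def choquet_integral(scores, densities, lam):
    """Choquet integral of one class's scores w.r.t. a Sugeno lambda-measure."""
    scores = np.asarray(scores, dtype=float)
    densities = np.asarray(densities, dtype=float)
    order = np.argsort(-scores)      # classifier indices, highest score first
    total = 0.0
    prev_measure = 0.0               # f(g_{pi_0}) = 0 for the empty coalition
    for idx in order:
        # Grow the coalition by one classifier using Eq. (3).
        measure = prev_measure + densities[idx] + lam * prev_measure * densities[idx]
        total += scores[idx] * (measure - prev_measure)
        prev_measure = measure
    return total


def fuse_predictions(score_matrix, densities, lam):
    """Fuse an (N classifiers x M classes) matrix; return (class, fused scores)."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    fused = np.array([choquet_integral(score_matrix[:, m], densities, lam)
                      for m in range(score_matrix.shape[1])])
    return int(np.argmax(fused)), fused
```

When the measures sum to 1 and $λ = 0$, the result reduces to a weighted average. For other values of $λ$, the effective weight of each classifier depends on where its score falls in the sorted order for that sample.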

Among empirical ensemble based methods, the unweighted averaging scheme is the most commonly used. This has the natural advantage of reducing the variance of CNN classifiers, as deep learning based stochastic methods have high variance and low bias. However, when the ensemble network consists of heterogeneous learners such as in our case, even if the classifiers have a comparable performance, the unweighted averaging scheme is vulnerable to the situation in which a weak learner is given higher weightage, or when an overconfident candidate leads to incorrect predictions.

The weighted average ensemble strategy is another empirical strategy, wherein the weights for each learner are determined experimentally. This allows a degree of adaptive combination of learners, such as the case in which a weaker learner might be good at predicting certain classes.

However, the determination of the weights in this strategy is a one-time process. There is no opportunity to fine-tune or update these weights at the inference time, hence this strategy is not dynamic, unlike the principle of fuzzy fusion. To be specific, the fuzzy fusion allows fine-tuning of these weights for each classifier on the fly, and does so on the basis of the predictions for each individual sample of data. Subsets or coalitions of multiple classifier predictions are processed with their corresponding fuzzy measures at intermediate stages of the ensemble strategy. Thus, there is a scope for further refinement even after the fuzzy measures have been determined, unlike the weights in typical averaging methods.
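For contrast, the static empirical rules can be sketched as below. The rule definitions are the common textbook forms and are an assumption, since the text does not spell them out; the function name `static_ensembles` is hypothetical. In every rule the combination is fixed before inference and is applied identically to each sample.

```python
import numpy as np


def static_ensembles(score_matrix, weights):
    """Predicted class under four fixed rules for an (N x M) score matrix."""
    scores = np.asarray(score_matrix, dtype=float)  # rows: classifiers
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()               # normalize to sum to 1
    fused = {
        "maximum": scores.max(axis=0),
        "multiplication": scores.prod(axis=0),
        "average": scores.mean(axis=0),
        "weighted_average": weights @ scores,
    }
    return {name: int(np.argmax(values)) for name, values in fused.items()}
```

For example, with scores `[[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]`, the first classifier's overconfident vote decides the maximum, multiplication and average rules, while weights of `[0.1, 0.45, 0.45]` flip the weighted average to the second class.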

4. Results and analysis

We have performed several experiments upon multiple datasets which demonstrate the robustness of our proposed method. In this section, we discuss empirical details about our method and interpret the results which we have obtained.

4.1. Data description

The proposed method is applied to four medical imaging datasets. Detailed descriptions of the datasets used in the present work are given in the following subsections.

4.1.1. COVID-X (CXRs)

The COVID-X database, introduced in the work [18], has been utilized in this paper. To the best of our knowledge, this is the largest open access COVID-19 X-ray dataset currently available, consisting of $15471$ CXR images. This dataset has been generated by merging five different repositories of chest X-ray scans. It consists of three different classes of scans: COVID-19 positive patients, pneumonia infected patients, and normal patients. The distribution of data used in this current work is shown in Table 1. We have utilized the train-test split as per the labels provided by the authors.

Table 1. Class-wise distribution of CXR samples in the COVID-X dataset.

Phase	COVID-19	Pneumonia	Normal	Total
Train	468	5458	7966	13892
Test	100	594	885	1579

Method	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	Specificity (%)	FPR (%)	AUC (%)	MCC (%)	McNemar’s Test $p$-value
DenseNet 201	98.79	98.38	99.18	98.78	98.40	1.59	98.79	97.58	–
Inception v3	97.18	97.02	97.28	98.92	97.07	2.92	97.18	94.36	0.0247
Inception ResNet v2	97.58	97.30	97.83	97.56	97.34	2.65	97.58	95.16	0.0323
ResNet 152 v2	96.24	96.20	96.20	96.20	96.27	3.72	96.24	92.48	0.0085
EfficientNet B7	97.44	98.88	95.93	97.38	98.93	1.06	97.43	94.93	0.0441
Xception	97.18	98.06	96.20	97.12	98.13	1.86	97.17	94.37	0.0247
VGG 19	97.71	97.82	97.56	97.69	97.87	2.12	97.71	95.43	0.05330

		Predicted
		COVID-19	Non-COVID-19
True	COVID-19	365	11
True	Non-COVID-19	10	359

CNN Model	McNemar’s Test $p$-value
Inception v3	0.0360
Inception ResNet v2	0.0442
DenseNet 201	0.0483

Distribution $P$	Distribution $Q$	$D_{KL} (P \| Q)$	$D_{JS} (P \| Q)$
Inception V3	DenseNet 201	0.3452	0.1020
DenseNet 201	Inception V3	0.2202	0.1020

Inception ResNet v2	DenseNet 201	0.3245	0.1262
DenseNet 201	Inception ResNet v2	0.2506	0.1262

Inception V3	Inception ResNet v2	0.3330	0.0983
Inception ResNet v2	Inception V3	0.3100	0.0983

CNN Model	Accuracy (%)	Fuzzy Measure
Inception v3	95.06	0.038
Inception ResNet v2	94.62	0.015
DenseNet 201	95.88	0.074

CNN Model	Accuracy (%)	Fuzzy Measure
Inception v3	99.36	0.030
Inception ResNet v2	99.36	0.043
DenseNet 201	99.36	0.026

Ensemble Method	Accuracy (%) 3-Class	Accuracy (%) 2-Class
Maximum	94.68	99.36
Multiplication	95.22	99.36
Average	95.87	99.41
Weighted Average	96.20	99.46

Fuzzy	96.39	99.49

		Predicted
		COVID-19	Normal	Pneumonia
	COVID-19	95	5	0
True	Normal	0	870	15
	Pneumonia	2	35	557

Metric (%)	COVID-19	Normal	Pneumonia	Overall
Accuracy	99.56	96.51	96.70	96.39
Precision	97.94	95.60	97.38	96.97
Recall	95.00	98.30	93.77	95.69
F1-Score	96.44	96.93	95.54	96.30
Specificity	99.86	94.24	98.47	97.52
FPR	0.135	5.764	1.523	2.474
AUC	97.43	96.27	96.12	–
MCC	96.22	92.95	92.97	93.31
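The per-class figures reported with the three-class confusion matrix can be recomputed with a short one-vs-rest calculation. This is a sketch using the standard metric definitions, which the tables appear to follow; the function name `per_class_metrics` is hypothetical, and AUC and MCC are left out.

```python
import numpy as np


def per_class_metrics(confusion):
    """One-vs-rest metrics from a confusion matrix (rows true, columns predicted)."""
    cm = np.asarray(confusion, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as the class but belonging to another
    fn = cm.sum(axis=1) - tp   # belonging to the class but predicted as another
    tn = total - tp - fp - fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),
        "overall_accuracy": tp.sum() / total,
    }
```

Applied to the matrix `[[95, 5, 0], [0, 870, 15], [2, 35, 557]]`, this gives a COVID-19 precision of 95/97 and recall of 95/100, and an overall accuracy of 1522/1579, in line with the tabulated 97.94%, 95.00% and 96.39%.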

		Predicted
		COVID-19	Normal	Viral Pneumonia
	COVID-19	120	0	0
True	Normal	0	133	1
	Viral Pneumonia	0	1	133

Metric (%)	COVID-19	Non COVID-19	Overall
Accuracy	99.49	99.49	99.49
Precision	98.94	99.52	99.23
Recall	93.00	99.93	96.46
F1-Score	95.88	99.73	97.80
Specificity	93.00	99.93	96.46
FPR	0.068	7.00	3.53
AUC	96.46	96.46	–
MCC	95.66	95.66	95.66

Metric (%)	COVID-19	Normal	Viral Pneumonia	Overall
Accuracy	100.00	99.46	99.46	99.49
Precision	100.00	99.25	99.25	99.50
Recall	100.00	99.25	99.25	99.50
F1-Score	100.00	99.25	99.25	99.50
Specificity	100.00	99.61	99.61	99.74
FPR	0.00	0.394	0.394	0.262
AUC	100.00	99.43	99.43	–
MCC	100.00	98.86	98.86	99.22

Metric (%)	Normal	Tuberculosis	Overall
Accuracy	96.43	96.43	96.43
Precision	94.11	100.00	97.06
Recall	100.00	91.67	95.83
F1-Score	96.97	95.65	96.31
Specificity	91.67	100.00	95.83
FPR	8.333	0.000	4.167
AUC	95.83	95.83	–
MCC	92.88	92.88	92.88

Fold Number	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	Specificity (%)	FPR (%)	AUC (%)	MCC (%)
1	99.80	100.00	99.59	99.79	100.00	0.000	99.79	99.59
2	99.60	100.00	99.19	99.59	100.00	0.000	99.59	99.20
3	99.79	99.59	100.00	99.79	99.60	0.0040	99.80	99.59
4	99.59	99.19	100.00	99.59	99.20	0.0079	99.60	99.35
5	99.60	98.38	99.60	99.60	99.60	0.0040	99.60	99.49

Average	99.68	98.38	99.68	99.68	99.68	0.0032	99.68	99.44

Method	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	Specificity (%)	FPR (%)	AUC (%)	MCC (%)
COFE-Net	98.93	98.40	99.46	98.92	98.40	1.59	98.93	97.86
Inception V3	97.18	97.02	97.28	97.15	97.07	2.92	97.18	94.36
Inception ResNet V2	97.58	97.30	97.83	97.56	97.34	2.65	97.58	95.16
DenseNet 201	98.79	98.38	99.18	98.78	98.40	1.59	98.79	97.58

		Predicted
		Normal	Tuberculosis
True	Normal	16	0
True	Tuberculosis	1	11
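
As a worked check, the accuracy and Matthews correlation coefficient (MCC) reported for the Normal versus Tuberculosis task follow directly from this confusion matrix. The sketch assumes the standard binary MCC formula and treats Tuberculosis as the positive class. For a binary problem the MCC is the same whichever class is taken as positive.

```python
from math import sqrt

# Counts read from the confusion matrix, with Tuberculosis as positive.
tp, fn = 11, 1   # true Tuberculosis: 11 predicted Tuberculosis, 1 predicted Normal
tn, fp = 16, 0   # true Normal: 16 predicted Normal, 0 predicted Tuberculosis


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


accuracy = 100 * (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy:.2f} %")
print(f"MCC: {100 * mcc(tp, tn, fp, fn):.2f} %")
```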

Method	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	Specificity (%)	FPR (%)	MCC (%)
Inception V3	95.06	96.77	94.01	95.30	95.50	3.491	90.86
Inception ResNet V2	94.62	93.07	91.81	92.42	96.59	3.41	90.05
DenseNet 201	95.88	96.03	95.02	95.50	97.22	2.77	92.38

Inception V3 and Inception ResNet V2	95.95	94.88	94.42	94.65	97.44	2.557	92.50
Inception V3 and DenseNet 201	96.07	96.76	95.15	95.92	97.30	2.700	92.73
Inception ResNet V2 and DenseNet 201	96.20	95.28	94.36	94.81	97.60	2.393	92.96

Ensemble of all three	96.39	96.97	95.69	96.30	97.52	2.474	93.31

Method	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	Specificity (%)	FPR (%)	AUC (%)	MCC (%)
Inception V3	99.36	98.67	95.93	97.25	95.93	4.06	95.93	94.56
Inception ResNet V2	99.36	99.15	95.46	97.22	95.46	4.53	95.46	94.56
DenseNet 201	99.36	98.67	95.93	97.25	95.93	4.06	95.93	94.56

Inception V3 and Inception ResNet V2	99.36	99.15	95.46	97.22	95.46	4.534	95.46	94.54
Inception V3 and DenseNet 201	99.43	98.71	96.43	97.54	96.43	3.56	96.43	95.11
Inception ResNet V2 and DenseNet 201	99.43	99.19	95.96	97.51	95.96	4.03	95.96	95.10

Ensemble of all three	99.49	99.23	96.46	97.80	96.46	3.53	96.46	95.66

Method	Data Distribution	Accuracy (%)
COVID-Net [18]	358 COVID-19, 5538 Pneumonia, 8066 Normal	93.3
COVID-ResNet [22]	68 COVID-19, 1591 Pneumonia, 1203 Normal	96.23
COVID-CAPS [27]	Not specified	98.3
COVIDiagnosis-Net [25]	76 COVID-19, 4290 Pneumonia, 1583 Normal	98.26

COFE-Net	568 COVID-19, 6052 Pneumonia, 8851 Normal	96.39
COFE-Net	568 COVID-19, 14903 nonCOVID-19	99.49

Method	Data Distribution	Accuracy (%)
Transfer Learning Dataset 1 [20]	224 COVID-19, 700 Pneumonia, 504 Normal	93.48
Transfer Learning Dataset 2 [20]	224 COVID-19, 714 Pneumonia, 504 Normal	94.72
Majority Voting ML [16]	782 COVID-19, 782 Pneumonia, 782 Normal	93.41
DenseNet201 [21]	423 COVID-19, 1485 Pneumonia, 1579 Normal	97.94
Cascaded CNNs [24]	69 COVID-19, 79 Bact. Pneumonia, 79 Viral Pneumonia, 79 Normal	99.9
CoroNet Dataset 1 [23]	284 COVID-19, 657 Pneumonia, 310 Normal	95.0
CoroNet Dataset 2 [23]	157 COVID-19, 500 Pneumonia, 500 Normal	90.21
Stacked VGG Ensemble [26]	219 COVID-19, 1345 Pneumonia, 1341 Normal	97.4
Pruned Weighted Average [28]	313 COVID-19, 8792 Pneumonia, 7595 Normal	99.01

COFE-Net	568 COVID-19, 6052 Pneumonia, 8851 Normal	96.39

Method	Data Distribution	Accuracy (%)
Transfer Learning Dataset 1 [20]	224 COVID-19, 1204 nonCOVID-19	98.75
Transfer Learning Dataset 2 [20]	224 COVID-19, 1214 nonCOVID-19	96.78
DenseNet201 [21]	423 COVID-19, 3064 nonCOVID-19	99.70
CoroNet Dataset 1 [23]	284 COVID-19, 967 nonCOVID-19	99.0
DarkCovidNet [17]	127 COVID-19, 500 nonCOVID-19	98.08
Majority Voting ML [16]	782 COVID-19, 1564 nonCOVID-19	98.06
Stacked VGG Ensemble [26]	219 COVID-19, 2686 nonCOVID-19	99.48
Class Decomposition [41]	116 COVID-19, 80 nonCOVID-19	97.35

COFE-Net	568 COVID-19, 14903 nonCOVID-19	99.49

Method	Accuracy (%)
VGG 19 [20]	93.00
Transfer Learning [42]	98.29
AlexNet [43]	$97.59 \pm 0.60$

COFE-Net	99.49

Method	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	Specificity (%)
xDNN [44]	88.60	89.70	88.60	89.15	–
Transfer Learning [45]	94.04	95.00	94.00	94.50	95.86
Bi-stage FS [13]	95.32	95.30	95.30	95.30	–
DenseNet 201 [46]	96.25	96.29	96.29	96.29	96.21
KarNet [47]	97.00	95.00	98.00	97.00	95.00
Gabor Ensemble [48]	97.40	99.10	95.50	97.30	–

COFE-Net	98.93	98.40	99.46	98.92	98.40

Method	Accuracy (%)
FRCNN [49]	92.60
HDHFS [50]	92.70
HCDEL [51]	93.47
VoPreCNNFT [52]	97.50

COFE-Net	96.43
