2023 Feb 17;11:100215. doi: 10.1016/j.rico.2023.100215

A novel ensemble CNN model for COVID-19 classification in computerized tomography scans

Lúcio Flávio de Jesus Silva a, Omar Andres Carmona Cortes b, João Otávio Bandeira Diniz c
PMCID: PMC9936787

Abstract

COVID-19 is a rapidly spreading infectious disease caused by a severe acute respiratory syndrome coronavirus that can lead to death in just a few days. Thus, early disease detection can provide more time for successful treatment or action, even though an efficient treatment is unknown so far. In this context, this work proposes and investigates four ensemble CNNs using transfer learning and compares them with state-of-art CNN architectures. To select which models to use, we tested 11 state-of-art CNN architectures: DenseNet121, DenseNet169, DenseNet201, VGG16, VGG19, Xception, ResNet50, ResNet50v2, InceptionV3, MobileNet, and MobileNetv2. We used a public dataset comprising 2477 computerized tomography images divided into two classes: patients diagnosed with COVID-19 and patients with a negative diagnosis. Then three architectures were selected: DenseNet169, VGG16, and Xception. Finally, the ensemble models were tested in all possible combinations. The results showed that the ensemble models tend to present the best results. Moreover, the best ensemble CNN, called EnsembleDVX, comprising all three CNNs, provides the best results, achieving an average accuracy of 97.7%, an average precision of 97.7%, an average recall of 97.8%, and an average F1 score of 97.7%.

Keywords: COVID-19, CT image, Transfer learning, Ensemble, CNN

1. Introduction

The coronavirus disease outbreak of 2020 (COVID-19) is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and has led to millions of infections and deaths worldwide. The World Health Organization (WHO) released the Situation Report registering 567 million confirmed cases and over 6.3 million deaths worldwide by July 27, 2022 [1]. Moreover, the manifestations of COVID-19 go beyond lung problems, causing heart infections as well, resulting in fatalities, which seriously threaten the health of the entire world [2]. The severity of the illness and its outbreak caused health systems to collapse and called into question society’s preparedness in the face of unknown virus pandemics, where immediate diagnosis and isolation could prevent a mass spread.

One way of helping to diagnose COVID-19 is by using Computerized Tomography (CT), a diagnostic imaging exam consisting of an image representing a section or “slice” of the body. It is obtained through computer processing of the data collected after exposing the body to a succession of X-rays. Its main method is to analyze an X-ray beam’s attenuation as it travels through a segment of the body [3].

Among the characteristics of tomographic images, pixels, matrix, field of view, grayscale, and windows stand out. The pixel is the smallest point that can be obtained in an image. The greater the number of pixels in a matrix, the better its spatial resolution, which allows for better spatial differentiation between structures and is therefore essential for obtaining a more precise diagnosis.

Thus, due to the development of Computer-Aided Disease Diagnostic Systems associated with machine learning and deep learning algorithms, the ability to detect the presence or absence of a disease of interest faster is a reality. Hence, deep learning algorithms, mainly Convolutional Neural Networks (CNNs), have been successfully used to process and analyze digital images, including different kinds of medical images such as CTs. Even though several works have presented results using X-ray images, CT scans tend to be more efficient for two reasons. Firstly, a CT scan gives a detailed 3-dimensional view of the diagnosed organ, whereas X-rays give a 2-D view. Secondly, the CT scan does not overlap organs, whereas, in X-rays, ribs overlap the lungs and heart [4].

Therefore, studying new classification algorithms using CT scans, especially CNN architectures, in the analysis and classification of medical images can help professionals and researchers develop new systems, provide more precise diagnoses, and start the treatment as soon as possible. In this context, we propose to use ensemble learning through the combination of CNN’s features to obtain more precise metrics such as accuracy, precision, recall, and F1 Score. Ensemble systems have proven to be efficient and highly versatile in various problem domains, and real-world applications [5].

Initially, ensemble systems were developed to reduce the variance of decision-making systems. However, due to their success in dealing with machine learning problems, they have received attention from the research community. In this context, we propose combining the models using a fully connected neural network whose output is a pair of softmax neurons. Furthermore, we tested 11 CNN architectures to decide which ones to combine to create the best ensemble model. We combined the best three of them, yielding four ensemble models.

In this context, we ensemble three parallel CNNs using transfer learning with fine-tuning. Our contributions are as follows:

  • We propose an ensemble model that uses an Artificial Neural Network (ANN) in a level-1 stacking model instead of a voting system.

  • We provide a snapshot of deep learning models using CT scans, which guided the choice of the CNN models.

  • We evaluated our proposal experimentally, showing promising results.

Hence this paper is divided as follows: Section 2 shows related works; Section 3 illustrates the concepts involved in this work and shows our proposal and how classifications are decided; Section 4 presents the experimental setup, including information about the dataset, the results of the experiments, and it discusses the importance of the study, the final results, and limitations of the approach; finally, Section 5 shows the conclusions and future work.

2. Related work

The state-of-art CNN architectures, typically used for image classification, were usually pre-trained using the ImageNet dataset [6]. Moreover, these state-of-art architectures became available in Python libraries such as TensorFlow and Keras, making their applications popular in different fields. We can cite several works using them as classification tools in biomedical applications such as [7], [8], among many others. Specifically, in COVID-19 applications, some works using CNNs for classification can be remarked.

Yang et al. [9] tested DenseNet169 and ResNet-50 in some different CT datasets, reaching the best results of 61.1% precision and 86.9% f1-score. Li et al. [10] proposed a CNN called COVNet, a ResNet50-based neural network, for identifying COVID-19 in CT scans. They achieved a recall of 90% and specificity of 96% for detecting COVID-19. Xu et al. [11] developed a system to identify patients with COVID-19 using a CNN called DRE-Net, also a ResNet50-based CNN, and compared it against VGG16, DenseNet, and ResNet; however, they do not specify which ResNet nor which DenseNet was used. They reached a precision of 79%, a recall of 96%, an f1-score of 87%, and an accuracy of 86%. Xu et al. [12] proposed a ResNet-18 connected to another proposed CNN with eight layers, then compared the results against the traditional ResNet-18. The proposal reached a precision of 76.5%, a recall of 68.9%, and an f1-score of 72.5%. Ashour et al. [13] use an ensemble bag-of-features to classify COVID-19 in X-ray images.

Even though works using non-ensemble methods reached good results for a new and barely known disease, researchers can seek better results using combined methods, the so-called ensemble learning algorithms. This approach has been successfully applied in medical and biomedical fields. Zheng et al. [14], for example, join deep learning and ensemble learning to extract a category-dependent representation of electroencephalography (EEG) signals. Tang et al. [15] use parallel CNNs for face recognition, then they use a voting system to finally classify the input. In the same sense, Brunese et al. [16] employ parallel machine learning classifiers for diagnosing brain tumors, and the classification is also decided by voting. In Shudarson & Kokil [17], an ensemble CNN model comprising ResNet-101 [18], MobileNet-v2 [19], and ShuffleNet [20] working in parallel was used for classifying B-mode kidney ultrasound images; the classification is also performed by voting.

Regarding ensemble models applied for detecting COVID-19, other works using X-ray images have been proposed. Das et al. [21] suggested an ensemble model comprising DenseNet201, ResNet50v2, and InceptionV3 that execute in parallel; then, the classification is done based on the weighted average ensembling of the three models in an X-ray dataset. Rahimzade and Attar [22] used two CNNs (Xception and ResNet50_v2) in parallel and then concatenated the results using another CNN followed by a fully-connected Multilayer Perceptron (MLP) in an X-ray dataset. Islam et al. [23] feed the output of a 5-layered CNN into a bag of LSTM (Long Short-Term Memory) followed by a fully-connected network that provides the classification in an X-ray dataset. Recent studies using X-rays are Gopatoti & Vijayalakshmi [2], Hamza et al. [24], and [25].

Concerning CT images, Mobiny et al. [26] use a Detail-Oriented Capsule Networks (DECAPS), a ResNet with three residual blocks as the base network, followed by a 1 × 1 convolutional layer. The remaining layers are capsule layers. Moreover, they proposed a Peekaboo training method for the proposed CNN. This proposal reached a precision of 84.3%, a recall of 91.5%, an F1-Score of 87.1%, and an accuracy of 87.6%.

Singh et al. [4] proposed a VGG16 for feature extraction followed by a PCA algorithm for dimensionality reduction, then an ensemble SVM bagging approach does the classification. The proposed ensemble method got a precision of 95.7%, an F1-Score of 95.3%, and an accuracy of 95.7%.

Song et al. [27] proposed an ensemble model composed of a pre-processing stage and a DRENet. In the pre-processing stage, the image is scaled, sub-images with the interesting areas are created, and relational features are obtained. The DRENet consists of three parallel ResNet50 networks, each receiving a result of the pre-processing stage. An MLP combines the output of each ResNet. They got a precision, a recall, an F1-Score, and an accuracy of 93%.

Foysal & Aowlad Hossain [28] use three independent CNNs. The first and the third ones have three convolutional plus max-pooling layers, varying the existence of dropout on each layer. The second model is composed of four layers. Then, each model is followed by flattened dense models with two or three layers. The final classification is obtained by voting. They reached results of 95.6% in F1-Score, 97% in recall, and 96% in accuracy.

Ahmed et al. [29] combine 2D and 3D approaches in a method called IST-CovNet, in which they mix an Inception-ResNet-v2 with attention mechanism and LSTM network, reaching an accuracy of 93.69%.

Kundu presents two types of research using stacking models. In the first one, [30], the authors stacked InceptionV3, ResNet34, and DenseNet201, ensembling the results by averaging the probability of each CNN, resulting in accuracy, precision, recall, and F1-Score of 98% or so. In the second paper [31], the authors stacked VGG-11, InceptionV3, and ResNet50, ensembling the results using Fuzzy Rank-based Fusion, also reaching an accuracy, precision, recall, and F1-Score of 98% or so.

Kini et al. [32] ensemble three models (ResNet152V2, DenseNet201, and Inception-ResNet-V2) trained using 10-folds in a four-classed dataset and embedded it in an IoT device reaching a precision of 99.08%, a recall of 99.28%, an f1-score of 98.91%, and an accuracy of 99.12%.

Table 1 summarizes the results of the related works, in which we can see that ensemble models tend to present better results than non-ensemble ones. For more information, we recommend the reviews from Subramanian et al. [34] and Portela et al. [35]. As previously mentioned, we ensemble three parallel CNNs using transfer learning in this work. The final classification is performed by an ANN instead of the traditional voting or average system.

Table 1.

Results of the state-of-art CNNs using ensemble and non-ensemble approaches.

Ref.   Year   Ens.   Precision   Recall   F1 Score   Accuracy
[9]    2020   No     61.1%       -        86%        -
[10]   2020   No     -           90%      -          -
[11]   2020   No     79%         96%      87%        86%
[12]   2020   No     68.9%       76.5%    72.5%      -
[33]   2021   No     99.13%      92.49%   95.70%     93.87%
[2]    2022   No     76.5%       68.9%    72.5%      -
[26]   2020   Yes    84.3%       91.5%    87.1%      87.6%
[4]    2021   Yes    95.7%       -        95.3%      95.7%
[27]   2021   Yes    93%         93%      93%        93%
[28]   2021   Yes    -           97%      95.6%      96%
[31]   2021   Yes    98.93%      98.93%   98.93%     98.93%
[29]   2022   Yes    -           -        -          93.69%
[30]   2022   Yes    97.81%      97.77%   97.81%     97.77%
[32]   2022   Yes    99.08%      99.28%   98.91%     99.12%

3. Material and method

3.1. Convolutional Neural Networks (CNNs)

A CNN architecture is built with three components: convolutional layers, pooling layers, and a fully connected layer. Convolutional layers use filters that examine a part of the image and extract features from it. These features are usually colors, shapes, and borders that define a specific image [36].

A convolutional neural network (CNN) is a variation of the Multilayer Perceptron (MLP), whose purpose is to work on images. By biological inspiration, CNNs emulate the primary mechanism of the animal’s visual cortex. However, unlike MLPs, in which each neuron has separate weights, CNNs share weights. Moreover, a CNN can have as many convolutional layers as needed. The more convolutional layers, the more features are extracted. On the other hand, the more convolutional layers, the more computational resources are needed.

Using the weight-sharing strategy, neurons are able to perform convolutions on the data with the convolution filter formed by the weights. Then, the convolutional layer is followed by a pooling operation (pooling layer), a form of non-linear subsampling that progressively reduces the representation’s spatial size, diminishing the number of parameters and the computation required to train and use the CNN. All in all, pooling layers are responsible for grouping the feature maps in an image [36], providing a dimensionality reduction by replacing the values within each box of the matrix produced by the convolutional layer with their maximum or average.

After a set of layers comprising convolution and pooling, the input matrix’s size has been reduced, and complex characteristics have been extracted. Eventually, with a small enough feature map, the content is flattened into a one-dimensional vector and fed into a fully connected MLP for processing.
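The convolution, pooling, and flattening steps described above can be illustrated with a minimal pure-Python sketch. The 5 x 5 toy image, the 2 x 2 edge filter, and all function names are hypothetical and chosen only for illustration; real CNNs learn their filters and operate on multi-channel tensors.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap):
    """Element-wise ReLU activation."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Keep the maximum of each non-overlapping size x size box."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]


def flatten(fmap):
    """Turn the 2-D feature map into the 1-D vector fed to the MLP."""
    return [v for row in fmap for v in row]


# Toy image with a vertical edge between columns 1 and 2 (illustrative data).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical filter that responds to dark-to-bright vertical edges.
kernel = [[-1, 1], [-1, 1]]

features = flatten(max_pool(relu(conv2d_valid(image, kernel))))
print(features)
```

The same four kernel weights are reused at every image position, which is the weight sharing mentioned above, and the pooling step reduces the 4 x 4 feature map to 2 x 2 before flattening.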

Fig. 1 presents an example of a CNN devised by three pairs of convolutional-pooling layers followed by a fully connected MLP with five layers. Usually, the output layer uses a softmax activation function for classification, while the other layers utilize ReLU as an activation function.

Fig. 1. A CNN example with 11 layers. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

3.2. Transfer learning

Machine learning models are mostly built to work on a specific problem, i.e., we need to rebuild them when the data changes. On the other hand, the previous knowledge of those models can usually be applied to similar tasks. Thus, rather than rebuilding the model, which requires a lot of effort, especially in image-based algorithms, transfer learning reuses the model and the acquired knowledge, consequently decreasing the model development time and usually improving the model performance [37].

As we stated previously, transfer learning is reusing a pre-trained model in a new problem. In other words, the model has already been trained using a different image dataset than the one we are willing to use. A typical dataset used in pre-training is the ImageNet dataset, comprising millions of images from nature in several categories. Training on such a large mass of data is very expensive; it can take many days on advanced GPUs and consumes a lot of energy and, consequently, money.

In short, the purpose of transfer learning is to take advantage of all the knowledge generated using these images, whether in the initial layers, which recognize edges, curves, and colors, or in the deeper convolutional layers, which learn textures and smoothness. The most specialized layers are tied to the original task and are of little use for the new one, so only the layers that recognize basic details, which theoretically every image has, are reused.

3.3. Ensemble learning

Ensemble learning refers to the procedures used to train various learning models and combine their results, treating them as a “committee” of decision-makers. The principle is that individual classifications, combined appropriately, can achieve better overall accuracy, on average, than any individual model.

According to Brown [38], numerous empirical and theoretical studies have shown that ensemble models often achieve greater precision than individual models. The ensemble members may predict real-valued numbers, class labels, posterior probabilities, classifications, or any other data. According to Faceli et al. [39], ensemble methods can be homogeneous or heterogeneous. In the first one, different models are generated by the same algorithm using different training samples. The most prominent examples of this kind of ensemble method are bagging [40] and boosting [41].

In the heterogeneous methods, there are four types of ensemble algorithms: stacking [42], cascade [43], meta-learning, and hybrid. In the stacking method, the first level (level zero) is composed of different models that can be generated from different algorithms; then, they are put together using a classifier model in level one. The cascade method is a sequence of classifiers in which each classifier can construct new features, and the last one gives the final classification. In meta-learning, an arbiter selects the classification from one of the models [44]. Finally, in the hybrid model, different regions of the input space are approximated using different models and algorithms [39]. In this context, we propose a heterogeneous stacking ensemble method composed of three different CNN models in level zero and a fully connected neural network to combine the CNN models in level one.

All in all, the main principle of ensemble methods is to decrease their general susceptibility to errors, making them more robust. Therefore, ensemble methods must consider how they group the models, associating the algorithms to minimize their individual disadvantages in the final model. In the case of regression, the ensemble prediction is calculated as the average of the member predictions. In predicting a class label, the forecast is calculated as the mode of the member predictions. In the case of predicting a class probability, the prediction can be computed as the argmax of the summed probabilities for each class label.
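The three combination rules just described (average for regression, mode for class labels, argmax of summed probabilities for class probabilities) can be sketched as follows. The member outputs and the class ordering (0 = non-COVID, 1 = COVID) are made-up values for illustration only, and the function names are hypothetical.

```python
from collections import Counter


def average_regression(predictions):
    """Regression: the ensemble prediction is the mean of the members."""
    return sum(predictions) / len(predictions)


def majority_vote(labels):
    """Class labels: the ensemble prediction is the mode of the members."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(prob_rows):
    """Class probabilities: argmax of the per-class summed probabilities."""
    n_classes = len(prob_rows[0])
    summed = [sum(row[k] for row in prob_rows) for k in range(n_classes)]
    return max(range(n_classes), key=summed.__getitem__)


# Illustrative outputs of three members for a single image.
member_probs = [[0.45, 0.55], [0.40, 0.60], [0.90, 0.10]]
member_labels = [max(range(2), key=p.__getitem__) for p in member_probs]

print(majority_vote(member_labels))  # two of three members vote for class 1
print(soft_vote(member_probs))       # summed probabilities favor class 0
```

In this toy case the two rules disagree: two weakly confident members outvote one highly confident member under hard voting, while the summed probabilities favor the confident member. Both rules still give every member the same weight.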

A limitation of this approach is that each model has an equal contribution to the final prediction. For example, considering a classification problem in Random Forest [45], several decision trees obtain their classification. All of them have the same importance. Thus, the final classification is made by voting: the class with the most votes wins, which is similar to the wisdom of the crowd. This model is known as a homogeneous ensemble approach because the same algorithm generates all models. The diversity between models is obtained by creating multiple hypotheses on different training samples.

On the other hand, we used a heterogeneous algorithm, in which models are usually created by different algorithms that execute in parallel. Then, the output of each model is combined to obtain the final classification. The voting system can be used for this task; however, because much more complex algorithms are used to create models, the number of models in the stacking approach is normally small, which does not represent the crowd’s wisdom. In this context, other combination approaches have been proposed, such as arbiter trees, combiner trees, grading, and gating networks [46]. Recent applications of stacking, such as [27], [28], [47], [48], [49], have shown that this approach can produce more stable results in complex problems than a traditional approach based on a single model. That is the reason why we chose the stacking approach in this investigation.

3.4. Method: Ensemble CNNs

The EnsembleDVX model was implemented by combining three different architectures: DenseNet169, VGG16, and Xception. The subnets have been incorporated into a larger neural network with several heads, which can then learn how to best combine the predictions of each input submodel, thus allowing the stacking ensemble to be treated as a single large model.

The benefit of this approach is that the submodels’ results are provided directly to a single model. Additionally, it is also possible to update the submodel weights in conjunction with the single model if necessary.

Each model’s outputs were merged using a simple concatenation merge neural network devised by three layers with six, ten, and two neurons, respectively, in which a single vector was created from the two-class probabilities predicted by each model. A hidden layer was then defined using “ReLU” as the activation function for the model and an output layer with softmax to separate the classes to make its probabilistic forecast. A graph of the created network can be seen in Fig. 2.
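As a rough illustration of the data flow through this merge network, the sketch below concatenates three two-class probability vectors into six inputs, passes them through a ten-neuron ReLU layer, and ends with a two-neuron softmax. It is a pure-Python forward pass with random, untrained weights; the initialization, the seed, and all names are assumptions, whereas in the paper these layers are trained on the sub-models' outputs.

```python
import math
import random


def dense(x, weights, biases):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_out, n_in, rng):
    """Placeholder He-style normal initialization (not trained weights)."""
    std = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
w_hidden, b_hidden = init_layer(10, 6, rng)  # 6 inputs -> 10 hidden neurons
w_out, b_out = init_layer(2, 10, rng)        # 10 hidden -> 2 output neurons


def stacking_head(p_densenet, p_vgg, p_xception):
    """Concatenate the three 2-class outputs and classify them with the MLP."""
    merged = p_densenet + p_vgg + p_xception  # single vector with 6 values
    hidden = relu(dense(merged, w_hidden, b_hidden))
    return softmax(dense(hidden, w_out, b_out))


# Hypothetical sub-model probabilities for one CT image.
probs = stacking_head([0.1, 0.9], [0.3, 0.7], [0.2, 0.8])
print(probs)
```

Unlike voting or averaging, the weights of this head can give each sub-model a different influence on the final decision once they are trained.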

Fig. 2. Proposal of combining the output of the three referred CNNs.

In fact, we also tested all combinations of the CNNs: DenseNet169 + VGG16, DenseNet169 + Xception, VGG16 + Xception, and DenseNet169 + VGG16 + Xception. The best results were achieved using all three, as we present next. Moreover, as every CNN was already trained on the CT images in the previous stage, it is necessary to train only the new hidden and output layers.

4. Experiments

This section shows the experiment results, including how the three best CNNs were chosen and a comparison between the best state-of-art CNN against the ensemble models.

4.1. Setup

All experiments were carried out using a machine with the following configurations:

  • CPU: Intel Core i7 3770;

  • Memory: 8 GB Dual-Channel DDR3;

  • GPU: 4 GB Nvidia GeForce GTX 960;

  • Disk: 240 GB SSD Kingston.

The models were created using “include_top = False”, which removes the original classification layers of each architecture so that the convolutional base, initialized with ImageNet weights, can be connected to a new classifier. “ReLU” was used as the activation function and “he_normal” as the kernel initializer in the dense layers. Further, a dropout of 0.3 was also used in each dense layer. Lastly, the softmax function is used to separate the classes.

Adam was used as the optimizer with a learning rate of 0.002, categorical_crossentropy as the loss function, and accuracy as the metric. The models were then evaluated using k-fold cross-validation with k = 5, i.e., in each fold 80% of the images are used for training and 20% for testing. Training was run with fit_generator for 50 epochs for each architecture.
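To make the 80/20 proportion concrete, the following sketch generates the index splits of a 5-fold cross-validation over the 2477 images of the dataset. The shuffling, the seed, and the non-stratified assignment are assumptions, since the text does not detail how the folds were drawn, and the function name is hypothetical.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # assumed: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(2477, k=5))
for fold, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold}: {len(train)} training images, {len(test)} test images")
```

With 2477 images, each fold holds 495 or 496 test images, roughly 20% of the data, and the remaining images are used for training.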

The following Callbacks were also used to carry out actions at various training stages:

  • ModelCheckpoint: to save the model that presents the best loss during training;

  • EarlyStopping: to stop training if the network overfits;

  • ReduceLROnPlateau: to decrease the learning rate if the validation loss does not change.

4.2. Dataset

We used a dataset called the SARS-COV-2 CT-Scan [50], a public dataset of actual patients’ CTs consisting of 2477 computed tomography images used to detect lung diseases. From the dataset, 1250 are positive for COVID-19, and 1227 are negative diagnoses. Fig. 3, Fig. 4 show a dataset sample with images of people with positive and negative diagnoses, respectively.

Fig. 3. CTs positive to COVID-19.

Fig. 4. CTs negative to COVID-19.

4.3. Metrics

The first step in evaluating classification algorithms is determining the metrics, which are values based on four possibilities: true positives, true negatives, false positives, and false negatives. True positives and true negatives are correct classifications, i.e., COVID-19 and non-COVID-19 appropriately classified. On the other hand, false positives and false negatives are wrong classifications. Now, we can define the metrics we used.

Having presented the elements false and true positives/negatives, Eq. (1) expresses the first metric called precision. This metric is the proportion between true positives and true positives plus false positives. Thus, low precision can indicate that the number of correct classifications is too low or that the number of false positives is high.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

The following metric is the accuracy, as presented in Eq. (2), which is the percentage of accurate classifications. A low accuracy could indicate that the number of wrong classifications (false positives and false negatives) is high.

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{2}$$

A metric that shows whether the classification algorithm performs well at identifying the actual positive cases is called recall, as depicted in Eq. (3). In short, this metric is the proportion between true positives and true positives plus false negatives. It is used when trying to minimize the number of false negatives, which represent the worst scenario for the patient.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

A harmonic mean between precision and recall is expressed by the F1 Score, as presented in Eq. (4). Thus, the F1 Score presents a big picture of the classification’s performance because it considers both false positives and false negatives, i.e., it evaluates if the algorithm provides too many misclassifications.

$$\text{F1\_score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{4}$$
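Eqs. (1) to (4) translate directly into code. The sketch below computes the four metrics from the counts of true/false positives and negatives; the function name and the example counts are illustrative, and non-zero denominators are assumed.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, accuracy, recall, and F1 score as defined in Eqs. (1)-(4)."""
    precision = tp / (tp + fp)                           # Eq. (1)
    accuracy = (tp + tn) / (tp + fp + tn + fn)           # Eq. (2)
    recall = tp / (tp + fn)                              # Eq. (3)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (4)
    return {
        "precision": precision,
        "accuracy": accuracy,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the F1 score is a harmonic mean, it always lies between precision and recall and is pulled toward the smaller of the two.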

Next, we present how the sub-models were chosen to devise the ensemble algorithm.

4.4. Comparing state-of-art CNNs

The first step in this investigation was selecting the best three CNN architectures to build the ensemble model. Thus, we tested eleven state-of-art CNN architectures, as presented in Table 2, which shows the average of all metrics after training each CNN using transfer learning with fine-tuning, in order to choose the three best means. Further, the table presents the loss during the training stage. As we can see, DenseNet169 achieved the best accuracy, precision, recall, and F1_score of 96.4%, 96.3%, 96.4%, and 96.4%, respectively, with a loss of only 0.097. VGG16 and Xception follow the best result of DenseNet169. VGG16 achieved accuracy, precision, recall, and F1_score of 95.1%, 95.2%, 95.1%, and 95.1%, respectively, with a loss of 0.134. In turn, Xception reached 95.1% in all metrics, with a loss of 0.150.

Table 2.

Average results of state-of-art CNNs using transfer learning.

CNN model Precision Recall F1 Score Accuracy Loss
DenseNet169 0.963 0.964 0.964 0.964 0.097
DenseNet121 0.946 0.947 0.947 0.947 0.155
DenseNet201 0.949 0.948 0.948 0.948 0.138
VGG16 0.952 0.951 0.951 0.951 0.134
VGG19 0.947 0.946 0.946 0.946 0.134
Xception 0.951 0.951 0.951 0.951 0.150
ResNet50 0.943 0.942 0.942 0.943 0.149
ResNet50V2 0.940 0.940 0.939 0.939 0.163
InceptionV3 0.940 0.938 0.939 0.939 0.162
MobileNet 0.935 0.934 0.934 0.934 0.165
MobileNetV2 0.634 0.613 0.588 0.617 0.644

The training curves of DenseNet169 can be seen in Fig. 5, in which the training curve tends to be as expected, i.e., towards an accuracy of 1 (100%) and a loss of 0. On the other hand, the test curve tends to oscillate, which is not advisable because the training step can stop at a peak.

Fig. 5. Accuracy vs. loss per fold: DenseNet169.

Moreover, Table 3 shows a comparison of EnsembleDVX versus ET-NET [30], both using k-fold with k = 5, 100 epochs, and the same dataset, in which we can see a slight advantage for ET-NET. Furthermore, our standard deviation is smaller, meaning that the EnsembleDVX method tends to be more stable, i.e., less affected by data variation. On the other hand, a Mann–Whitney test at the 95% confidence level (α = 0.05) indicates that there is no evidence to reject the null hypothesis H0: μa = μb, i.e., the difference is not statistically significant. In other words, because there is no statistically significant difference, the choice would rely on the model less dependent on data.

Table 3.

Comparing EnsembleDVX vs. ET-NET.

Model K-Fold Precision Recall F1 Score Accuracy
ET-NET [30] Fold-1 98.181 98.18 98.18 98.18
Fold-2 97.976 97.98 97.98 97.98
Fold-3 97.608 97.56 97.61 97.57
Fold-4 98.381 98.38 98.38 98.38
Fold-5 96.889 96.74 96.89 96.76
Mean/SD 97.81±0.53 97.77±0.58 97.81±0.52 97.77±0.57

EnsembleDVX Fold-1 97.33 97.43 97.37 97.38
Fold-2 97.12 97.24 97.17 97.18
Fold-3 97.74 97.81 97.78 97.78
Fold-4 97.53 97.62 97.57 97.58
Fold-5 97.33 97.43 97.37 97.38
Mean/SD 97.41±0.0021 97.51±0.0019 97.45±0.00151 97.46±0.0020

p-value 0.2087 0.2948 0.2087 0.2948
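The comparison in Table 3 can be checked from the per-fold values. The sketch below is a minimal pure-Python Mann–Whitney U test, assuming a two-sided normal approximation with tie and continuity corrections; the text does not state which variant or software was used, so this choice and the function name are assumptions.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected)."""
    n1, n2 = len(sample_a), len(sample_b)
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])

    # Assign average ranks to tied values and accumulate the tie term.
    ranks = [0.0] * len(pooled)
    tie_term = 0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += (j - i) ** 3 - (j - i)
        i = j

    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    u = min(u_a, n1 * n2 - u_a)

    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - mean_u + 0.5) / math.sqrt(var_u)  # continuity correction
    p_value = min(1.0, math.erfc(-z / math.sqrt(2)))  # equals 2 * Phi(z)
    return u, p_value


# Per-fold values copied from Table 3.
et_net_precision = [98.181, 97.976, 97.608, 98.381, 96.889]
dvx_precision = [97.33, 97.12, 97.74, 97.53, 97.33]
et_net_accuracy = [98.18, 97.98, 97.57, 98.38, 96.76]
dvx_accuracy = [97.38, 97.18, 97.78, 97.58, 97.38]

u_prec, p_prec = mann_whitney_u(et_net_precision, dvx_precision)
u_acc, p_acc = mann_whitney_u(et_net_accuracy, dvx_accuracy)
print(f"precision: U = {u_prec}, p = {p_prec:.4f}")
print(f"accuracy:  U = {u_acc}, p = {p_acc:.4f}")
```

Under these assumptions, the p-values for precision and accuracy should come out close to the 0.2087 and 0.2948 listed in Table 3, both above 0.05.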

4.5. Comparing ensemble CNNs

Next, we trained and tested all combinations of the three best CNNs found previously. Note that to train the ensemble models, it is necessary to mark the sub-model (the CNNs) as non-trainable; consequently, their weights are not updated during the training stage. Moreover, Keras also requires that each layer has a unique name; thus, each layer’s name on each of the loaded models has been modified to indicate which member of the group they belong to. Once the sub-models were prepared, the stacking ensemble model was defined. The output layer of each sub-model was used as a separate input header for the new model.

The average results of the ensemble models can be seen in Table 4, where the combination of the three networks proved superior to the others, reaching an accuracy, precision, recall, and f-score of 97.7%, 97.7%, 97.7%, and 97.8%, respectively, with a loss of only 0.077. Furthermore, we add the DenseNet169 in the table to show how all ensemble combinations achieve better metrics than the referred CNN architecture.

Table 4.

Average results of the ensemble models.

CNN model Precision Recall F1 Score Accuracy Loss
DenseNet169 + VGG16 + Xception 0.977 0.977 0.978 0.977 0.077
DenseNet169 + VGG16 0.972 0.972 0.973 0.972 0.080
DenseNet169 + Xception 0.976 0.976 0.976 0.976 0.073
VGG16 + Xception 0.975 0.974 0.952 0.974 0.093
DenseNet169 0.963 0.964 0.964 0.964 0.097

Fig. 6 shows the training and test curves for EnsembleDVX. As we can see, the behavior of the curves was as expected in both. Furthermore, the testing curve was much more stable than that presented by DenseNet169, indicating a better training and test process.

Fig. 6. Accuracy vs. loss per fold: EnsembleDVX.

Furthermore, Table 5 shows the average confusion matrix comparing EnsembleDVX against DenseNet169, in which we can observe that EnsembleDVX was incorrect in only 11 cases: 7 cases were false positives for COVID-19 and, notably, only 4 cases of COVID-19 were misclassified as non-COVID-19. DenseNet169, in turn, was incorrect in 18 cases: 12 cases were false positives for COVID-19, and 6 cases of COVID-19 were erroneously classified as negatives.

Table 5.

Confusion matrix: EnsembleDVX vs. DenseNet169.

            EnsembleDVX            DenseNet169
            Positive   Negative    Positive   Negative
Positive    260        7           252        12
Negative    4          230         6          224

4.6. Discussion

The outbreak of COVID-19 resulted in a worldwide spread of panic, mainly because the symptoms can differ from person to person, varying from asymptomatic patients to a quick death. Thus, the early diagnosis of the disease is vital for trying treatment and isolating the patient to prevent the virus from spreading. In this context, it has been reported that chest CT could be used as a reliable and rapid approach for screening of COVID-19 [51], [52]. However, the diagnosis depends on a physician who has to detect and evaluate conditions in the CT such as ground-glass opacification, consolidation, bilateral involvement, and peripheral and diffuse distribution [53]. On the other hand, humans are susceptible to physical and psychological distress that can lead to a misdiagnosis. Therefore, it is essential to provide tools that make the diagnosis faster and more precise, like machine learning and convolutional neural networks in this particular segment of COVID-19 detection using image-based exams.

In this investigation, we proposed an ensemble model for detecting COVID-19 in chest CT scans using a popular dataset. We also showed that ensemble methods tend to present better results, as can be seen in Table 1, in which non-ensemble models reach at most 79% in precision, 96% in recall, 87% in F1-score, and 86% in accuracy. For ensemble methods, those results increase to 95.7%, 97%, 95.6%, and 96%, respectively.

The process of ensembling the CNNs started by comparing state-of-the-art architectures using transfer learning, in which DenseNet169, VGG16, and Xception presented the best results. The second stage tested every combination of these architectures, two-by-two and three-by-three, training only the MLP that performs the final classification, because each base model had already been trained in the first stage. The best results were obtained by combining the three selected CNNs, as shown in Table 4, reaching a precision of 97.7%, a recall of 97.7%, an F1-score of 97.8%, and an accuracy of 97.7%, surpassing the results reported in the literature reviewed in Section 2.
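As a small illustration of the second stage, the candidate ensembles can be enumerated with `itertools.combinations`. This is a sketch only: the function name `candidate_ensembles` is hypothetical, and it merely lists the model groupings; it does not train anything.

```python
from itertools import combinations

# The three base architectures selected in the first stage.
BASE_MODELS = ("DenseNet169", "VGG16", "Xception")


def candidate_ensembles(models, min_size=2):
    """Return every subset of `models` with at least `min_size` members."""
    return [
        combo
        for size in range(min_size, len(models) + 1)
        for combo in combinations(models, size)
    ]


ensembles = candidate_ensembles(BASE_MODELS)
for members in ensembles:
    print(" + ".join(members))
```

With three base models this yields three pairs and one triple, that is, four candidate ensembles in total.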

Moreover, the difference in stability between EnsembleDVX and DenseNet169 can be seen fold by fold in Figs. 5 and 6. The loss curve deserves special attention: once it decreases for EnsembleDVX, it does not rise again, as happens with DenseNet169. Furthermore, a Mann–Whitney test indicated that EnsembleDVX performed on par with ET-NET, a state-of-the-art ensemble algorithm.
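For readers unfamiliar with the Mann–Whitney test, the following sketch computes its U statistic for two small samples. The numbers are toy values invented purely for illustration; they are not results from this paper or from ET-NET, and the function name `mann_whitney_u` is hypothetical. In practice a library routine such as `scipy.stats.mannwhitneyu` would also provide the p-value.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two samples.

    U_a counts the pairs (a, b) in which a > b, with ties counting 0.5;
    U_b is the complementary count, so U_a + U_b = len(a) * len(b).
    """
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


# Toy scores, invented for illustration only (NOT results from the paper).
toy_scores_a = [0.90, 0.92, 0.94, 0.96, 0.98]
toy_scores_b = [0.91, 0.93, 0.95, 0.97, 0.99]

u_a, u_b = mann_whitney_u(toy_scores_a, toy_scores_b)
print(f"U_a={u_a}, U_b={u_b}")
```

When the two U values are close to each other, as in this toy case, the samples are heavily interleaved and the test is unlikely to report a significant difference.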

Regarding the limitations of the method, we investigated only two diagnostic classes because the dataset has no samples of other diseases, such as viral pneumonia, whose image characteristics are similar to those of COVID-19. Thus, we cannot rule out that the resulting model will produce a false positive when a viral pneumonia case is submitted to it. In this context, testing whether the ensemble model can distinguish between COVID-19 and viral pneumonia would be desirable.

5. Conclusions

We presented a new ensemble model, called EnsembleDVX, combining three different CNN models to support COVID-19 diagnosis using computerized tomography scans. The first step was selecting the three best CNNs using the k-fold resampling method with k=5. Results indicated that the best models were DenseNet169, VGG16, and Xception. Next, we needed to decide how to combine the classification of each model. Unlike other proposals that use parallel ensemble models and make decisions by voting, this work proposed an MLP that combines each model's classification to provide the final decision, similarly to Song's [27] work.
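The contrast between voting and an MLP combiner can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each base model outputs a single COVID-19 probability, it uses one hidden layer with ReLU units and a sigmoid output, and the weights are hand-picked hypothetical values rather than trained ones. The names `majority_vote` and `mlp_combiner` are also hypothetical.

```python
import math


def majority_vote(probs, threshold=0.5):
    """Hard voting: a model votes COVID-19 when its probability >= threshold."""
    votes = sum(p >= threshold for p in probs)
    return int(votes > len(probs) / 2)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mlp_combiner(probs, hidden_w, hidden_b, out_w, out_b):
    """One-hidden-layer MLP (ReLU hidden units, sigmoid output) over model outputs."""
    hidden = [
        max(0.0, sum(w * p for w, p in zip(row, probs)) + b)
        for row, b in zip(hidden_w, hidden_b)
    ]
    return sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)


# Hypothetical, hand-picked weights (a real combiner would learn them).
HIDDEN_W = [[2.0, 0.5, 0.5], [0.5, 0.5, 2.0]]
HIDDEN_B = [0.0, 0.0]
OUT_W = [1.5, 1.5]
OUT_B = -2.0

# Hypothetical outputs of three base models for one scan.
probs = [0.9, 0.4, 0.3]

vote = majority_vote(probs)
score = mlp_combiner(probs, HIDDEN_W, HIDDEN_B, OUT_W, OUT_B)
print(f"majority vote: {vote}, MLP combiner score: {score:.3f}")
```

With these illustrative weights the combiner gives more weight to one confident model than a simple vote would, which is the kind of behaviour a learned combiner can capture and a fixed voting rule cannot.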

Furthermore, all combinations of the three best state-of-the-art CNN architectures were tested. Although the ensembles combined two-by-two presented better results than DenseNet169, the best overall results were obtained by EnsembleDVX. All in all, the results showed that the ensemble models outperformed DenseNet169, the best single state-of-the-art CNN, with EnsembleDVX reaching a precision of 97.7%, a recall of 97.7%, an F1-score of 97.8%, and an accuracy of 97.7%. Moreover, Fig. 6 showed that the EnsembleDVX model behaved much more stably in the training and test stages, with both curves (loss and accuracy) converging.

Future work includes adding viral pneumonia to the dataset in order to differentiate between COVID-19, viral pneumonia, and healthy lungs, and embedding the final model in a cloud service to be used by registered physicians.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The dataset is publicly available.

References

  • 1. World Health Organization. 2022. Weekly epidemiological update on COVID-19 - 27 July 2022. https://www.who.int/publications/m/item/weekly-epidemiological-update-on-covid-19---27-july-2022. Accessed: Aug-05-2022.
  • 2. Gopatoti A., Vijayalakshmi P. CXGNet: A tri-phase chest X-ray image classification for COVID-19 diagnosis using deep CNN with enhanced grey-wolf optimizer. Biomed Signal Process Control. 2022;77. doi: 10.1016/j.bspc.2022.103860.
  • 3. Amaro E., Jr., Yamashita H. Aspectos básicos de tomografia computadorizada e ressonância magnética. Braz J Psychiatry. 2001;23:2–3. URL: http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1516-44462001000500002&nrm=iso.
  • 4. Singh M., Bansal S., Ahuja S., Dubey R.K., Panigrahi B.K., Dey N. Transfer learning–based ensemble support vector machine model for automated COVID-19 detection using lung computerized tomography scan data. Med Biol Eng Comput. 2021;59:825–839. doi: 10.1007/s11517-020-02299-2.
  • 5. Polikar R. In: Ensemble machine learning: Methods and applications. Zhang C., Ma Y., editors. Springer US; Boston, MA: 2012. Ensemble learning; pp. 1–34.
  • 6. Russakovsky O., Deng J., Su H., Krause J., Satheesh S., Ma S., Huang Z., Karpathy A., Khosla A., Bernstein A.C., Fei-Fei L. ImageNet large scale visual recognition challenge. Int J Comput Vis. 2015;115:211–252.
  • 7. Manjunath R.V., Kwadiki K. Automatic liver and tumour segmentation from CT images using deep learning algorithm. Results Control Optim. 2022;6. doi: 10.1016/j.rico.2021.100087. URL: https://www.sciencedirect.com/science/article/pii/S2666720721000497.
  • 8. Xu Z., Ren H., Zhou W., Liu Z. ISANET: Non-small cell lung cancer classification and detection based on CNN and attention mechanism. Biomed Signal Process Control. 2022;77.
  • 9. Yang X., He X., Zhao J., Zhang Y., Zhang S., Xie P. 2020. COVID-CT-dataset: A CT scan dataset about COVID-19.
  • 10. Li L., Qin L., Xu Z., Yin Y., Wang X., Kong B., Bai J., Lu Y., Fang Z., Song Q., Cao K., Liu D., Wang G., Xu Q., Fang X., Zhang S., Xia J., Xia J. Using artificial intelligence to detect COVID-19 and community-acquired pneumonia based on pulmonary CT: Evaluation of the diagnostic accuracy. Radiology. 2020;296(2):E65–E71. doi: 10.1148/radiol.2020200905.
  • 11. Ying S., Zheng S., Li L., Zhang X., Zhang X., Huang Z., Chen J., Zhao H., Wang R., Chong Y., Shen J., Zha Y., Yang Y. Deep learning enables accurate diagnosis of novel coronavirus (COVID-19) with CT images. MedRxiv. 2020. doi: 10.1101/2020.02.23.20026930.
  • 12. Xu X., Jiang X., Ma C., Du P., Li X., Lv S., Yu L., Ni Q., Chen Y., Su J., Lang G., Li Y., Zhao H., Liu J., Xu K., Ruan L., Sheng J., Qiu Y., Wu W., Liang T., Li L. A deep learning system to screen novel coronavirus disease 2019 pneumonia. Engineering. 2020;6(10):1122–1129. doi: 10.1016/j.eng.2020.04.010.
  • 13. Ashour A.S., Eissa M.M., Wahba M.A., Elsawy R.A., Elgnainy H.F., Tolba M.S., Mohamed W.S. Ensemble-based bag of features for automated classification of normal and COVID-19 CXR images. Biomed Signal Process Control. 2021;68. doi: 10.1016/j.bspc.2021.102656.
  • 14. Zheng X., Chen W., You Y., Jiang Y., Li M., Zhang T. Ensemble deep learning for automated visual classification using EEG signals. Pattern Recognit. 2020;102.
  • 15. Tang J., Su Q., Su B., Fong S., Cao W., Gong X. Parallel ensemble learning of convolutional neural networks and local binary patterns for face recognition. Comput Methods Programs Biomed. 2020;197. doi: 10.1016/j.cmpb.2020.105622.
  • 16. Brunese L., Mercaldo F., Reginelli A., Santone A. An ensemble learning approach for brain cancer detection exploiting radiomic features. Comput Methods Programs Biomed. 2020;185:105134. doi: 10.1016/j.cmpb.2019.105134.
  • 17. Sudharson S., Kokil P. An ensemble of deep neural networks for kidney ultrasound image classification. Comput Methods Programs Biomed. 2020;197. doi: 10.1016/j.cmpb.2020.105709.
  • 18. He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition. 2016 IEEE conference on computer vision and pattern recognition; CVPR; 2016. pp. 770–778.
  • 19. Sandler M., Howard A., Zhu M., Zhmoginov A., Chen L. 2018 IEEE/CVF conference on computer vision and pattern recognition. 2018. MobileNetV2: Inverted residuals and linear bottlenecks; pp. 4510–4520.
  • 20. Zhang X., Zhou X., Lin M., Sun J. 2018 IEEE/CVF conference on computer vision and pattern recognition. 2018. ShuffleNet: An extremely efficient convolutional neural network for mobile devices; pp. 6848–6856.
  • 21. Das A., Ghosh S., Thunder S., Dutta R., Agarwal S., Chakrabarti A. Automatic COVID-19 detection from X-ray images using ensemble learning with convolutional neural network. Pattern Anal Appl. 2021.
  • 22. Rahimzadeh M., Attar A. A modified deep convolutional neural network for detecting COVID-19 and pneumonia from chest X-ray images based on the concatenation of Xception and ResNet50V2. Inform Med Unlocked. 2020;19. doi: 10.1016/j.imu.2020.100360.
  • 23. Islam M.Z., Islam M.M., Asraf A. A combined deep CNN-LSTM network for the detection of novel coronavirus (COVID-19) using X-ray images. Inform Med Unlocked. 2020;20. doi: 10.1016/j.imu.2020.100412.
  • 24. Hamza A., Attique Khan M., Wang S., Alqahtani A., Alsubai S., Binbusayyis A., Hussein H.S., Martinetz T.M., Alshazly H. COVID-19 classification using chest X-ray images: A framework of CNN-LSTM and improved max value moth flame optimization. Front Public Health. 2022;10. doi: 10.3389/fpubh.2022.948205.
  • 25. Hamza A., Attique Khan M., Wang S., Alhaisoni M., Alharbi M., Hussein H.S., Alshazly H., Kim Y.J., Cha J. COVID-19 classification using chest X-ray images based on fusion-assisted deep Bayesian optimization and Grad-CAM visualization. Front Public Health. 2022;10. doi: 10.3389/fpubh.2022.1046296.
  • 26. Mobiny A., Cicalese P.A., Zare S., Yuan P., Abavisani M., Wu C.C., Ahuja J., de Groot P.M., Van Nguyen H. 2020. Radiologist-level COVID-19 detection using CT scans with detail-oriented capsule networks. arXiv e-prints.
  • 27. Song Y., Zheng S., Li L., Zhang X., Zhang X., Huang Z., Chen J., Wang R., Zhao H., Chong Y., Shen J., Zha Y., Yang Y. Deep learning enables accurate diagnosis of novel coronavirus (COVID-19) with CT images. IEEE/ACM Trans Comput Biol Bioinform. 2021;18(6):2775–2780. doi: 10.1109/TCBB.2021.3065361.
  • 28. Foysal M., Aowlad Hossain A.B.M. COVID-19 detection from chest CT images using ensemble deep convolutional neural network. 2021 2nd international conference for emerging technology; INCET; 2021. pp. 1–6.
  • 29. Ali Ahmed S.A., Yavuz M.C., Şen M.U., Gülşen F., Tutar O., Korkmazer B., Samancı C., Şirolu S., Hamid R., Eryürekli A.E., Mammadov T., Yanikoglu B. Comparison and ensemble of 2D and 3D approaches for COVID-19 detection in CT images. Neurocomputing. 2022;488:457–469. doi: 10.1016/j.neucom.2022.02.018.
  • 30. Kundu R., Singh P.K., Ferrara M., Ahmadian A., Sarkar R. ET-NET: an ensemble of transfer learning models for prediction of COVID-19 infection through chest CT-scan images. Multimedia Tools Appl. 2022;81:31–50. doi: 10.1007/s11042-021-11319-8.
  • 31. Kundu R., Basak H., Singh P., Ahmadian A., Ferrara M., Sarkar R. Fuzzy rank-based fusion of CNN models using Gompertz function for screening COVID-19 CT-scans. Sci Rep. 2021;11:14133. doi: 10.1038/s41598-021-93658-y.
  • 32. Kini A.S., Reddy N.G., Kaur M., Satheesh S., Singh J., Martinetz T., Alshazly G.H. Ensemble deep learning and internet of things-based automated COVID-19 diagnosis framework. Contrast Media Mol Imaging. 2022;2022. doi: 10.1155/2022/7377502.
  • 33. Alshazly H., Linse C., Abdalla M., Barth E., Martinetz T. COVID-Nets: deep CNN architectures for detecting COVID-19 using chest CT scans. PeerJ Comput Sci. 2021;7:e655. doi: 10.7717/peerj-cs.655.
  • 34. Subramanian N., Elharrouss O., Al-Maadeed S., Chowdhury M. A review of deep learning-based detection methods for COVID-19. Comput Biol Med. 2022;143. doi: 10.1016/j.compbiomed.2022.105233. URL: https://www.sciencedirect.com/science/article/pii/S0010482522000257.
  • 35. Portela E.P., Cortes O.A.C., da Silva J.C. 22nd international conference on intelligent systems design and applications. 2022. A rapid review on ensemble algorithms for COVID-19 classification using image-based exams.
  • 36. Beysolow II T. Apress; 2017. Introduction to deep learning using R: A step-by-step guide to learning and implementing deep learning models using R.
  • 37. Kaya A., Keceli A.S., Catal C., Yalic H.Y., Temucin H., Tekinerdogan B. Analysis of transfer learning for deep neural network based plant classification models. Comput Electron Agric. 2019;158:20–29. URL: https://www.sciencedirect.com/science/article/pii/S0168169918315308.
  • 38. Brown G. Encyclopedia of machine learning, Vol. 312. 2018. Ensemble learning; pp. 15–19. URL: http://www.cs.man.ac.uk/~gbrown/research/brown10ensemblelearning.pdf, Visited: 05/10/2020.
  • 39. Faceli K., Lorena A.C., Gama J., Almeida T.A.d., Carvalho A.P.d.L.F. 2nd ed. LTC; 2021. Inteligência artificial: Uma abordagem de aprendizado de máquina.
  • 40. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–140.
  • 41. Freund Y., Schapire R.E., et al. ICML, Vol. 96. Citeseer; 1996. Experiments with a new boosting algorithm; pp. 148–156.
  • 42. Wolpert D.H. Stacked generalization. Neural Netw. 1992;5(2):241–259.
  • 43. Gama J., Brazdil P. Cascade generalization. Mach Learn. 2000;41:315–343.
  • 44. Kordik P., Cerny J., Fryda T. Discovering predictive ensembles for transfer learning and meta-learning. Mach Learn. 2018;107:177–207. doi: 10.1007/s10994-017-5682-0.
  • 45. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
  • 46. Rokach L. World Scientific; 2010. Pattern classification using ensemble methods, Vol. 75.
  • 47. Liang M., Chang T., An B., Duan X., Du L., Wang X., Miao J., Xu L., Gao X., Zhang L., et al. A stacking ensemble learning framework for genomic prediction. Front Genet. 2021;12. doi: 10.3389/fgene.2021.600040.
  • 48. Wang Y., Wang D., Geng N., Wang Y., Yin Y., Jin Y. Stacking-based ensemble learning of decision trees for interpretable prostate cancer detection. Appl Soft Comput. 2019;77:188–204. doi: 10.1016/j.asoc.2019.01.015.
  • 49. Yi H.-C., You Z., Wang M.-N., Guo Z.-H., Wang Y., Zhou J.-R. RPI-SE: a stacking ensemble learning framework for ncRNA-protein interactions prediction using sequence information. BMC Bioinformatics. 2020;21. doi: 10.1186/s12859-020-3406-0.
  • 50. Soares E., Angelov P., Biaso S., Froes M.H., Abe D.K. SARS-CoV-2 CT-scan dataset: A large dataset of real patients CT scans for SARS-CoV-2 identification. MedRxiv. 2020.
  • 51. Ai T., Yang Z., Hou H., Zhan C., Chen C., Lv W., Tao Q., Sun Z., Xia L. Correlation of chest CT and RT-PCR testing for coronavirus disease 2019 (COVID-19) in China: A report of 1014 cases. Radiology. 2020;296(2):E32–E40. doi: 10.1148/radiol.2020200642.
  • 52. Fang Y., Zhang H., Xie J., Lin M., Ying L., Pang P., Ji W. Sensitivity of chest CT for COVID-19: Comparison to RT-PCR. Radiology. 2020;296(2):E115–E117. doi: 10.1148/radiol.2020200432.
  • 53. Huang C., Wang Y., Li X., Ren L., Zhao J., Hu Y., Zhang L., Fan G., Xu J., Gu X., Cheng Z., Yu T., Xia J., Wei Y., Wu W., Xie X., Yin W., Li H., Liu M., Xiao Y., Gao H., Guo L., Xie J., Wang G., Jiang R., Gao Z., Jin Q., Wang J., Cao B. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet. 2020;395(10223):497–506. doi: 10.1016/S0140-6736(20)30183-5.


Articles from Results in Control and Optimization are provided here courtesy of Elsevier