Adaptive Feature Selection Guided Deep Forest for COVID-19 Classification With Chest CT

Liang Sun; Zhanhao Mo; Fuhua Yan; Liming Xia; Fei Shan; Zhongxiang Ding; Bin Song; Wanchun Gao; Wei Shao; Feng Shi; Huan Yuan; Huiting Jiang; Dijia Wu; Ying Wei; Yaozong Gao; He Sui; Daoqiang Zhang; Dinggang Shen

doi:10.1109/JBHI.2020.3019505

IEEE J Biomed Health Inform. 2020 Aug 26;24(10):2798–2805. doi: 10.1109/JBHI.2020.3019505


PMCID: PMC8545164 PMID: 32845849

Abstract

Chest computed tomography (CT) has become an effective tool to assist the diagnosis of coronavirus disease-19 (COVID-19). Due to the worldwide outbreak of COVID-19, computer-aided diagnosis techniques for COVID-19 classification based on CT images could largely alleviate the burden on clinicians. In this paper, we propose an Adaptive Feature Selection guided Deep Forest (AFS-DF) for COVID-19 classification based on chest CT images. Specifically, we first extract location-specific features from CT images. Then, in order to capture the high-level representation of these features with relatively small-scale data, we leverage a deep forest model to learn a high-level representation of the features. Moreover, we propose a feature selection method based on the trained deep forest model to reduce the redundancy of features, where the feature selection is adaptively incorporated with the COVID-19 classification model. We evaluated our proposed AFS-DF on a COVID-19 dataset with 1495 COVID-19 patients and 1027 patients with community-acquired pneumonia (CAP). The accuracy (ACC), sensitivity (SEN), specificity (SPE), AUC, precision and F1-score achieved by our method are 91.79%, 93.05%, 89.95%, 96.35%, 93.10% and 93.07%, respectively. Experimental results on the COVID-19 dataset suggest that the proposed AFS-DF achieves superior performance in COVID-19 vs. CAP classification, compared with 4 widely used machine learning methods.

Keywords: COVID-19 classification, deep forest, feature selection, chest CT

I. Introduction

Since December 2019, the outbreak of coronavirus disease-19 (COVID-19) [1], [2] has infected more than 20 million people worldwide and caused more than 800,000 deaths. The World Health Organization (WHO) declared COVID-19 a global health emergency on January 30, 2020 [3]. Chest computed tomography (CT) has been shown to be useful in assisting the clinical diagnosis of COVID-19 [4]–[9]. However, the rapid growth in the number of COVID-19 patients has resulted in a shortage of clinicians and radiologists. It is therefore highly desirable to develop automatic methods for computer-aided COVID-19 classification with chest CT images.

A few machine learning methods have been proposed for COVID-19 classification using chest CT images. For example, [10] employs a logistic regression method for COVID-19 classification using clinical and laboratory features, and [11], [12] use random forest models with handcrafted features for COVID-19 classification. Moreover, some deep learning based methods have been proposed for the diagnosis of COVID-19. For instance, [13] leverages a deep learning method to learn the feature representation of the chest CT image, and then uses the learned features for COVID-19 classification by combining a decision tree with the Adaboost algorithm. In addition, [14] employs an end-to-end network to map the CT images to the label space for COVID-19 identification.

In summary, the existing machine learning methods for COVID-19 diagnosis are mainly based on the handcrafted features or the learned image representation by deep neural networks. However, simply adopting handcrafted features cannot fully utilize the high-level information for COVID-19 classification, while the features learned by neural networks require great effort for parameter tuning with a small amount of medical image data.

To this end, in this paper, we propose a novel adaptive feature selection guided deep forest method that takes advantage of high-level deep features learned from a small amount of medical image data for the classification between COVID-19 and CAP. Specifically, as shown in Fig. 1, we first extract the location-specific features from the chest CT image. Then, a deep forest model [15] is introduced to learn the latent high-level representations of these features, which can effectively describe the high-level information within the extracted location-specific features using small-scale training data. Intuitively, feature selection can improve the performance of the classification task. Hence, in our study, we also introduce a task-driven feature selection method to adaptively reduce the redundancy of features. In particular, the feature selection operation discards a portion of unimportant features based on the feature importance calculated from the trained forests in each deep forest layer. In this way, the feature selection and classifier training are adaptively incorporated into a unified framework. Finally, the trained adaptive feature selection guided deep forest is used for COVID-19 prediction.

Fig. 1. — Pipeline of the proposed adaptive feature selection guided deep forest for COVID-19 vs. CAP classification. We first extract the location-specific features from chest CT images. Then, the proposed AFS-DF is leveraged to train the classifier based on the location-specific features. Finally, based on the location-specific features, the trained AFS-DF is adopted for COVID-19 identification task.

The major contributions of this paper are three-fold. First, a deep forest is leveraged to learn the high-level feature representation of the location-specific features from chest CT for the diagnosis of COVID-19. Second, we propose an adaptive feature selection method to select the discriminative features for the diagnosis of COVID-19. Third, the proposed method is evaluated on the collected COVID-19 dataset, which consists of 1495 COVID-19 patients and 1027 CAP patients. Experimental results demonstrate that our method achieves better classification performance than the comparison methods.

The rest of the paper is organized as follows. We first introduce the materials used in this study and the proposed adaptive feature selection guided deep forest in Section II. Then, in Section III, we present experimental settings and results. In Section IV, we study the influence of parameters in the proposed methods and present the limitations of the current study as well as possible future directions. We finally conclude this paper in Section V.

II. Materials and Method

In this section, we first introduce the dataset used in this study. Then, we present the feature extraction procedure for the location-specific features. Next, we describe the proposed adaptive feature selection guided deep forest (AFS-DF). Finally, we provide the implementation details for our proposed AFS-DF.

A. Materials

A total of 2522 chest CT images are used in our study, provided by China-Japan Union Hospital of Jilin University, Ruijin Hospital of Shanghai Jiao Tong University, Tongji Hospital of Huazhong University of Science and Technology, Shanghai Public Health Clinical Center of Fudan University, Hangzhou First People's Hospital of Zhejiang University, and Sichuan University West China Hospital. In this dataset, 1495 cases are confirmed COVID-19 cases diagnosed by positive nucleic acid testing. The other 1027 cases are from CAP patients. COVID-19 images were acquired from Jan. 9, 2020 to Feb. 14, 2020, and CAP images were obtained from Jul. 30, 2018 to Feb. 22, 2020. The demographic information of these 2522 subjects is summarized in Table I.

TABLE I. Demographic Information of the Studied 2522 Chest CT Scans From COVID-19 Dataset. M/F: Male/Female.

Category	Scan #	Gender (M/F)
COVID-19	1495
CAP	1027


All patients underwent thin-section chest CT scans. Specifically, the CT scanners include uCT 780 from UIH; Optima CT520, Discovery CT750 and LightSpeed 16 from GE; Aquilion ONE from Toshiba; SOMATOM Force from Siemens; and SCENARIA from Hitachi. The CT protocol includes: 120 kV, reconstructed slice thickness ranging from 0.625 to 2 mm, with breath hold at full inspiration. All images are de-identified before being sent for analysis. The study is approved by the Institutional Review Boards of the participating institutes. Written informed consent is waived due to the retrospective nature of the study.

We evaluate the COVID-19 vs. CAP classification using 5-fold cross-validation on the collected chest CT images. Specifically, all subjects are randomly partitioned into 5 subsets: the first four subsets contain 504 subjects each, and the last subset contains 506 subjects. Each subset is selected in turn as the testing set (containing 504 or 506 subjects), while the remaining four subsets (containing 2018 or 2016 subjects, respectively) are treated as the training set. The training set is further divided into five subsets for an inner 5-fold cross-validation that chooses the hyper-parameters of the proposed method and the competing methods. Then, we use all training data with the chosen hyper-parameters to train the model. Finally, we test the trained model on the testing set.
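The partition described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `k_fold_indices` and `nested_splits`, the seeds, and the rule that the last fold absorbs the remainder are assumptions chosen so that the fold sizes match the reported 504/506 split.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Shuffle 0..n_items-1 and cut it into k folds.

    The first k-1 folds hold n_items // k items each; the last fold
    takes whatever remains (an assumption made to reproduce 504/506).
    """
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    base = n_items // k
    folds = [order[i * base:(i + 1) * base] for i in range(k - 1)]
    folds.append(order[(k - 1) * base:])
    return folds


def nested_splits(n_subjects=2522, k=5, seed=0):
    """Yield (train_idx, test_idx, inner_folds) for each outer fold."""
    outer = k_fold_indices(n_subjects, k, seed)
    for t, test_idx in enumerate(outer):
        train_idx = [s for j, fold in enumerate(outer) if j != t for s in fold]
        # Inner folds are built from the training subjects only and are
        # meant for hyper-parameter selection; the outer test fold is
        # never touched at this stage.
        inner = [[train_idx[p] for p in fold]
                 for fold in k_fold_indices(len(train_idx), k, seed + t + 1)]
        yield train_idx, test_idx, inner


splits = list(nested_splits())
outer_test_sizes = [len(test) for _, test, _ in splits]
outer_train_sizes = [len(train) for train, _, _ in splits]
```

With 2522 subjects, `outer_test_sizes` is `[504, 504, 504, 504, 506]` and `outer_train_sizes` is `[2018, 2018, 2018, 2018, 2016]`, matching the sizes stated in the text.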

B. Feature Extraction

Similar to [12], we extract the location-specific features (i.e., infection locations and spreading patterns) to represent the chest CT images for diagnosis of COVID-19. Specifically, the chest CT images are first automatically segmented into infected lung regions and lung fields bilaterally by using VB-Net [16]. The infected lung regions are mainly related to mosaic sign, ground glass opacity (GGO), lesion-related signs (air bronchogram) and interlobular septal thickening. The lung fields include left lung, right lung, five lung lobes, and eighteen pulmonary segments. Then, we extract four kinds of location-specific handcrafted features, including volume, infected lesion number, histogram distribution and surface area from chest CT images. Meanwhile, we also extract the radiomics features for describing the CT images. More details are as follows.

1) Volume features: Based on the segmented infected lung regions, we extract the total volume of the infected region, and then calculate the percentage of the whole lung that is infected. Meanwhile, according to the lung field segmentation results, we further extract the infected volume and percentage in each lobe and each pulmonary segment, respectively. Since there is evidence that pneumonia caused by COVID-19 is more likely to occur in both the right and left lungs, we also calculate the infected lesion difference as well as the percentage difference between the left and right lungs.
2) Infected lesion number: In comparison to CAP, most COVID-19 infections involve both lungs with multifocal involvement [6], [7], and COVID-19 generally has concentrated infection lesions, while CAP lesions are small in volume and patchy in distribution [17]. Therefore, we calculate the total number of infected regions in the bilateral lungs, lung lobes, and pulmonary segments, respectively.
3) Histogram distribution: The predominant chest CT findings show that bilateral and peripheral GGO and consolidation are a radiologic hallmark of COVID-19 [18], [19]. GGO is a pattern of hazy increased lung opacity with preservation of bronchial and vascular margins, whereas consolidation is characterized by a homogeneous increase in lung parenchymal attenuation that obscures the margins of vessels and airway walls on CT images [17]. To capture the intensity distribution of the infected regions in chest CT images, we calculate the histogram features of the infected regions.
4) Surface area: A previous study [20] found that COVID-19 has a predominant distribution in the posterior and peripheral lung, and that the abnormalities of the lung parenchyma eventually spread to the central area and bilateral upper lobes [6]. Therefore, we construct the infection surface as well as the lung boundary surface. We further calculate the distance from each infection surface vertex to the nearest lung boundary surface, and categorize the distances into 5 ranges, i.e., 3, 6, 9, 12 and 15 voxels. For each range, the number of infection surface vertices within that range of distances to the lung wall is calculated. Furthermore, the percentage of infection surface vertices in each range, relative to the total number of infection surface vertices, is also considered.
5) Radiomics features: Radiomics features extracted from infected lesions, including intensity features (e.g., average gray-level intensity, range of gray values) and texture features (e.g., gray-level co-occurrence matrix, gray-level run-length matrix, gray-level size-zone matrix, and neighborhood gray-tone difference matrix), are used in our study.

Besides, we also add age and gender to the location-specific features for the diagnosis of COVID-19. In summary, a total of 239 feature dimensions are used in our study. More details of these features are presented in [12].
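As an illustration of the surface-area features in item 4, the following sketch counts infection surface vertices per distance range. It rests on assumptions the text does not settle: the five ranges are treated as disjoint bins with upper edges 3, 6, 9, 12 and 15 voxels, vertices farther than 15 voxels fall in no bin, and the function name `surface_distance_features` and the sample distances are hypothetical.

```python
def surface_distance_features(distances, edges=(3, 6, 9, 12, 15)):
    """Count vertices per distance bin and their share of all vertices.

    distances: distance (in voxels) from each infection surface vertex
               to the nearest lung boundary surface.
    edges:     upper edges of the bins (assumed disjoint: (0, 3], (3, 6], ...).
    """
    counts = [0] * len(edges)
    for d in distances:
        for k, upper in enumerate(edges):
            if d <= upper:
                counts[k] += 1
                break  # each vertex is counted in at most one bin
    total = len(distances)
    # Percentages are taken against all infection surface vertices,
    # including those beyond the last edge.
    fractions = [c / total if total else 0.0 for c in counts]
    return counts, fractions


# Hypothetical distances for six vertices.
counts, fractions = surface_distance_features([0.5, 2.0, 4.0, 7.5, 14.0, 20.0])
```

Under these assumptions each scan contributes ten numbers from this step (five counts and five percentages); a cumulative reading of the ranges would only require removing the `break`.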

C. Adaptive Feature Selection Guided Deep Forest

As mentioned in Section II-B, we extract location-specific features from the chest CT images. However, simply using these features cannot adequately describe the high-level information for COVID-19 classification. In this work, we propose an adaptive feature selection guided deep forest to learn the latent high-level representation of the extracted location-specific features, with an adaptive task-driven feature selection process, for the diagnosis of COVID-19. The architecture of the proposed adaptive feature selection guided deep forest is shown in Fig. 2.

As shown in Fig. 2, each layer of the proposed adaptive feature selection guided deep forest consists of $N$ independent random forests and a feature selection unit. Here, each random forest produces a probability distribution over COVID-19 and CAP (the yellow rectangle in Fig. 2). Then, the probability distribution vectors are concatenated with the input feature vector. To reduce the redundancy of the features, we further perform an adaptive feature selection operation. In particular, for each trained random forest, we calculate the feature importance of each feature within the input feature vector. Thus, we calculate the overall feature importance $s_j$ for the $j$-th feature as follows,

$$s_j = \frac{1}{N} \sum_{i=1}^{N} s_j^{(i)},$$

where $s_j^{(i)}$ is the feature importance of the $j$-th feature in the $i$-th random forest. We then discard the features with the lowest overall importance at a specified ratio. Hence, the feature selection and classifier training are adaptively incorporated into a unified framework, and the selected feature vector serves as the input of the next layer. Finally, we cascade multiple layers to learn the deep discriminative feature representation for the COVID-19 classification task.
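The selection step of one layer can be sketched as below. This is a hedged illustration rather than the authors' implementation: it assumes the overall importance is the mean of the per-forest importances, that ties are broken by feature order, and that the number of discarded features is the discard ratio times the feature count, rounded down. The name `select_features` and the toy importances are hypothetical.

```python
def select_features(importances, discard_ratio=0.2):
    """Average per-forest importances and drop the least important features.

    importances:   importances[i][j] is the importance of feature j in forest i.
    discard_ratio: fraction of features to discard in this layer.
    Returns (overall importance per feature, indices of kept features).
    """
    n_forests = len(importances)
    n_features = len(importances[0])
    overall = [sum(forest[j] for forest in importances) / n_forests
               for j in range(n_features)]
    n_discard = int(n_features * discard_ratio)  # rounded down (assumption)
    ranked = sorted(range(n_features), key=lambda j: overall[j])  # ascending
    dropped = set(ranked[:n_discard])
    kept = [j for j in range(n_features) if j not in dropped]  # original order
    return overall, kept


# Toy example: 2 forests, 5 features, 20% discarded (one feature).
toy = [[0.10, 0.40, 0.05, 0.25, 0.20],
       [0.30, 0.20, 0.05, 0.25, 0.20]]
overall, kept = select_features(toy, discard_ratio=0.2)
```

In the toy example feature 2 has the lowest mean importance and is removed, so `kept` is `[0, 1, 3, 4]`; the retained columns would then be passed to the next cascade layer.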

D. Implementation

As shown in Fig. 2, each layer of the proposed adaptive feature selection guided deep forest employs an XGBoost model [21] with 20 trees, a random forest [22] with 20 trees, and two extremely randomized trees models [23] with 20 and 50 trees, respectively. We empirically set the feature discard ratio to 0.2 in our study. In the training stage, we feed the extracted location-specific features to the adaptive feature selection guided deep forest. The training set is further divided into 5 subsets for 5-fold cross-validation, so that the number of cascade layers and the number of selected features are automatically determined by the cross-validation strategy. Hence, the adaptive feature selection guided deep forest can adaptively train the feature selection and classification model in a task-driven manner. Notably, the AFS-DF is trained layer by layer.

In the testing stage, we feed the location-specific features of each test subject to the trained adaptive feature selection guided deep forest model. In the last layer, the $i$-th forest produces a probability distribution $p^{(i)}$ for the identification of COVID-19. For each subject, we use the following equation to ensemble the predicted values for the diagnosis of COVID-19,

$$p_c = \frac{1}{N} \sum_{i=1}^{N} p_c^{(i)},$$

where $p_c^{(i)}$ is the probability that the subject belongs to category $c$ (i.e., COVID-19 or CAP), as provided by the $i$-th forest in the last layer. Finally, we use the maximum a posteriori (MAP) criterion to obtain the label for each subject, i.e.,

$$\hat{y} = \arg\max_{c} \; p_c.$$
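The test-time ensemble can be illustrated with a short sketch. It assumes the per-forest class probabilities of the last layer are averaged with equal weights before the most probable class is taken; the function name `ensemble_predict` and the example probabilities are hypothetical.

```python
def ensemble_predict(forest_probs, classes=("COVID-19", "CAP")):
    """Average last-layer class probabilities and pick the most probable class.

    forest_probs: forest_probs[i][c] is the probability of class c from forest i.
    Returns (predicted label, averaged probability per class).
    """
    n_forests = len(forest_probs)
    averaged = [sum(p[c] for p in forest_probs) / n_forests
                for c in range(len(classes))]
    best = max(range(len(classes)), key=lambda c: averaged[c])  # MAP decision
    return classes[best], averaged


# Hypothetical outputs of the four forests in the last layer for one subject.
label, averaged = ensemble_predict([[0.7, 0.3], [0.6, 0.4], [0.4, 0.6], [0.8, 0.2]])
```

Here three of the four forests favour COVID-19 and the averaged distribution is roughly (0.625, 0.375), so the subject is labelled COVID-19 even though one forest disagrees.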

III. Experiment

In this section, we first illustrate the competing methods and experimental settings in our study. Then, we present experimental results achieved by different methods on the chest CT images with 1495 patients of COVID-19 and 1027 patients of CAP.

A. Competing Methods

In our experiments, we compare our proposed AFS-DF with the following four widely adopted machine learning methods.

1) Logistic Regression (LR): A logistic regression method is employed for COVID-19 classification using the extracted location-specific features.
2) Support Vector Machine (SVM): The extracted location-specific features are fed into an SVM [24] classifier with a radial basis function kernel, whose hyper-parameters are chosen via cross-validation.
3) Random Forests (RF): The random forest classifier is applied to the location-specific features for COVID-19 classification, and the number of trees in the random forest is set to 500 via cross-validation.
4) Neural Networks (NN): In this method, a fully connected neural network is employed for COVID-19 classification. Specifically, we empirically set the mini-batch size to 64, the number of epochs to 100, and the learning rate to 0.001.

B. Experimental Settings

For all location-specific features extracted from chest CT images, we first normalize each feature to zero mean and unit standard deviation. In order to measure the classification performance of different methods, six evaluation metrics are adopted, including classification accuracy (ACC), sensitivity (SEN), specificity (SPE), the Area Under the receiver operating characteristic Curve (AUC) [25], precision and F1-score. Here, the ACC, SEN, SPE, precision and F1-score are defined as,

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{SEN} = \frac{TP}{TP + FN}, \quad \mathrm{SPE} = \frac{TN}{TN + FP},$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{SEN}}{\mathrm{Precision} + \mathrm{SEN}},$$

where TP, TN, FP and FN represent True Positive, True Negative, False Positive, and False Negative, respectively.
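For reference, the five count-based metrics can be computed from a confusion matrix as in the sketch below. The standard definitions are used, with COVID-19 taken as the positive class; the function name `classification_metrics` and the example counts are hypothetical and are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute ACC, SEN, SPE, precision and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)          # sensitivity (recall on the positive class)
    spe = tn / (tn + fp)          # specificity (recall on the negative class)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sen / (precision + sen)
    return {"ACC": acc, "SEN": sen, "SPE": spe, "Precision": precision, "F1": f1}


# Hypothetical counts for one test fold (not taken from the paper).
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

With these made-up counts, ACC is 0.85, SEN is 0.90 and SPE is 0.80. The sketch does not guard against empty classes, so a fold without positives or negatives would raise a division error.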

C. Classification Performance

We evaluate the COVID-19 vs. CAP classification on the collected chest CT images dataset. Table 2 shows the quantitative results (i.e., ACC, SEN, SPE, AUC, Precision and F1-score) achieved by different methods.

TABLE II. Performance of COVID-19 vs. CAP Classification Achieved by LR, SVM, RF, NN and AFS-DF. The Terms a and b in “a ± b” Denote the Mean and Standard Deviation, Respectively.

Method	ACC (%)	SEN (%)	SPE (%)	AUC (%)	Precision (%)	F1-score(%)
LR
SVM
RF
NN
AFS-DF


From Table 2, we can observe that our AFS-DF achieves the best results in terms of ACC, SEN, SPE, AUC, precision and F1-score. In particular, AFS-DF achieves the highest classification accuracy (91.79%), improving on LR, SVM, RF and NN. We also perform a t-test between AFS-DF and the competing methods. AFS-DF shows significant improvement (p < 0.05) over LR (p-value = 0.0034), SVM (p-value = 0.0009), RF (p-value = 0.0023) and NN (p-value = 0.0275). A possible reason for these improvements is that AFS-DF not only leverages the high-level feature representation learned by the deep forest, but also retains the discriminative features through the adaptive feature selection process.

The deep methods (i.e., NN and AFS-DF) show better classification accuracy in the diagnosis of COVID-19 than the conventional classifiers (i.e., LR, SVM and RF). A possible reason is that the deep models can learn a high-level feature representation, which can boost the classification performance. It is worth noting that, because COVID-19 is highly contagious, a higher SEN is of practical value for timely diagnosis and for preventing the spread of the disease. The SEN achieved by our AFS-DF for COVID-19 vs. CAP is 93.05%, which is better than that of the baseline methods. These results imply that the adaptive feature selection guided deep forest model can improve the ability to identify COVID-19.

We show the misclassified cases in Fig. 3. As shown in Fig. 3, the false negative cases are mostly patients with small abnormal regions. In contrast, the false positive cases are mostly patients with large abnormal regions.

Fig. 3. — Illustration of false negative cases and false positive cases.

As shown in Fig. 4, our proposed AFS-DF method produces the best classification performance when compared with the baseline methods. These results further validate that using the high-level representation and the adaptive feature selection strategy can improve the performance of COVID-19 vs. CAP classification. The top 30 important features among the extracted location-specific features are shown in Fig. 5. We calculate the average normalized feature importance of the location-specific features over the last layers of the 5 trained AFS-DF models (Fig. 5(a)) and the normalized feature importance of the location-specific features in each trained AFS-DF model (Fig. 5(b)–(f)). Fig. 5 shows that the surface area features are the more important ones on each fold for COVID-19 classification.

Fig. 5. The top 30 important features of location-specific features in AFS-DF. (a) Average importance over all trained AFS-DF models and (b)–(f) the importance in each trained AFS-DF model.

IV. Discussion

In this section, we first compare our proposed adaptive feature selection guided deep forest with several state-of-the-art methods for COVID-19 classification. Then, we study the influence of the adaptive feature selection strategy and the selected deep features that are learned by adaptive feature selection guided deep forest. Finally, we present the limitations of this work as well as possible future research directions.

A. Comparison With State-of-the-Art Methods

Since several attempts have been made at COVID-19 classification, we now compare our proposed AFS-DF with state-of-the-art methods. [13] employs a convolutional neural network (CNN) to extract features from chest CT images, and then combines a decision tree and Adaboost to produce the classification result for COVID-19 vs. typical viral pneumonia (using a total of 670 CT scans). [14] leverages a ResNet [27] to distinguish COVID-19, Influenza-A viral pneumonia, and healthy cases (using a total of 618 CT scans). [10] extracts clinical and laboratory features and uses a logistic regression model to classify non-severe and severe patients (using a total of 196 CT scans). [11] extracts quantitative features, and introduces a random forest method to assess the severity of COVID-19 patients (using a total of 176 CT scans). [12] extracts location-specific handcrafted features, and proposes an infection size-adaptive random forest for COVID-19 classification (using a total of 2685 CT scans). [26] integrates a CNN and a traditional machine learning method for COVID-19 vs. SARS-Cov-2 classification (using a total of 1324 CT scans).

The results are reported in Table 3. One can observe from Table 3 that the proposed AFS-DF shows competitive classification performance for COVID-19 patient identification. The underlying reason could be that our AFS-DF can utilize the high-level discriminative representation of the extracted features. It is worth noting that the deep forest-based method can handle small-scale data, and requires fewer computational resources than the deep neural network-based methods.

TABLE III. Comparison With State-of-the-Art Methods for COVID-19 Classification.

Method	ACC (%)	SEN (%)	SPE (%)	AUC (%)
Wang et al. [13]	73.1	67.0	74.0	78.0
Xu et al. [14]	86.7
Shi et al. [10]	89.0	82.2	82.8	89.0
Tang et al. [11]	87.5	93.3	74.5	91.0
Shi et al. [12]	87.9	90.7	83.3	94.2
Mei et al. [26]	83.5	84.3	82.8	92
AFS-DF	91.79	93.05	89.95	96.35


B. Influence of Adaptive Feature Selection

To study the effectiveness of the proposed adaptive feature selection, we compare it to two feature selection methods (i.e., Lasso and ElasticNet [28]) and to a variant of our method (i.e., Deep Forest (DF)). Lasso is used to select a discriminative subset of features from the feature vector by using an $\ell_1$-norm sparsity constraint. The parameter for the sparsity constraint in this method is set to 0.001 by cross-validation. In ElasticNet, an $\ell_1$-norm is leveraged to reduce the dimension of the extracted features, and an $\ell_2$-norm is further introduced into the classification model to ensure the smoothness of the linear model. The parameters for the sparsity constraint and the smoothness constraint in ElasticNet are set to 0.001 and 0.1 by cross-validation, respectively. DF is a variant of the proposed AFS-DF that employs the same initial architecture as our AFS-DF model but without the feature selection block. Hence, the predictions of each layer are concatenated with all original features and fed into the next layer for prediction. Note that this variant uses the same parameter settings as AFS-DF for training and testing in our study. The experimental results are reported in Table 4.
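For readers less familiar with the two baselines, their objectives can be written in a generic form. This is a sketch under assumptions: the text does not state the loss or how the penalties are scaled, so $\mathcal{L}(w)$ stands for an unspecified data-fitting loss over the linear weights $w$, and the symbols $\lambda_1$ and $\lambda_2$ are introduced here only to name the sparsity and smoothness parameters.

```latex
% Lasso: data-fitting loss plus an l1 sparsity penalty
\min_{w} \; \mathcal{L}(w) + \lambda_1 \lVert w \rVert_1
% ElasticNet: the same loss with both l1 (sparsity) and l2 (smoothness) penalties
\min_{w} \; \mathcal{L}(w) + \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2
```

Under this reading, the reported settings correspond to $\lambda_1 = 0.001$ for Lasso, and $\lambda_1 = 0.001$, $\lambda_2 = 0.1$ for ElasticNet. Features whose weights are driven to zero by the $\ell_1$ term are the ones discarded.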

TABLE IV. Performance of COVID-19 vs. CAP Classification by Using Lasso, ElasticNet, DF and AFS-DF.

Method	ACC (%)	SEN (%)	SPE (%)	Precision (%)	F1-score (%)
Lasso
ElasticNet
DF
AFS-DF


As shown in Table 4, the proposed AFS-DF achieves the best classification performance when compared with Lasso, ElasticNet and DF. In particular, compared with the feature selection methods (i.e., Lasso and ElasticNet), the proposed AFS-DF achieves better performance by using the high-level feature representation. Meanwhile, by using the adaptive feature selection operation, AFS-DF achieves better performance than DF. These results imply the effectiveness of the proposed AFS-DF. Besides, as can be seen from Table 2, Table 3 and Table 4, the feature selection methods (i.e., Lasso, ElasticNet and the proposed AFS-DF) show competitive results. A possible reason is that features extracted by deep learning methods, or handcrafted features, may degrade classification performance due to the heterogeneity between the extracted features and the subsequent traditional classification algorithms (e.g., decision tree, Adaboost), whereas feature selection methods can select the features that are more relevant to COVID-19 classification for the subsequent classification task.

C. Influence of Features

To evaluate the effectiveness of the selected deep features (e.g. the features used in the last layer of AFS-DF) for COVID-19 vs. CAP classification, we further develop three methods based on LR, SVM and RF by using the selected deep features (i.e., AFSDF-LR, AFSDF-SVM and AFSDF-RF). Meanwhile, we also compare these methods with LR, SVM and RF based on the features used in deep forest (i.e., DF-LR, DF-SVM and DF-RF). We evaluate these nine methods for COVID-19 vs. CAP classification, with the results reported in Table 5.

TABLE V. Performance of COVID-19 vs. CAP Classification by Using LR, DF-LR, AFSDF-LR, SVM, DF-SVM, AFSDF-SVM, RF, DF-RF and AFSDF-RF.

Method	ACC (%)	SEN (%)	SPE (%)	Precision (%)	F1-score (%)
LR
DF-LR
AFSDF-LR
SVM
DF-SVM
AFSDF-SVM
RF
DF-RF
AFSDF-RF


As can be seen from Table 5, the proposed AFSDF-LR, AFSDF-SVM and AFSDF-RF outperform their counterparts (i.e., LR, SVM and RF) on most evaluation metrics. Of note, the proposed methods consistently achieve better results in terms of ACC and SEN: AFSDF-LR, AFSDF-SVM and AFSDF-RF improve on LR, SVM and RF, respectively, in both ACC and SEN for COVID-19 vs. CAP classification. A possible reason is that the deep features selected by AFS-DF carry high-level and discriminative information. Hence, the conventional machine learning methods (i.e., LR, SVM and RF) can use these selected deep features to improve the performance of the COVID-19 classification task. In addition, the AFSDF-based methods achieve better classification results than the DF-based methods (i.e., DF-LR, DF-SVM and DF-RF).

Besides, in Fig. 6, we plot the original location-specific features, the features of the last layer in DF, and the features of the last layer in AFS-DF, after dimensionality reduction by t-SNE [29]. As shown in Fig. 6, our AFS-DF produces more discriminative features for COVID-19 vs. CAP classification. Although the deep forest without feature selection can learn high-level features, the learned features are still difficult to classify.

Fig. 6. — Visual illustration of original location-specific features (a), the features of the last layer in DF (b), and the features of the last layer in AFS-DF (c), which uses the t-SNE to do the feature dimensionality reduction.

D. Limitations and Future Work

There are still several limitations in the current study. First, the adaptive feature selection guided deep forest is validated only on the COVID-19 vs. CAP classification task. In the future, we plan to collect more data covering multiple diseases, and to apply our proposed method to other COVID-19 classification tasks (e.g., COVID-19 vs. Influenza-A viral pneumonia and CAP, severe patients vs. non-severe patients, etc.). Second, the current work extracts handcrafted features based on prior knowledge. In the future, features learned by deep learning methods could be fed into our proposed method for further performance improvement.

V. Conclusion

In this paper, we propose an adaptive feature selection guided deep forest for COVID-19 vs. CAP classification by using the chest CT images. Specifically, the AFS-DF uses the deep forest to learn the high-level representation based on the location-specific features. Meanwhile, an adaptive feature selection operation is employed to reduce the redundancy of features based on the trained forest. Experimental results on the collected COVID-19 dataset with 1495 COVID-19 cases and 1027 CAP cases show that our proposed AFS-DF approach can achieve superior performance on COVID-19 classification with chest CT images in comparison with several existing methods.

Funding Statement

This work was supported in part by the National Key Research and Development Program of China under Grants 2018YFC2001600 and 2018YFC2001602, in part by the National Natural Science Foundation of China under Grants 61876082, 61861130366, 61732006, 61902183, and 81871337, in part by the Royal Society-Academy of Medical Sciences Newton Advanced Fellowship under Grant NAF\R1\180371, in part by China Postdoctoral Science Foundation funded project under Grant 2019M661831, in part by Wuhan Science and technology program under Grant 2018060401011326, in part by Hubei Provincial Novel Pneumonia Emergency Science and Technology Project under Grant 2020FCA021, in part by Huazhong University of Science and Technology Novel Coronavirus Pneumonia Emergency Science and Technology Project under Grant 2020kfyXGYJ014, and in part by the Novel Coronavirus Special Research Foundation of the Shanghai Municipal Science and Technology Commission under Grant 20441900600.

Contributor Information

Liang Sun, Email: sunl@nuaa.edu.cn.

Zhanhao Mo, Email: mozhanhao@jlu.edu.cn.

Fuhua Yan, Email: yfh11655@rjh.com.cn.

Liming Xia, Email: xialiming2017@outlook.com.

Fei Shan, Email: shanfei_2901@163.com.

Zhongxiang Ding, Email: hangzhoudzx73@126.com.

Bin Song, Email: anicesong@vip.sina.com.

Wanchun Gao, Email: 13908272019@163.com.

Wei Shao, Email: 527606857@qq.com.

Feng Shi, Email: feng.shi@united-imaging.com.

Huan Yuan, Email: huan.yuan@united-imaging.com.

Huiting Jiang, Email: huiting.jiang@united-imaging.com.

Dijia Wu, Email: dijia.wu@united-imaging.com.

Ying Wei, Email: ying.wei@united-imaging.com.

Yaozong Gao, Email: yaozong.gao@united-imaging.com.

He Sui, Email: suihe910402@126.com.

Daoqiang Zhang, Email: dqzhang@nuaa.edu.cn.

Dinggang Shen, Email: dinggang.shen@gmail.com.

References

[1] Zu Z. Y. et al., “Coronavirus disease 2019 (COVID-19): A perspective from China,” Radiology, vol. 296, no. 2, pp. E15–E25, 2020, Art. no. 200490.
[2] Jeong E. K. et al., “Coronavirus disease-19: The first 7,755 cases in the Republic of Korea,” Osong Public Health and Research Perspectives, vol. 11, no. 2, 2020, doi: 10.24171/j.phrp.2020.11.2.05.
[3] Sohrabi C. et al., “World Health Organization declares global emergency: A review of the 2019 novel coronavirus (covid-19),” Int. J. Surg., vol. 76, pp. 71–76, 2020.
[4] Wu J. et al., “Chest CT findings in patients with corona virus disease 2019 and its relationship with clinical features,” Invest Radiol., vol. 55, no. 5, pp. 257–261, 2020.
[5] Fang Y. et al., “Sensitivity of chest CT for COVID-19: Comparison to RT-PCR,” Radiology, vol. 296, no. 2, pp. E115–E117, 2020.
[6] Li Y. and Xia L., “Coronavirus disease 2019 (COVID-19): Role of chest CT in diagnosis and management,” Amer. J. Roentgenol., vol. 214, no. 6, pp. 1280–1286, 2020.
[7] Chung M. et al., “CT imaging features of 2019 novel coronavirus (2019-nCoV),” Radiology, vol. 295, no. 1, pp. 202–207, 2020.
[8] Li M. et al., “Coronavirus disease (covid-19): Spectrum of CT findings and temporal progression of the disease,” Academic Radiol., vol. 27, no. 5, pp. 603–608, 2020.
[9] Long C. et al., “Diagnosis of the coronavirus disease (covid-19): rRT-PCR or CT?” Eur. J. Radiol., vol. 162, 2020, Art. no. 108961.
[10] Shi W. et al., “Deep learning-based quantitative computed tomography model in predicting the severity of COVID-19: A retrospective study in 196 patients,” 2020, doi: 10.2139/ssrn.3546089.
[11] Tang Z. et al., “Severity assessment of coronavirus disease 2019 (COVID-19) using quantitative features from chest CT images,” 2020, arXiv:2003.11988.
[12] Shi F. et al., “Large-scale screening of COVID-19 from community acquired pneumonia using infection size-aware classification,” 2020, arXiv:2003.09860.
[13] Wang S. et al., “A deep learning algorithm using CT images to screen for Corona Virus Disease (covid-19),” 2020, MedRxiv.
[14] Xu X. et al., “Deep learning system to screen coronavirus disease 2019 pneumonia,” 2020, arXiv:2002.09334.
[15] Zhou Z.-H. and Feng J., “Deep forest: Towards an alternative to deep neural networks,” in Proc. 26th Int. Joint Conf. Artif. Intell., 2017, pp. 3553–3559.
[16] Shan F. et al., “Lung infection quantification of COVID-19 in CT images with deep learning,” 2020, arXiv:2003.04655.
[17] Hansell D. M., Bankier A. A., MacMahon H., McLoud T. C., Muller N. L., and Remy J., “Fleischner Society: Glossary of terms for thoracic imaging,” Radiology, vol. 246, no. 3, pp. 697–722, 2008.
[18] Li X., Zeng X., Liu B., and Yu Y., “Covid-19 infection presenting with CT Halo Sign,” Radiol.: Cardiothor. Imag., vol. 2, no. 1, 2020, Paper e200026.
[19] Bernheim A. et al., “Chest CT findings in coronavirus disease-19 (COVID-19): Relationship to duration of infection,” Radiology, vol. 295, no. 3, 2020, Art. no. 200463.
[20] Song F. et al., “Emerging 2019 novel coronavirus (2019-nCoV) pneumonia,” Radiology, vol. 295, no. 1, pp. 210–217, 2020, Art. no. 200274.
[21] Chen T. and Guestrin C., “Xgboost: A scalable tree boosting system,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016, pp. 785–794.
[22] Liaw A. et al., “Classification and regression by randomForest,” R News, vol. 2, no. 3, pp. 18–22, 2002.
[23] Geurts P., Ernst D., and Wehenkel L., “Extremely randomized trees,” Mach. Learn., vol. 63, no. 1, pp. 3–42, 2006.
[24] Fan Y. et al., “Multivariate examination of brain abnormality using both structural and functional mri,” Neuroimage, vol. 36, no. 4, pp. 1189–1199, 2007.
[25] Fletcher G. S., Clinical Epidemiology: The Essentials. Baltimore, MD, USA: Williams & Wilkins, vol. 26, 1224–1228, 2019.
[26] Mei X. et al., “Artificial intelligence–enabled rapid diagnosis of patients with covid-19,” Nat. Med., pp. 1–5, 2020.
[27] He K., Zhang X., Ren S., and Sun J., “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[28] Zou H. and Hastie T., “Regularization and variable selection via the elastic net,” J. Royal Stat. Soc.: Series B (Stat. Methodol.), vol. 67, no. 2, pp. 301–320, 2005.
[29] Maaten L. V. D. and Hinton G., “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, pp. 2579–2605, 2008.

