Journal of Medical Imaging. 2021 Aug 18;8(4):044502. doi: 10.1117/1.JMI.8.4.044502

Lung nodule malignancy classification with weakly supervised explanation generation

Aniket Joshi a,*, Jayanthi Sivaswamy a, Gopal Datt Joshi b
PMCID: PMC8370883  PMID: 34423071

Abstract.

Purpose: Explainable AI aims to build systems that not only give high performance but also are able to provide insights that drive the decision making. However, deriving this explanation is often dependent on fully annotated (class label and local annotation) data, which are not readily available in the medical domain.

Approach: This paper addresses the above-mentioned aspects and presents an innovative approach to classifying a lung nodule in a CT volume as malignant or benign, and generating a morphologically meaningful explanation for the decision in the form of attributes such as nodule margin, sphericity, and spiculation. A deep learning architecture that is trained using a multi-phase training regime is proposed. The nodule class label (benign/malignant) is learned with full supervision and is guided by semantic attributes that are learned in a weakly supervised manner.

Results: Results of an extensive evaluation of the proposed system on the LIDC-IDRI dataset show good performance compared with state-of-the-art, fully supervised methods. The proposed model is able to label nodules (after full supervision) with an accuracy of 89.1% and an area under curve of 0.91 and to provide eight attributes scores as an explanation, which is learned from a much smaller training set. The proposed system’s potential to be integrated with a sub-optimal nodule detection system was also tested, and our system handled 95% of false positive or random regions in the input well by labeling them as benign, which underscores its robustness.

Conclusions: The proposed approach offers a way to address computer-aided diagnosis system design under the constraint of sparse availability of fully annotated images.

Keywords: CAD, lung nodule, malignancy, explainability

1. Introduction

Classification of lung nodules from a computed tomography (CT) volume as benign/malignant is a laborious and time-consuming task even for an experienced radiologist. Several automated solutions, including handcrafted feature-based and deep learning (DL)-based methods, have been explored to address this problem (Xie et al.,1 Zhao et al.,2 Causey et al.,3 Shen et al.,4 and Li et al.5).

The focus of most DL-based systems has been on improving the classification accuracy. While accuracy is a primary concern, another factor that is equally important in translating solutions from a research lab to the field (clinic, hospital, etc.) is the explainability of the classification outcome. This has not been examined sufficiently by the majority of classification systems in the literature. Thus, these systems are black boxes (with no explicit and declarative knowledge representation) without much ability to generate the underlying explanatory structures. The task that we address in this paper is the classification of a lung nodule in a CT volume as benign or malignant. In clinical practice, it has been suggested that the morphology of a nodule is a key indicator of its type (benign or malignant) (Snoeckx et al.6). Hence, a set of semantic features that can be used to characterize the appearance of a nodule in a CT volume was proposed in terms of the clarity of the nodule boundary (margin), shape (sphericity), intensity (subtlety or contrast), and composition (calcification). Eight such semantic features were scored by radiologists, and a system was trained to learn to automatically derive these scores in works by Chen et al.7 and Zou et al.8 These predicted scores were given as inputs to aid diagnostic decisions by experts. More recently, the problem of nodule classification has been cast as a multi-task problem, and DL solutions have been proposed to predict the class (benign/malignant) and scores (semantic features) for a nodule. The relationship between morphological attributes of the nodules and malignancy classification was studied by Hussein et al.,9 Chen et al.,10 and Botong et al.11 Hussein et al.9 used graph regularized sparse multi-task learning to fuse 3D CNN based attribute features and to stratify the malignancy of a lung nodule. 
Chen et al.10 proposed a multi-task learning scheme for effectively bridging the gap between the computational features (derived from deep models) and multiple clinical semantic features, which otherwise is one of the major bottlenecks of using computer-aided diagnosis (CAD) systems in a clinical setup. Botong et al.11 proposed an interpretable multi-task CNN that jointly learns three tasks: nodule segmentation, malignancy prediction, and semantic high-level attribute scoring. They claim that the combined learning of the tasks improved the performance of the individual tasks. Liu et al.12 proposed a solution for malignancy classification with attribute score regression using a Siamese network design combined with a margin ranking loss and explored the relation between the nodule malignancy and attribute scores in a cause-and-effect manner. Shen et al.13 proposed a hierarchical, semantic CNN to perform two-stage classification wherein the first stage characterized the diagnostic semantic attributes and the second stage derived the nodule malignancy score.

Thus, from the early attempts in addressing the problem of nodule classification as a binary classification task, research has progressed to pose it as a multi-task learning of its semantic attributes with scores to assist the final classification. This shift attests to the challenging nature of the problem and signals the need for a basis for a decision even within a data-driven paradigm such as DL. Specifically, the morphological characteristics of the nodule captured in the CT image appear to form the basis for its likelihood to be malignant or benign. We revisit this problem and seek to design a system that performs the malignancy prediction and generates a supporting explanation, namely, semantic attributes by learning from limited annotations. Ideally, it would be good to learn to generate explanations for decisions in an unsupervised manner; however, with the nature of the explanation for lung nodules being semantic attribute scores, it is impossible to do so. Hence, we aim to learn it with weak supervision. (In this paper, weak supervision signifies incomplete, limited, or partially annotated training data.) The constraint of limited annotation is a hard reality in the medical domain, where it is relatively easier to source images with decisions (global annotation as normal/abnormal or benign/malignant, in our case) than obtain reasons for the decisions (local annotations), even a basic one such as a region of interest in the image. With the current problem, it is easier to source labeled images for nodule classification than to source nodule locations or scores for the semantic features of the nodules. This is because local annotations, such as manual localization, and scoring are labor intensive. Hence, our interest is to design a system for nodule classification that can be trained with N nodule images (extracted from CT volumes) in which every nodule has a class label but only M ≪ N have information about the nodule morphology.
This is a challenging problem to solve as the quantum of training data has a serious impact on the performance of supervised DL systems.

We propose a solution wherein a model learns to label a given nodule image under full supervision and learns to generate an explanation under weak supervision. The key contributions of this work include (i) a novel architecture and training regime for lung nodule classification with an explanation and (ii) demonstration of a methodology to design an explainable computer-aided diagnostic system that can accommodate limited availability of locally annotated data. An extensive set of experiments as described in Sec. 3 was performed to evaluate the proposed solution on several key aspects. In the next section, we present the proposed neural architecture and learning approach.

2. Deep Network for Explainable Malignancy Classification System

2.1. Model Architecture

A small and custom-designed architecture is devised to solve the problem at hand instead of using deep architectures such as ResNet or DenseNet. The size of the nodules in the LIDC-IDRI14 dataset generally is under 40×40×40, which is significantly small relative to natural images, i.e., those used to train deep architectures such as ResNet or DenseNet. Using such deep networks that are pre-trained on millions of natural images can also lead to over-fitting on a small dataset such as LIDC-IDRI.

The basic principle behind the proposed design is to enable malignancy prediction that is guided by the semantic attributes. Figure 1 shows the proposed complete network architecture, which consists of three sub-networks: feature pooling network, malignancy prediction branch, and auxiliary attribute prediction branch. The feature pooling network along with the malignancy prediction branch will be referred to as the primary network, and the feature pooling network along with auxiliary attribute prediction branch will be referred to as the attribute (Att) network. Since malignancy prediction is to be learned in a fully supervised manner whereas the semantic attributes are to be learned in a weakly supervised manner, two separate branches were designed for each of these tasks. A hierarchical representation of the nodule is desirable to learn features that are common for both tasks to gather both the local and global information present in the lung nodule. A sub-network referred to as the feature pooling network was designed for this purpose.

Fig. 1. Proposed network for malignancy and semantic attribute prediction for lung nodules.

The feature pooling network has three input streams generated from the nodule region (40×40×40) of the CT scan to represent the nodule in a hierarchical manner, namely, its centroid region, full nodule, and full nodule+background tissue. The centroid region/slice is most informative (Yan and Pang,15 Jung and Kim16). The remaining sub-volumes provide contextual and morphological information needed for malignancy classification as well as the prediction of attributes. Accordingly, the inputs to the feature pooling network are (see Fig. 1): (a) middle slice of size (40×40×1), (b) a sub-volume of eight middle slices of size (40×40×8), and (c) a sub-volume of 16 middle slices of size (40×40×16). Each parallel unit in the feature pooling network consists of four convolution blocks, and each block has two convolutional layers followed by a max pooling layer (blue arrow in Fig. 1). Max pooling is skipped in the central convolution block to maintain the size of the feature map (red arrow in Fig. 1). The feature maps at the end of the parallel units, each of size (5×5×512), are concatenated to get a feature map of size (5×5×1536). The concatenation of the features helps to gather both the local and global information present in the inputs and to learn the semantic and appearance features simultaneously.
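The sizes quoted above can be checked with a short calculation. The sketch below is illustrative only: the helper names (`middle_slices`, `stream_output_shape`) are hypothetical, and it assumes size-preserving convolutions, 2×2 max pooling, and a 40-slice volume whose central slice sits near the middle. None of these details are stated explicitly in the text.

```python
def middle_slices(depth, count):
    """Indices of the `count` central slices of a volume with `depth` slices."""
    start = (depth - count) // 2
    return list(range(start, start + count))


def stream_output_shape(side=40, blocks=4, unpooled_blocks=1, channels=512):
    """Feature-map shape at the end of one parallel unit.

    Assumes each pooled block halves the spatial side (2x2 max pooling) and
    that convolutions preserve the spatial size; one block skips pooling.
    """
    for _ in range(blocks - unpooled_blocks):
        side //= 2
    return (side, side, channels)


# Three input streams: 1, 8, and 16 middle slices of the 40x40x40 region.
streams = {
    name: middle_slices(40, count)
    for name, count in (("slice", 1), ("sub8", 8), ("sub16", 16))
}

per_stream = stream_output_shape()
# Concatenating the three streams along the channel axis.
fused = (per_stream[0], per_stream[1], per_stream[2] * len(streams))
```

Under these assumptions, three pooling steps take the 40×40 input to 5×5, and concatenating three 512-channel maps yields the 1536 channels stated in the text.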

The output of the feature pooling network is fed to two branches: a malignancy prediction branch and an auxiliary attribute prediction branch. Both branches have a convolutional layer followed by a dense layer. As there are eight different attribute outputs in the auxiliary attribute prediction branch, eight different dense layers are attached, and at the end of each one, there is a semantic attribute prediction. A sigmoid activation function is chosen for the last layer for the purpose of classification of attributes and is changed to the linear activation function for the estimation/regression of the attribute values, as described in the next section. The feature map derived by the auxiliary attribute prediction branch is concatenated to the feature map in the malignancy prediction branch after their respective convolution layers to ensure that the malignancy decision is guided by the attribute features. In other words, the morphological characteristics are made to influence the decision about the malignancy of a nodule.

2.2. Training Methodology

The network has two tasks to learn in which one is to be done with full supervision and the other with weak supervision. Hence, the training set (see Table 1) was partitioned into two separate mutually exclusive sets: a set (Sf) of nodules with full annotation, i.e., both malignancy class label and attribute scores, and a set (Sp) of nodules having partial annotation, i.e., only malignancy class label. Our interest is to get good performance with a Sf with a size that is as small as possible.

Table 1. Distribution of nodule data, their classes and semantic attributes for nodules in the LIDC-IDRI dataset. Only four (of the eight) attributes, namely, margin, sphericity, subtlety, and calcification, are listed for illustration due to space constraints.

| Data type | Class label: benign | Class label: malignant | Margin: poorly defined | Margin: sharp | Sphericity: linear | Sphericity: round | Subtlety: extremely subtle | Subtlety: obvious | Calcification: presence | Calcification: absence |
|---|---|---|---|---|---|---|---|---|---|---|
| Train | 1261 | 401 | 411 | 1251 | 406 | 1256 | 530 | 1132 | 205 | 1457 |
| Validation | 291 | 93 | 96 | 288 | 103 | 281 | 128 | 256 | 52 | 332 |
| Test | 388 | 124 | 124 | 388 | 138 | 374 | 159 | 353 | 68 | 444 |

Phase-wise learning is known to produce more generalized learned representations leading to a superior performance (Barshan et al.17). We take a cue from this and feed the learned features in one stage as an input to the next stage in a phase-wise manner so that we have a regularization effect. Training was designed to be carried out in four phases for 30, 50, 40, and 80 epochs, respectively. The number of epochs for each individual phase was empirically determined based on the convergence of the validation loss and our stopping criteria.

In the first phase of training, the entire network including the primary and Att networks is trained to enable the feature pooling network to learn the features for both malignancy and attribute classification/estimation while the auxiliary attribute prediction branch learns specific features for different attributes. In the second phase, the primary network is trained to further boost the accuracy of the malignancy classification. In the third phase, the Att network is trained to enable learning of semantic attributes by the feature pooling network for attribute classification/estimation. In the last phase, only the malignancy prediction branch is trained while keeping all of the other network weights constant to fine tune the weights of the malignancy prediction branch to leverage the optimization carried out in the third phase. Nodules from Sp were used in the phases in which only the primary network or malignancy prediction branch was trained, whereas in all other phases, nodules from Sf were used. More specifically, Sf was used in the first and third phases of the training, and Sp was used in the second and fourth phases.

We train for two attribute-related tasks—(i) attribute classification on attribute class labels obtained using a pre-defined threshold as explained in Sec 3.1 and (ii) estimation of the value of the attribute by regression to present more fine-grained detailing of the semantic attributes. The activation function in the output layer of the auxiliary attribute prediction branch is varied depending on which task is being considered: a sigmoid function for (i) and a linear function for (ii). The loss function for these tasks is also changed as shown in Table 2.

Table 2. Loss functions used in different attribute related prediction tasks.

| Training phase | Malignancy classification and attribute classification | Malignancy classification and attribute estimation |
|---|---|---|
| Phase 1 | BCEmc and BCEac | BCEmc and MSEae |
| Phase 2 | BCEmc | BCEmc |
| Phase 3 | BCEac | MSEae |
| Phase 4 | BCEmc | BCEmc |

Note: Abbreviations used: BCE, binary cross-entropy loss; MSE, mean squared error loss; mc, malignancy classification; ac, attribute classification; ae, attribute estimation.
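The four-phase regime (epochs, trained sub-network, training subset, and active losses) can be summarized as a small schedule. This is only a bookkeeping sketch of what the text and Table 2 state; the names `Phase`, `SCHEDULE`, and `losses_for` are hypothetical and no training framework is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    number: int
    epochs: int
    trained: str   # part of the network that receives gradient updates
    data: str      # "Sf" = fully annotated set, "Sp" = partially annotated set
    losses: tuple  # active loss terms (attribute-classification variant)


SCHEDULE = (
    Phase(1, 30, "entire network", "Sf", ("BCE_mc", "BCE_ac")),
    Phase(2, 50, "primary network", "Sp", ("BCE_mc",)),
    Phase(3, 40, "Att network", "Sf", ("BCE_ac",)),
    Phase(4, 80, "malignancy prediction branch", "Sp", ("BCE_mc",)),
)


def losses_for(phase, regression=False):
    """Swap the attribute BCE term for MSE when attributes are regressed."""
    return tuple(
        "MSE_ae" if regression and term == "BCE_ac" else term
        for term in phase.losses
    )


total_epochs = sum(phase.epochs for phase in SCHEDULE)
```

For example, `losses_for(SCHEDULE[0], regression=True)` gives the phase 1 entry of the attribute-estimation column in Table 2.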

A weighted sum of losses [binary cross-entropy (BCE) loss for malignancy classification and mean squared error (MSE) or BCE loss for attribute prediction] is used as the loss function throughout the training. The weight for the malignancy classification loss is set to 1, whereas the weights for the attribute losses are determined from their class distribution in the training data (Table 1) to handle class imbalance. In phases 2, 3, and 4 of the training, the loss of the branch that is not being trained is set to 0 and is not back-propagated. Training is performed with randomly initialized network weights. The Adam optimizer with default parameters β1=0.9, β2=0.999, ε=1e−07 is used with a learning rate of 0.01 and batch size of 32. The learning rate is reduced by a factor of 5 if the validation loss stagnates for a pre-defined number of epochs. Training is stopped (stopping criteria) if the average change in the validation loss over five epochs is less than 1/100th of the current loss, i.e., the weighted sum of losses. Table 1 summarizes the distribution of images used for training, validation, and testing.
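One possible reading of the stopping rule and the learning-rate reduction is sketched below. It assumes that "average change over five epochs" means the mean absolute epoch-to-epoch difference in validation loss over the last five transitions; the text does not specify this, and the function names are hypothetical.

```python
def should_stop(val_losses, window=5, fraction=0.01):
    """Return True when the mean absolute change in validation loss over the
    last `window` epoch-to-epoch transitions is below `fraction` of the
    current (most recent) validation loss."""
    if len(val_losses) < window + 1:
        return False  # not enough history yet
    recent = val_losses[-(window + 1):]
    changes = [abs(later - earlier) for earlier, later in zip(recent, recent[1:])]
    return sum(changes) / window < fraction * recent[-1]


def reduce_lr(lr, factor=5.0):
    """Reduce the learning rate by the stated factor of 5."""
    return lr / factor
```

With the stated initial learning rate of 0.01, one reduction gives 0.002. The number of stagnant epochs that triggers a reduction is not given in the text, so it is left out here.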

3. Experimental Results

3.1. Dataset

The Lung Image Database Consortium dataset (LIDC-IDRI)14 is used for our experimentation. The dataset contains 1018 CT scans, and annotations were provided by a maximum of four radiologists for each scan. Nodules marked with diameter ≥3 mm are considered for nodule region extraction. In a CT scan, each annotated nodule is given a malignancy score along with scores for each of the eight attributes (margin, calcification, sphericity, spiculation, lobulation, texture, subtlety, and internal structure). The malignancy score ranges from 1 to 5, where 1 denotes least malignancy (i.e., benign) and 5 denotes highly malignant. The significance of the scores assigned for each of the attributes is shown in Table 3.

Table 3. Semantic attributes, their significance, and scores for lung nodules.

| Attribute | Significance | Rating | Range | Threshold |
|---|---|---|---|---|
| Margin | How well defined the margin is | Poorly defined, near poorly defined, medium margin, near sharp, sharp | [1-5] | 3 |
| Spiculation | How spiculated the nodule is | No spiculation, nearly no spiculation, medium spiculation, near marked spiculation, marked spiculation | [1-5] | 3 |
| Calcification | Pattern of calcification | Popcorn, laminated, solid, non-central, central, absent | [1-6] | 5 |
| Sphericity | Shape of nodule | Linear, ovoid/linear, ovoid, ovoid/round, round | [1-5] | 3 |
| Lobulation | How irregular the margin is | No lobulation, nearly no lobulation, medium lobulation, near marked lobulation, marked lobulation | [1-5] | 3 |
| Internal structure | Internal composition of nodule | Soft tissue, fluid, fat, air | [1-4] | 3 |
| Subtlety | Difference between outside tissue and nodule | Extremely subtle, moderately subtle, fairly subtle, moderately obvious, obvious | [1-5] | 3 |
| Texture | Nodule’s texture | Non-solid, non-solid, partially solid, solid, solid | [1-5] | 3 |

The malignancy scores provided by the experts are averaged, and the nodule is labeled as malignant if the average score is >3 and benign otherwise. For the classification of the attributes, the average attribute score was binarized using a threshold as shown in Table 3. The threshold for all attributes is taken as three except for the calcification and internal structure attributes to make a clear distinction between the generated classes based on their physical meanings as described in the Rating column of this table. For the estimation of actual attribute values, the averaged attribute score was scaled and normalized from 0 to 1 and taken as the ground truth score. For instance, if the scores given by experts for a particular attribute are 3, 4, 3, and 5, then the ground truth score for the estimation of that attribute is taken to be (15/4)/5=0.75.
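The label construction described above reduces to two small functions. This is a minimal sketch with hypothetical function names; dividing by the maximum rating (5 for most attributes) follows the worked example in the text, and using the same rule for attributes with a different rating range is an assumption.

```python
def malignancy_label(ratings, threshold=3):
    """1 (malignant) if the mean expert rating exceeds the threshold, else 0."""
    mean_rating = sum(ratings) / len(ratings)
    return 1 if mean_rating > threshold else 0


def normalized_score(ratings, max_rating=5):
    """Mean expert rating scaled by the maximum rating (assumed normalisation)."""
    mean_rating = sum(ratings) / len(ratings)
    return mean_rating / max_rating


# Worked example from the text: ratings 3, 4, 3, 5 -> (15/4)/5 = 0.75
example = normalized_score([3, 4, 3, 5])
```

Note that a nodule whose average malignancy rating is exactly 3 is labeled benign under this rule, since the comparison is strict.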

3.2. Preprocessing

The slice thickness across CT scans in the LIDC-IDRI dataset varied from 0.45 to 5 mm. Hence, each scan was re-sampled to a fixed resolution of 1 mm/voxel along all three axes (isotropic along all three dimensions). A pre-defined volume of 40×40×40 was extracted around the center of each nodule marked by the radiologist(s). It was observed that almost all of the nodules could fit into a patch of size 20×20. A patch of size 40×40 was chosen to accommodate all of the nodules and to provide sufficient background information. Typically, the voxel values in the extracted volume are from −3000 to 2000 HU. The windowing operation was performed to filter out the air and bone region using a window width of 1200 and a window center of −400. The range of values obtained after filtering was from −1000 to 200 HU. This was normalised to the range of 0 to 1 before being fed to the network.
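The windowing and normalisation step can be illustrated per voxel as follows. The sketch takes the window center as −400 HU and the width as 1200 HU, which gives the interval [−1000, 200] HU. Clipping values outside the window to its limits and then scaling linearly is an assumption about how the filtering was done, and the function name is hypothetical.

```python
def window_and_normalize(hu, center=-400.0, width=1200.0):
    """Clip a Hounsfield value to the window and scale it linearly to [0, 1]."""
    low = center - width / 2.0   # -1000 HU with the defaults
    high = center + width / 2.0  # 200 HU with the defaults
    clipped = min(max(hu, low), high)
    return (clipped - low) / (high - low)


# Applying the mapping to a few representative voxel values.
samples = [window_and_normalize(v) for v in (-3000, -1000, -400, 200, 2000)]
```

Values at or below −1000 HU map to 0, values at or above 200 HU map to 1, and the window center maps to 0.5.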

The slice of the marked nodule having the largest area of the connected component extracted from the segmentation map provided for each nodule in the dataset is considered the central slice of the nodule for further processing. The sub-volumes with 8 or 16 slices denote the total number of slices, i.e., the central and its neighbouring slices. The size of the sub-volumes was chosen to be an even number only for computational convenience. Each input nodule was randomly shifted and rotated to increase the training set size and to avoid over-fitting.

3.3. Results

A set of experiments was designed to assess the impact of various factors on the performance of the proposed method for malignancy classification and generation of semantic attributes. A five-fold cross validation was done, and the average values of accuracy, area under receiver operating curve (AUC), sensitivity, and specificity were computed and reported for most experiments.
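For reference, the threshold-based metrics named above can be computed from binary labels as in the sketch below, and then averaged over the folds. The function names are hypothetical, AUC is omitted because it needs the continuous scores, and the code assumes each fold contains at least one positive and one negative example.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for 0/1 labels (1 = malignant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def mean_over_folds(fold_metrics):
    """Average each metric across the cross-validation folds."""
    keys = fold_metrics[0].keys()
    return {k: sum(m[k] for m in fold_metrics) / len(fold_metrics) for k in keys}
```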

3.3.1. Semantic attribute generation

Since our main aim is to do this task with weak supervision, an experiment was done to determine the level of supervision required for attribute generation. Training was done with various sizes for the sets of nodule images with full (Sf) and partial (Sp) annotations, respectively. Let the ratio of the number of nodule images in Sf versus Sp be 1:N. For instance, if the training set has a total of 100 images, a ratio 1:3 indicates that there will be 25 nodules in Sf and 75 nodules in Sp. This means that the attribute score prediction task is to be learned with supervision from just a quarter of the training data. Thus the ratio represents the degree of weak supervision. A large N would imply very weak supervision and require the network to learn from very few examples. Training was done for N=3, 4, and 5 as we wished to minimise the need for fully annotated images. N<3 was not considered because it is inconsistent with the design goal. The resulting malignancy and attribute classification results are presented in Table 4. As expected, there is a degradation in the performance with increasing N, with a greater decline for N=5. The decline is sharper in attribute prediction because the sparsity in attribute features (as N increases) affects learning of that task by the attribute prediction branch of the network. The decline rate for malignancy prediction is lower because the role of the attribute features is only of guidance for the learning of this task by the malignancy prediction branch of the network. While the malignancy classification accuracy is marginally better for N=4 than for N=3 (89.4% versus 89.1%), the malignancy AUC and the attribute prediction performance are better for N=3. We chose N=3 as the desired ratio for all our subsequent experiments as accurate explanations are important in building confidence among the end users (medical community) in the malignancy prediction of our model.
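The 1:N partition of the training set into Sf and Sp can be sketched as follows. The random shuffling, the seed, and the function name are assumptions made for illustration; the text does not say how nodules were assigned to the two sets.

```python
import random


def split_full_partial(nodule_ids, n, seed=0):
    """Split ids into (Sf, Sp) so that |Sf| : |Sp| is approximately 1 : n.

    Sf: nodules with full annotation (class label and attribute scores).
    Sp: nodules with partial annotation (class label only).
    """
    ids = list(nodule_ids)
    random.Random(seed).shuffle(ids)  # assumed random assignment
    n_full = len(ids) // (n + 1)
    return ids[:n_full], ids[n_full:]


# Example from the text: 100 images at ratio 1:3 -> 25 in Sf, 75 in Sp.
s_f, s_p = split_full_partial(range(100), n=3)
```

The two sets are mutually exclusive by construction, matching the partition described in Sec. 2.2.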

Table 4. Malignancy and attribute prediction performance. Training data ratio (DR) 1:N indicates that the number of nodules with partial (only malignancy label) annotation was N times those with full annotation (malignancy label + attribute information).

| DR | Mal Acc | Mal AUC | Mal Sens | Mal Spec | Spi Acc | Spi AUC | Sub Acc | Sub AUC | Mar Acc | Mar AUC | Tex Acc | Tex AUC | Int Acc | Int AUC | Cal Acc | Cal AUC | Lob Acc | Lob AUC | Sph Acc | Sph AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1:3 | 89.1 | 0.91 | 88.5 | 91.1 | 81.4 | 0.823 | 66.8 | 0.702 | 79.6 | 0.816 | 82.2 | 0.882 | 97.3 | 0.964 | 91.5 | 0.903 | 74.3 | 0.775 | 66.4 | 0.76 |
| 1:4 | 89.4 | 0.889 | 87.4 | 92.7 | 70.3 | 0.74 | 57.4 | 0.582 | 72.3 | 0.703 | 74.5 | 0.729 | 89.3 | 0.875 | 81.3 | 0.84 | 64.1 | 0.607 | 55.2 | 0.523 |
| 1:5 | 82.4 | 0.854 | 81.3 | 82.9 | 38.7 | 0.332 | 43.1 | 0.464 | 58.5 | 0.573 | 51.2 | 0.548 | 70.1 | 0.74 | 63.9 | 0.62 | 40.6 | 0.399 | 42.4 | 0.458 |

Note: Bold value represents the highest AUC or accuracy achieved for a particular attribute using a different data ratio (DR).

3.3.2. Malignancy prediction

In these experiments, the performance of the malignancy prediction was assessed (with the attribute prediction branch removed) to serve as a baseline for the proposed system. The obtained results with each of the input streams (one slice, eight slices, and 16 slices) are listed in the top three rows of Table 5. It can be observed that the malignancy prediction accuracy and AUC are better with an eight-slice sub-volume input since it offers more contextual and background information. The performance degrades for a larger sub-volume (16 slices). This appears to be due to the large, fuzzy boundaries of small nodules (particularly benign ones where the nodule is restricted to the middle 8 to 10 slices).

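Table 5. Lung nodule classification (malignancy and attributes) results for the different variants of the proposed network (NR: not reported).

| Approach | Mal Acc | Mal AUC | Mal Sens | Mal Spec | Spi Acc | Spi AUC | Sub Acc | Sub AUC | Mar Acc | Mar AUC | Tex Acc | Tex AUC | Int Acc | Int AUC | Cal Acc | Cal AUC | Lob Acc | Lob AUC | Sph Acc | Sph AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Single slice | 73.6 | 0.782 | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR |
| Eight slices | 81.3 | 0.822 | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR |
| Sixteen slices | 79.8 | 0.76 | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR |
| Variant-1 | 78.9 | 0.803 | 77.2 | 79.6 | 74.3 | 0.721 | 62.3 | 0.645 | 70.3 | 0.734 | 68.2 | 0.76 | 95.2 | 0.937 | 85.6 | 0.824 | 65.4 | 0.689 | 67 | 0.72 |
| Variant-8 | 84.5 | 0.878 | 83 | 86.3 | 79.2 | 0.812 | 61.5 | 0.632 | 72.9 | 0.79 | 74.6 | 0.812 | 94.6 | 0.95 | 86.9 | 0.853 | 68.7 | 0.738 | 63.2 | 0.65 |
| Variant-16 | 81.2 | 0.835 | 80.4 | 82.8 | 78.6 | 0.808 | 65.5 | 0.66 | 72.2 | 0.782 | 74 | 0.80 | 95.8 | 0.941 | 84.3 | 0.868 | 69.9 | 0.724 | 65.7 | 0.715 |
| Proposed | 89.1 | 0.91 | 88.5 | 91.1 | 81.4 | 0.823 | 66.8 | 0.702 | 79.6 | 0.816 | 82.2 | 0.882 | 97.3 | 0.964 | 91.5 | 0.903 | 74.3 | 0.775 | 66.4 | 0.76 |
| Shen et al.13 | 83.4 | 0.848 | 66.8 | 88.9 | NR | NR | 71.9 | 0.803 | 72.5 | 0.776 | 83.4 | 0.850 | NR | NR | 90.8 | 0.93 | NR | NR | 55.2 | 0.568 |
| Liu et al.12 | 93.5 | 0.979 | 93.0 | 89.4 | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR |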

Note: Bold value represents the highest AUC or accuracy achieved for a particular attribute using different variants of the proposed network.

3.3.3. Malignancy and attribute classification

In this experiment, the aim was to assess the performance of the proposed system for different types of inputs. The system’s output was a label for the nodule (benign/malignant) and class labels for its attributes (Table 3). Three variants of the proposed system in Fig. 1 were constructed by choosing the input to be one of the three streams. The obtained results are listed in Table 5. The variants are denoted as variant-n with n denoting the number of slices in the input; n=1, 8, and 16. The accuracy and AUC for the malignancy and attribute prediction are listed in the columns. The performance of the proposed system along with the results of some state-of-the-art solutions (Liu et al.12 and Shen et al.13) are also listed in the last three rows of this table. The ROC curve for the malignancy classification task performed using the proposed system is also shown in Fig. 2.

Fig. 2. ROC-AUC Curve for malignancy classification.

An improvement can be seen in the results of the three variants over the corresponding baseline results (first three rows). Further, the proposed system performs better than all of the variants: relative to the best variant (variant-8), the malignancy accuracy rises from 84.5% to 89.1% and the AUC from 0.878 to 0.91. An explainable lung cancer diagnosis solution was recently proposed by Shen et al.13 wherein the malignancy classification and prediction of five semantic attribute scores was done with full supervision. Barring the prediction of subtlety and texture attributes, our system outperforms Shen et al.13 despite learning three additional attributes under weak supervision. In another recent work, Liu et al.12 posed malignancy and attribute prediction as a multi-task problem and proposed a Siamese network-based model that learned to perform the tasks under full supervision. The results reported by Liu et al.12 are in the last row. However, the report was for a pruned set of nodules, obtained after excluding nodules having an average malignancy score by the four experts of exactly three. For a fair comparison, our system was also tested on such a pruned set and was found to achieve an accuracy of 91.7% and an AUC of 0.94. The fact that this is achievable after training with just 25% of images having local annotations is noteworthy. Liu et al.12 have not reported attribute classification results but did report results of attribute score regression, which is studied in the next experiment.

3.3.4. Semantic attribute score regression

A regression was done to derive numerical scores (rather than an attribute label). Sample nodule images are shown in Fig. 3. The predicted malignancy labels for these sample nodules are given in Table 6 along with the attribute scores. The convention x/y in the table results denotes the computed/ground truth values. Clinical studies by Truong et al.18 and Sanchez et al.19 suggest that benign nodules have smooth and regular margins, whereas malignant ones have ill-defined margins and are irregular and spiculated. The scores in Table 6 are consistent with this clinical observation: high scores for spiculation and low scores for margin in the case of malignant nodules and the opposite trend for benign nodules. Further, Ref. 18 states that the common patterns of benign nodules are characterized by calcifications that are diffuse, central, laminated, or popcorn. The low calcification score values for benign nodules (first two nodules in Fig. 3) support this statement, whereas the high calcification scores for malignant nodules (third and fourth nodules in Fig. 3) signal the absence of calcification and support the model’s (correct) malignancy prediction.

Fig. 3. Six sample nodule images and their predicted labels. Predictions for images from left to right: images 1 and 2 are benign (true); images 3 and 4 are malignant (true); image 5 is malignant (false) and image 6 is benign (false).

Table 6. Predicted attribute values for the images shown in Fig. 3. Listed are the predicted/ground truth values, where the ground truth is the normalised average of the ratings from experts.

| Image # (from left to right) | Mal | Spi | Sub | Mar | Tex | Int | Cal | Lob | Sph |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0/0 | 0.32/0.26 | 0.87/0.93 | 0.96/0.93 | 0.99/1.0 | 0.01/0.0 | 0.55/0.5 | 0.52/0.66 | 0.65/0.66 |
| 2 | 0/0 | 0.19/0.25 | 0.92/1.0 | 0.97/1.0 | 0.99/1.0 | 0.0/0.0 | 0.49/0.5 | 0.33/0.25 | 0.98/0.95 |
| 3 | 1/1 | 0.87/0.93 | 0.86/1.0 | 0.47/0.53 | 0.71/0.6 | 0.0/0.0 | 0.94/1.0 | 0.60/0.53 | 0.52/0.53 |
| 4 | 1/1 | 0.94/0.9 | 0.89/1.0 | 0.73/0.75 | 0.79/0.75 | 0.02/0.0 | 0.99/1.0 | 0.64/0.65 | 0.75/0.7 |
| 5 | 1/0 | 0.24/0.0 | 0.63/0.75 | 0.94/1.0 | 0.41/0.25 | 0.01/0.0 | 0.39/0.50 | 0.22/0.25 | 0.92/0.9 |
| 6 | 0/1 | 0.45/0.7 | 0.92/0.95 | 0.71/0.8 | 0.67/0.8 | 0.01/0.0 | 0.84/1.0 | 0.43/0.5 | 0.79/0.7 |

The misclassified examples (last two nodules in Fig. 3) do not have scores consistent with the expected trend. For instance, the fifth nodule (from left) has a low spiculation score and high margin score, which is not consistent with the model’s prediction of the nodule being malignant. Providing these attribute scores as an explanation can help a clinician resolve this inconsistency and help to build confidence in the model’s performance.

A quantitative assessment based on the mean error (absolute difference) between the ground truth and the predicted values of the attributes was done. Results are reported in Table 7 on the original un-normalised data, as is the practice in recent literature. The last two rows show results of methods that perform both malignancy and attribute score prediction while the other methods perform only attribute score prediction. The Siamese network-based fully supervised system of Liu et al.12 has the lowest error for margin, spiculation, subtlety, and lobulation, whereas our model, which is based on a lighter network, has the lowest error for sphericity, internal structure, and texture and gives comparable performance on the remaining attributes despite learning the score prediction under weak supervision.

Table 7.

Semantic attribute score regression performance: the mean absolute difference between unnormalized ground truth and predicted values is reported.

Attribute Mar Spi Sph Sub Int Cal Lob Tex
ENet8 0.98 0.86 1.09 1.20 0.14 1.44 0.96 1.24
MTR7 0.86 0.80 0.81 0.75 0.04 0.48 0.87 0.58
Chen et al.10 0.90 0.75 0.86 0.89 0.10 0.51 0.85 0.73
Liu et al.12 0.54 0.49 0.59 0.54 0.03 0.56 0.54 0.44
Proposed method 0.71 0.61 0.55 0.75 0.02 0.55 0.60 0.42

Note: Bold value represents the minimum mean absolute difference reported for a particular attribute by any method.
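The error measure used in Table 7 can be illustrated with a minimal sketch. It assumes the six predicted/ground truth pairs listed in Table 6 as input, so it works on normalised values and on six nodules only; it does not reproduce the Table 7 figures, which are computed on unnormalised ratings over the full test set. The names `TABLE_6`, `ATTRIBUTES`, and `mean_absolute_error` are illustrative and not taken from the authors' code.

```python
from statistics import mean

# Attribute columns of Table 6 (the malignancy column is left out).
ATTRIBUTES = ["Spi", "Sub", "Mar", "Tex", "Int", "Cal", "Lob", "Sph"]

# (predicted, ground truth) pairs per nodule, copied from Table 6 (normalised scale).
TABLE_6 = [
    [(0.32, 0.26), (0.87, 0.93), (0.96, 0.93), (0.99, 1.0), (0.01, 0.0), (0.55, 0.5), (0.52, 0.66), (0.65, 0.66)],
    [(0.19, 0.25), (0.92, 1.0), (0.97, 1.0), (0.99, 1.0), (0.0, 0.0), (0.49, 0.5), (0.33, 0.25), (0.98, 0.95)],
    [(0.87, 0.93), (0.86, 1.0), (0.47, 0.53), (0.71, 0.6), (0.0, 0.0), (0.94, 1.0), (0.60, 0.53), (0.52, 0.53)],
    [(0.94, 0.9), (0.89, 1.0), (0.73, 0.75), (0.79, 0.75), (0.02, 0.0), (0.99, 1.0), (0.64, 0.65), (0.75, 0.7)],
    [(0.24, 0.0), (0.63, 0.75), (0.94, 1.0), (0.41, 0.25), (0.01, 0.0), (0.39, 0.50), (0.22, 0.25), (0.92, 0.9)],
    [(0.45, 0.7), (0.92, 0.95), (0.71, 0.8), (0.67, 0.8), (0.01, 0.0), (0.84, 1.0), (0.43, 0.5), (0.79, 0.7)],
]


def mean_absolute_error(rows, names):
    """Return {attribute: mean |predicted - ground truth|} over all rows."""
    errors = {}
    for j, name in enumerate(names):
        errors[name] = mean(abs(pred - truth) for pred, truth in (row[j] for row in rows))
    return errors


mae = mean_absolute_error(TABLE_6, ATTRIBUTES)
for name, value in mae.items():
    print(f"{name}: {value:.3f}")
```

Reporting one value per attribute, rather than a single pooled error, matches the column layout of Table 7 and keeps a poorly predicted attribute from being hidden by well-predicted ones.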

3.3.5. Robustness of the system

The proposed system takes a nodule region as input. Our experiments were done using the ground truth locations for nodules in the LIDC-IDRI dataset. To understand how the proposed system will handle random lung patches (false nodules) as input, an experiment was done as follows. A YOLO-based system proposed by Redmon et al.20 detects nodules from CT volumes. Its output contains both true and false detections, so it was used as input to our system. A total of 100 CT scans of the LIDC-IDRI dataset were passed through the above nodule detector with default parameters. The detector output had a total of 1126 lung nodule locations, of which 259 were true positives and 867 were false positives. These nodules were extracted and given as input to the proposed system. The system correctly classified 238 out of 259 true positives (both benign and malignant); the primary reason for misclassification of some of the true positives was observed to be a shift of the nodules from the center of the patch (predicted nodule location was off-centered), leading to an incomplete capture of the nodule in some patches. This indicates the need for better post-processing after YOLO-based detection. Of the 867 false positives, only 49 (5%) were classified as malignant while the others were classified as benign, which is desirable. These preliminary results are encouraging and establish the potential for our system to be part of a complete nodule detection and classification pipeline.
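The counts reported in this experiment can be turned into rates with a short sketch. It uses only the four numbers stated above (259 true positives, 867 false positives, 238 correctly classified true positives, 49 false positives labeled malignant); the class `DetectorCounts` and the function `summarise` are hypothetical names introduced for illustration, not part of the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorCounts:
    true_positives: int            # detections that are real nodules
    false_positives: int           # detections that are not nodules
    tp_correctly_classified: int   # real nodules given the right benign/malignant label
    fp_labelled_malignant: int     # non-nodule patches labeled malignant


def summarise(c: DetectorCounts) -> dict:
    """Derive totals and rates from the raw counts."""
    fp_benign = c.false_positives - c.fp_labelled_malignant
    return {
        "total_detections": c.true_positives + c.false_positives,
        "tp_accuracy": c.tp_correctly_classified / c.true_positives,
        "fp_labelled_benign": fp_benign,
        "fp_benign_rate": fp_benign / c.false_positives,
        "fp_malignant_rate": c.fp_labelled_malignant / c.false_positives,
    }


counts = DetectorCounts(
    true_positives=259,
    false_positives=867,
    tp_correctly_classified=238,
    fp_labelled_malignant=49,
)
summary = summarise(counts)
for key, value in summary.items():
    print(key, round(value, 3) if isinstance(value, float) else value)
```

Keeping true and false detections in separate rates makes it clear that the first measures classification quality on real nodules, while the second measures how often a spurious detection would still raise a malignancy alarm.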

4. Conclusions

Lack of explainability of decisions by existing CAD systems is one of the reasons that deter adoption of such solutions in clinical scenarios. In this paper, an explainable lung nodule classification system was presented. The system is based on a novel neural network architecture that is trained with a novel training regime to achieve the classification (benign/malignant) with full supervision and generate an explanation (attribute scores) with weak supervision. The system leverages both morphological and contextual information using a combination of 2D and 3D representations of the lung nodules. Our experimental results indicate that good malignancy and attribute score prediction can be obtained with the proposed system after training with N=1662 images in which just a quarter of the training set had full (both malignancy label and attribute scores) annotations while three quarters had only partial (malignancy label) annotation. The performance on balance is better than or on par with the leading fully supervised methods, even though the model learns the attribute prediction tasks, which provide the explanation for the main decision, in a weakly supervised manner. Overall, we believe that the proposed system has exhibited good potential to produce explainable decisions about nodule malignancy on its own or as part of a pipeline that takes a lung CT as input, outputs malignant nodules, and explains the reason for the classification decision semantically. The proposed approach can be extended for the development of other explainable CAD systems such as for breast cancer.

Acknowledgments

All of the model computation and training was done on the GPU clusters (ADA system) maintained for students at the International Institute of Information Technology, Hyderabad, India. We thank the college authorities and system admins for the allocation of these computational resources.

Biographies

Aniket Joshi is an undergraduate researcher at the Centre for Visual Information Technology Lab in the International Institute of Information Technology, Hyderabad, India. He is doing his MS in computer science under Professor Jayanthi Sivaswamy. His research interests include medical image analysis, self-explainable AI, particularly in the medical domain, and computer vision.

Jayanthi Sivaswamy received her PhD in electrical engineering from Syracuse University, Syracuse, New York. Since 2001, she has been with the International Institute of Information Technology, Hyderabad, India. Prior to that, she was with the University of Auckland, Auckland, New Zealand. Her research interests focus on medical image analysis, in particular CAD algorithm development.

Gopal Joshi received his PhD in computer science and engineering from the International Institute of Information Technology, Hyderabad, India. Currently, he is vice president of data science with Noodle Analytics Private Limited. His research interests include image analysis, computer vision, and deep learning for enterprise AI products.

Disclosures

The authors have no relevant conflicts of interest (financial or otherwise) to disclose.

Contributor Information

Aniket Joshi, Email: aniket.joshi@research.iiit.ac.in.

Jayanthi Sivaswamy, Email: jsivaswamy@iiit.ac.in.

Gopal Datt Joshi, Email: gopaljoshi4u@gmail.com.

Code, Data, and Materials Availability

The public dataset LIDC-IDRI14 is used in the paper. All code used for the experiments in the paper will be made publicly available shortly.

References

  • 1. Xie Y., et al., “Knowledge-based collaborative deep learning for benign-malignant lung nodule classification on chest CT,” IEEE Trans. Med. Imaging 38(4), 991–1004 (2019). 10.1109/TMI.2018.2876510
  • 2. Zhao X., et al., “Agile convolutional neural network for pulmonary nodule classification using CT images,” Int. J. Comput. Assisted Radiol. Surg. 13, 585–595 (2018). 10.1007/s11548-017-1696-0
  • 3. Causey J., et al., “Highly accurate model for prediction of lung nodule malignancy with CT scans,” Sci. Rep. 8, 9286 (2018). 10.1038/s41598-018-27569-w
  • 4. Shen W., et al., “Multi-crop convolutional neural networks for lung nodule malignancy suspiciousness classification,” Pattern Recognit. 61, 663–673 (2017). 10.1016/j.patcog.2016.05.029
  • 5. Li W., et al., “Pulmonary nodule classification with deep convolutional neural networks on computed tomography images,” Comput. Math. Methods Med. 2016, 6215085 (2016). 10.1155/2016/6215085
  • 6. Snoeckx A., et al., “Evaluation of the solitary pulmonary nodule: size matters, but do not ignore the power of morphology,” Insights Imaging 9, 73–86 (2018). 10.1007/s13244-017-0581-2
  • 7. Chen S., et al., “Bridging computational features toward multiple semantic features with multi-task regression: a study of CT pulmonary nodules,” Lect. Notes Comput. Sci. 9901, 53–60 (2016). 10.1007/978-3-319-46723-8_7
  • 8. Zou H., et al., “Regularization and variable selection via the elastic net,” J. R. Stat. Soc. Ser. B 67, 301–320 (2005). 10.1111/j.1467-9868.2005.00503.x
  • 9. Hussein S., et al., “Risk stratification of lung nodules using 3D CNN-based multi-task learning,” Lect. Notes Comput. Sci. 9901, 249–260 (2017). 10.1007/978-3-319-46723-8_7
  • 10. Chen S., et al., “Automatic scoring of multiple semantic attributes with multi-task feature leverage: a study on pulmonary nodules in CT images,” IEEE Trans. Med. Imaging 36(3), 802–814 (2017). 10.1109/TMI.2016.2629462
  • 11. Botong W., et al., “Joint learning for pulmonary nodule segmentation, attributes and malignancy prediction,” in IEEE 15th Int. Symp. Biomed. Imaging, pp. 1109–1113 (2018). 10.1109/ISBI.2018.8363765
  • 12. Liu L., et al., “Multi-task deep model with margin ranking loss for lung nodule analysis,” IEEE Trans. Med. Imaging 39(3), 718–728 (2020). 10.1109/TMI.2019.2934577
  • 13. Shen S., et al., “Explainable hierarchical semantic convolutional neural network for lung cancer diagnosis,” in IEEE Conf. Comput. Vision and Pattern Recognit. Workshops (2019).
  • 14. Armato S., III, et al., “The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans,” Med. Phys. 38, 915–931 (2011). 10.1118/1.3528204
  • 15. Yan X., et al., “Classification of lung nodule malignancy risk on computed tomography images using convolutional neural network: a comparison between 2D and 3D strategies,” Lect. Notes Comput. Sci. 10118, 91–101 (2017). 10.1007/978-3-319-54526-4_7
  • 16. Jung H., et al., “Classification of lung nodules in CT scans using three-dimensional deep convolutional neural networks with a checkpoint ensemble method,” BMC Med. Imaging 18, 48 (2018). 10.1186/s12880-018-0286-0
  • 17. Barshan E., Fieguth P., “Stage-wise training: an improved feature learning strategy for deep models,” in Proc. 1st Int. Workshop Feature Extraction: Modern Questions and Challenges at NIPS (2015).
  • 18. Truong M. T., et al., “Update in the evaluation of the solitary pulmonary nodule,” Radiographics 34(6), 1658–1679 (2014). 10.1148/rg.346130092
  • 19. Sánchez M., et al., “Management of incidental lung nodules <8 mm in diameter,” J. Thorac. Dis. 10(22), S2611–S2627 (2018). 10.21037/jtd.2018.05.86
  • 20. Redmon J., et al., “You only look once: unified, real-time object detection,” in IEEE Conf. Comput. Vision and Pattern Recognit., pp. 779–788 (2016). 10.1109/CVPR.2016.91

Articles from Journal of Medical Imaging are provided here courtesy of Society of Photo-Optical Instrumentation Engineers
