Author manuscript; available in PMC: 2022 Dec 7.
Published in final edited form as: Phys Med Biol. 2021 Dec 7;66(24):10.1088/1361-6560/ac3d16. doi: 10.1088/1361-6560/ac3d16

Improving robustness of a deep learning-based lung-nodule classification model of CT images with respect to image noise

Yin Gao 1, Jennifer Xiong 1, Chenyang Shen 1, Xun Jia 1
PMCID: PMC8694870  NIHMSID: NIHMS1762595  PMID: 34818638

Abstract

Objective:

Robustness is an important aspect to consider when developing methods for medical image analysis. This study investigated robustness properties of deep neural networks (DNNs) for a lung nodule classification problem based on CT images and proposed a solution to improve robustness.

Approach:

We first constructed a class of four DNNs with different widths, each predicting an output label (benign or malignant) for an input CT image cube containing a lung nodule. These networks were trained to achieve Area Under the Curve of 0.891-0.914 on a testing dataset. We then added to the input CT image cubes noise signals generated randomly using a realistic CT image noise model based on a noise power spectrum at 100 mAs, and monitored the change in the DNN's output. We defined SAR5(%) to quantify the robustness of the trained DNN model, indicating that for 5% of CT image cubes, the noise can change the prediction results with a chance of at least SAR5(%). To understand robustness, we viewed the information processing pipeline of the DNN as a two-step process, with the first step using all but the last layer to extract representations of the input CT image cubes in a latent space, and the second step employing the last fully-connected layer as a linear classifier to determine the position of the sample representations relative to a decision plane. To improve robustness, we proposed to retrain the last layer of the DNN with a Support Vector Machine (SVM) hinge loss function to enforce the desired position of the decision plane.

Main results:

SAR5 ranged from 47.0% to 62.0% across the DNNs. The unrobust behavior may be ascribed to the unfavorable placement of the decision plane in the latent representation space, which allowed noise to push some samples across the decision plane and hence made them susceptible to noise. The DNN-SVM model improved robustness over the DNN model and reduced SAR5 by 8.8 ~ 21.0%.

Significance:

This study provided insights about the potential reason for the unrobustness behavior of DNNs and the proposed DNN-SVM model improved model robustness.

1. Introduction

Deep learning (DL) has recently demonstrated its great power in solving a number of machine learning problems in various domains including medicine (LeCun et al., 2015; Shen et al., 2020b; Cui et al., 2020; Sahiner et al., 2019). The mainstream approach of DL is to employ a deep neural network (DNN) with a large-scale hierarchical multi-layer structure to approximately represent a distribution by learning from data. With the flexibility and capacity of DNNs, it is possible to build models that capture the underlying structure hidden in the data and establish a direct end-to-end mapping from the input domain to the output domain. Within such a data-driven scheme, numerous studies have been performed to construct DL models for various purposes, such as medical image processing (Jiang et al., 2019; Han, 2017), disease diagnosis (Esteva et al., 2017; Lakhani and Sundaram, 2017), outcome prediction of therapeutic approaches (Ewbank et al., 2020; Men et al., 2019), radiotherapy treatment planning (Nguyen et al., 2019; Kontaxis et al., 2020; Shen et al., 2020a, 2019) and quality assurance (Nyflot et al., 2019; Tomori et al., 2018). These models have achieved superior performance compared to traditional machine learning models, and in some cases their performance was reported to be comparable to or even better than that of experienced clinicians.

Despite these successes, the large number of linear/non-linear numerical operations organized in the hierarchical structure of DNNs makes it extremely challenging to study the mathematical properties of DNN models theoretically. Studying these properties not only helps us interpret the results but, more importantly, also enables us to understand the strengths and limitations of the models, which is critical for applying the models confidently in real clinical contexts.

Among the many properties of DNN models, robustness is particularly important for medical applications. Generally speaking, robustness of a model means that its output is resilient to small perturbations of its input. As small perturbations are often inevitable in any data, e.g. in the form of noise signals due to data acquisition uncertainty, a robust model is critical for generating reliable outputs. Otherwise, the output would be unstable, which may mislead clinical decision making or even cause severe healthcare consequences for patients.

Unfortunately, several recent studies have revealed that robustness of DL models may be a concern (Eykholt et al., 2018; Madry et al., 2017; Akhtar and Mian, 2018; Yuan et al., 2019): for a well-trained model, there often exist small perturbations to the input data that may fool the model and make it confidently output incorrect results. This behavior has been observed in DL models for medical applications. Finlayson et al. (2018) demonstrated the existence of perturbed samples that are capable of manipulating three representative medical DL classifiers: those for the classification of diabetic retinopathy from retinal fundoscopy, pneumothorax from chest x-ray, and melanoma from dermoscopic photographs. Similar behavior was also observed in another study using a DL model for lung-nodule classification from CT images (Shen et al., 2020c). The broad existence of this robustness issue has raised concerns about the practical value of the models. Although the chance of having perturbations that can successfully fool the DL models may be relatively low, and generating these perturbations (often called attacks) sometimes requires carefully designed techniques, their existence is indeed a problem in health care, as incorrect decisions may have severe consequences. In a recent paper published in Science, the authors discussed the risk of this vulnerability of machine learning models in health care systems, urged practitioners to be aware of this issue when deploying DL systems in clinical settings, and called for active efforts to address this problem (Finlayson et al., 2019).

Besides classification problems, robustness has also been investigated for image processing problems and extensive studies have been performed to improve model robustness. Antun et al. (2020) discussed the possible causes of network instabilities. Wu et al. (2020) and Wu et al. (2021) proposed solutions to eliminate the instabilities using Analytic Compressed Iterative Deep based reconstruction method and Dual-domain Residual-based Optimization Network. Recent works also found that enforcing sparsity in network parameters may improve robustness (Hoefler et al., 2021), and hence sparsity was incorporated as one of the objectives in network training (Öztürk, 2021).

In a previous study, we investigated the robustness of a DL-based lung nodule classification model for CT images with respect to noise perturbations to the input images (Shen et al., 2020c). Noise in CT images is inevitable, and its amplitude can be relatively large when low milliampere-second (mAs) levels are used in CT scans to reduce x-ray exposure to patients. Understanding robustness of the model with respect to noise is therefore important for an objective evaluation of the model's practical value. The study found that, although the DNN was trained to classify 3D CT image cubes of lung nodules into malignant or benign groups with a satisfactory level of performance (Area Under the Curve of 0.91 for the testing dataset in a 10-fold cross validation as compared to radiologists' predictions), the robustness of the classification was a concern. With 100 random noise perturbations at 100 mAs, the DNN's predictions for 11.2% of training data and 17.4% of testing data were changed at least once. The percentages were even higher for purposely designed noise perturbations.

In this paper, we continued the study along this direction with the goal of understanding this behavior and developing approaches to mitigate the problem. The specific contributions are twofold. First, we analyzed the data processing pipeline in the DNN model by viewing it as a two-step process. The unrobustness issue can be partially ascribed to the second step, which uses a linear classifier with an unfavorable decision boundary to classify the data representations extracted in a latent space by the first step. Second, we proposed a modified model combining a DNN and a Support Vector Machine (SVM) to mitigate the robustness issue. We hope our study will shed some light on this problem and provide insights for future development.

2. Methods

2.1. DNN-based lung-nodule classification models

2.1.1. Network structure

We focused on the problem of lung nodule classification based on 3D CT images by studying a class of DNNs in four scenarios with different network sizes. Each of these DNNs was constructed to output a binary label for an input 3D CT image cube containing a lung nodule, indicating whether it is benign or malignant. All the DNN models were constructed using similar architectures (see Fig. 1) but with different widths. In all networks, the DNN consisted of N = 4 convolutional blocks. The i-th convolutional block (i = 0,…, 3) was constructed from a 3-D convolutional layer with 8k × 2^i 3-D convolution kernels, each of size 3 × 3 × 3, where k = 1,…, 4 is a parameter controlling the number of kernels. Each of the convolutional layers was followed by a rectified linear unit (ReLU), a batch normalization layer, and a 3-D pooling layer. The second half of the DNN contained M = 4 fully-connected blocks. The j-th block (j = 0,…, 3) contained a fully-connected layer with an output size of 8k × 2^(3−j). The fully-connected layer was followed by a ReLU layer, except in the last block, where a sigmoid layer was used after the fully-connected layer to ensure a binary output at the end. The sigmoid layer was realized using the function T(t) = [exp(t) − 1]/[exp(t) + 1], where t is the input variable to this layer.
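The output activation T(t) defined above is algebraically identical to tanh(t/2), so it maps any real input into the open interval (−1, 1). The following minimal NumPy sketch checks this numerically; the function name `T` follows the text, while the input grid is an illustrative assumption.

```python
import numpy as np

def T(t):
    """Output activation from the text: T(t) = (exp(t) - 1) / (exp(t) + 1)."""
    t = np.asarray(t, dtype=float)
    return (np.exp(t) - 1.0) / (np.exp(t) + 1.0)

t = np.linspace(-10.0, 10.0, 201)   # illustrative inputs, not from the paper
out = T(t)

# T(t) coincides with tanh(t / 2), so its values lie strictly inside (-1, 1).
identity_holds = bool(np.allclose(out, np.tanh(t / 2.0)))
in_range = bool(np.all((out > -1.0) & (out < 1.0)))
```

The sign of T(t) therefore follows the sign of its input, which is what allows the last layer to be read as a linear decision rule.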

Figure 1.

The detailed network architecture of DNN models used in this study.

2.1.2. Data

Training and testing of all the DNN models were performed using data provided by the Lung Image Database Consortium (LIDC) (Armato III et al., 2011) of Imaging Database Resource Initiative (IDRI) from The Cancer Imaging Archive (TCIA) (Clark et al., 2013). This LIDC dataset consists of lung CT images collected from eight different institutions. Each lung nodule in the CT images has been investigated and rated on a scale from 1 to 5 by multiple radiologists to characterize the malignancy suspiciousness of the nodule. A rating of 1 indicates the least suspicious and 5 indicates the most. Among all the nodules in the CT images, we considered only those with a size of at least 3 mm. For each nodule, we averaged the ratings given by all the radiologists and removed nodules with an ambiguous average rating, i.e. rating = 3. This yielded a set of 1,226 nodules from the LIDC dataset, of which 431 were malignant (rating > 3) and 795 were benign (rating < 3). Prior to model training, all the CT images were interpolated to the same resolution of 1 mm^3. After that, for each nodule, a 3D image cube of size 32×32×32 voxels was extracted to be used as the input to the DNN models. 80% (981 samples) of the nodules were randomly selected for training. These data were evenly split into five groups. In the training process for each of the four scenarios, a five-fold cross validation (CV) study was performed. In each fold, four groups of data were picked as training data (N_train = 884), and one group as validation data. Testing was performed using the remaining 20% (245 samples), which were not seen during training.
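The labeling rule described above (average the radiologists' ratings, discard an average of exactly 3, and threshold the rest) can be sketched in a few lines of Python. The function name `label_nodule` and the example ratings are hypothetical and serve only as an illustration.

```python
from statistics import mean

def label_nodule(ratings):
    """Return 'malignant', 'benign', or None (excluded) from radiologist ratings (1-5)."""
    avg = mean(ratings)
    if avg > 3:
        return "malignant"
    if avg < 3:
        return "benign"
    return None  # ambiguous average rating of exactly 3: the nodule is removed

# Hypothetical ratings for three nodules, for illustration only.
examples = {"a": [4, 5, 4], "b": [1, 2, 2, 3], "c": [2, 3, 4]}
labels = {name: label_nodule(r) for name, r in examples.items()}
```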

2.1.3. Network training

Training DNNs was achieved in the supervised learning fashion by solving the following optimization problem:

$$\theta = \arg\min_{\theta} \sum_{i \in T} \left\| D(x_i \mid \theta) - y_i \right\|_2^2, \qquad (1)$$

where x_i ∈ R^(32×32×32) represents the i-th CT image cube in the training dataset T, and y_i is the sample's corresponding reference label, with y_i = 1 if x_i is malignant and y_i = −1 otherwise. D(x_i | θ) represents the DNN function, and θ denotes the network parameters to be determined by the training process.
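As a small worked instance of the objective in Eq. (1), the sketch below evaluates the summed squared error for a handful of made-up network outputs. It assumes the ±1 label encoding (+1 malignant, −1 benign) that matches the decision rule in Section 2.2; the function name and the numbers are illustrative and not taken from the study.

```python
import numpy as np

def squared_error_loss(predictions, labels):
    """Sum over samples of (D(x_i | theta) - y_i)^2, as in Eq. (1)."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.sum((predictions - labels) ** 2))

# Hypothetical network outputs in (-1, 1) and reference labels (+1 malignant, -1 benign).
outputs = np.array([0.9, -0.8, 0.2, -1.0])
targets = np.array([1.0, -1.0, -1.0, -1.0])
loss = squared_error_loss(outputs, targets)
```

Here the third sample, whose output has the wrong sign, contributes 1.44 of the total, which shows that the objective rewards agreement with the labels but says nothing about where the decision boundary lies between correctly classified samples.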

Computations in this study were performed using Python with TensorFlow. The computational hardware platform included two Nvidia Quadro RTX 5000 GPUs, a 26-core CPU, and 64 GB of host memory. The optimization problem was solved using the adaptive moment estimation (ADAM) algorithm (Kingma and Ba, 2014). We trained each network for 500 epochs using a batch size of 16 and an initial learning rate of 1 × 10^−5. The learning rate was adaptively adjusted using the ReduceLROnPlateau function in TensorFlow. These hyper-parameters were manually selected for the best performance.

2.1.4. Evaluation

Upon completion of the training process, we first evaluated classification performance before moving on to robustness evaluation, to ensure that the robustness study was conducted for DNN models with reasonable performance. For this purpose, we used common classification evaluation metrics including the area under the curve (AUC), accuracy, sensitivity, and specificity. AUC was obtained by computing the area under the receiver operating characteristic (ROC) curve. To benchmark the effectiveness of our DNN models and the training process, we compared the performance of the trained models with Multi-Crop Net (Shen et al., 2017), a recently developed DL-based model for lung nodule classification that achieved state-of-the-art performance. Note that this comparison of classification performance was not the main focus of this study. We report it to illustrate that our models achieved good classification performance, laying the groundwork for the subsequent robustness study.
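For reference, the four metrics named above can be computed from labels and continuous model outputs as in the following NumPy sketch. The ±1 label encoding, the zero threshold, and the toy data are assumptions made for illustration; the AUC is computed with the standard pairwise-ranking formulation, which equals the area under the ROC curve.

```python
import numpy as np

def classification_metrics(y_true, scores, threshold=0.0):
    """Accuracy, sensitivity, specificity, and AUC for labels in {+1, -1}."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pred = np.where(scores > threshold, 1, -1)
    tp = np.sum((pred == 1) & (y_true == 1))
    tn = np.sum((pred == -1) & (y_true == -1))
    fp = np.sum((pred == 1) & (y_true == -1))
    fn = np.sum((pred == -1) & (y_true == 1))
    # AUC as the probability that a random positive scores above a random
    # negative (ties count one half).
    pos, neg = scores[y_true == 1], scores[y_true == -1]
    diff = pos[:, None] - neg[None, :]
    auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)
    return {
        "accuracy": float((tp + tn) / y_true.size),
        "sensitivity": float(tp / (tp + fn)),
        "specificity": float(tn / (tn + fp)),
        "auc": float(auc),
    }

# Toy data, for illustration only.
y = [1, 1, 1, -1, -1, -1]
s = [0.9, 0.4, -0.2, 0.3, -0.6, -0.8]
m = classification_metrics(y, s)
```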

Once the models' performance was confirmed, we evaluated their robustness with respect to randomly sampled perturbations added to the input CT image cubes in the training dataset. For each model, let y_i denote the model's predicted label for an unperturbed input CT image cube x_i. For each x_i, we generated 100 noise signal realizations at 100 mAs, a commonly used mAs level for CT scans in real clinical settings. The noise signals were sampled using a previously developed method (Shen et al., 2020c) based on noise amplitude and a realistic noise power spectrum. We added each noise signal to the CT image cube x_i and fed the perturbed image into the DNN to compute the new output label, denoted as y_i′. Such a study is usually called attacking the DNN using the generated noise signal. A reliable classification model is expected to be robust against the attacks, as noise in clinical CT scans is unavoidable and the model should be resilient against it. An attack is called successful if y_i′ ≠ y_i.
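The sampling method of Shen et al. (2020c) is not reproduced here. As a generic illustration of how noise with a prescribed power spectrum can be drawn, the sketch below filters white Gaussian noise in the Fourier domain with the square root of a noise power spectrum (NPS) and rescales it to a target amplitude. The NPS shape, the amplitude `sigma_hu`, and all function names are assumptions for illustration, not the measured 100 mAs quantities.

```python
import numpy as np

def sample_nps_noise(shape, nps, sigma, rng):
    """Draw one noise realization whose spectrum follows `nps` and whose std is `sigma`."""
    white = rng.standard_normal(shape)
    # Shape the white noise in the Fourier domain with the square root of the NPS.
    shaped = np.fft.ifftn(np.fft.fftn(white) * np.sqrt(nps)).real
    return shaped * (sigma / shaped.std())

shape = (32, 32, 32)
# Hypothetical radially symmetric NPS (not the measured CT spectrum of the cited work).
freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing="ij")
radius = np.sqrt(sum(f ** 2 for f in freqs))
nps = radius * np.exp(-(radius / 0.25) ** 2)

rng = np.random.default_rng(0)
sigma_hu = 20.0  # hypothetical noise amplitude; the 100 mAs value is not given here
noise = [sample_nps_noise(shape, nps, sigma_hu, rng) for _ in range(100)]
```

Each element of `noise` would then be added to an image cube x_i before it is passed through the network, giving the 100 attacks per sample described above.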

To quantitatively measure robustness, we defined a metric called the Successful Attack Rate (SAR), which measures the vulnerability of a CT image cube for the trained DNN. For x_i, SAR was computed as:

$$\mathrm{SAR} = \frac{n^{i}_{\mathrm{succ}}}{N_{\mathrm{attack}}}\ (\%), \qquad (2)$$

where n^i_succ represents the number of successful attacks for the CT image cube x_i and N_attack = 100 is the total number of attacks. A higher SAR value indicates a higher vulnerability of the sample. With SARs computed for all samples, we first plotted the SAR-Sample Number Histogram (SSNH). SSNH(x) is the percentage of samples that have SAR higher than x. With this definition, SSNH(x) is a monotonically decaying function of x, and the lower the curve is, the more robust the DNN is. To use a single number to represent the robustness of the DNN, we reported SAR5, defined by SSNH(SAR5) = 5%. In other words, 5% of samples have SAR values larger than SAR5.
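The definitions of SAR, SSNH, and SAR5 translate directly into code. The sketch below is one possible implementation under stated assumptions: because SAR takes discrete values, SSNH(x) = 5% may not be attained exactly, so `sar5` returns the smallest observed SAR value at which SSNH drops to 5% or below. The function names and the toy label arrays are illustrative.

```python
import numpy as np

def sar_per_sample(clean_labels, perturbed_labels):
    """SAR (%) per sample: share of attacks whose label differs from the clean label."""
    clean = np.asarray(clean_labels)[:, None]       # shape (n_samples, 1)
    perturbed = np.asarray(perturbed_labels)         # shape (n_samples, n_attack)
    return 100.0 * np.sum(perturbed != clean, axis=1) / perturbed.shape[1]

def ssnh(sar, x):
    """Percentage of samples whose SAR is strictly higher than x."""
    sar = np.asarray(sar)
    return 100.0 * np.sum(sar > x) / sar.size

def sar5(sar, level=5.0):
    """Smallest observed SAR value x for which SSNH(x) <= level (%)."""
    for x in np.unique(sar):                          # unique values, sorted ascending
        if ssnh(sar, x) <= level:
            return float(x)

# Toy example: 20 samples, 100 attacks each; two samples are vulnerable.
clean = np.ones(20, dtype=int)
perturbed = np.ones((20, 100), dtype=int)
perturbed[0, :60] = -1     # 60 successful attacks on sample 0
perturbed[1, :30] = -1     # 30 successful attacks on sample 1
sar = sar_per_sample(clean, perturbed)
value = sar5(sar)
```

In this toy case one of the 20 samples (5%) has a SAR above 30%, so SAR5 is 30%.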

2.2. Robustness viewed from the angle of two-step data processing

To obtain insights on why the models are vulnerable to noise perturbations, we selected a representative model established in the fifth fold of the scenario with k = 1. Our idea is illustrated in Fig. 3.

Figure 3.

Illustration of workflow to analyze the robustness issue. In the middle figure, a 3-dimensional view is drawn for illustration purpose, while the actual dimension is 32.

The DNN processes information from the input CT image cube x to a decision label y. We viewed this data processing as a two-step process. The first step used all the convolutional blocks and all but the last fully-connected block. It extracted a low-dimensional representation of the input CT image cube x ∈ R^(32×32×32) in a latent representation space, which contained key discriminating information for subsequent label determination. In the network we analyzed, the dimension of this representation space was 32, and we denote the representation of the input CT image as a vector s ∈ R^32. In the second step, the operation of the last fully-connected layer was y = T(W^T s + b) ∈ {−1, 1}, where W is a vector formed by the network weights in the last layer, b is the bias of this layer, and T(.) is the sigmoid function. The weights W and bias b defined a linear decision boundary P : W^T s + b = 0 in the latent space, which explicitly described how the representations of input CT image cubes of benign and malignant nodules were separated. For the representation s of an input CT image cube x, the value W^T s + b > 0 or < 0 determines the output label y = 1 or −1, respectively. Hence, studying the representation vectors of CT image cubes in this latent representation space relative to the decision plane P, with and without added noise perturbations, could provide insights into how the DNN's decision was affected by the noise perturbations.

As such, we first computed the Euclidean distance of the representation s of a CT image cube to the decision boundary P. We expect that the CT image cubes with representations close to the boundary may be easily perturbed by noise, moving them to cross the boundary P, hence changing the output labels and yielding the observed unrobustness behavior.
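For a plane P : W^T s + b = 0, the Euclidean distance of a point s to P is |W^T s + b| / ||W||. A minimal NumPy sketch follows; the function name and the two-dimensional example values are illustrative assumptions, whereas the study's latent space has 32 dimensions.

```python
import numpy as np

def distance_to_boundary(s, W, b):
    """Euclidean distance from each representation s to the plane W^T s + b = 0."""
    s = np.atleast_2d(s)
    return np.abs(s @ W + b) / np.linalg.norm(W)

# Hypothetical 2-D example: the plane is s_1 + s_2 - 1 = 0.
W = np.array([1.0, 1.0])
b = -1.0
d = distance_to_boundary(np.array([[1.0, 1.0], [0.5, 0.5]]), W, b)
```

The first point lies at distance 1/√2 from the plane and the second lies on it, so the second would be the more easily perturbed across the boundary.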

We further visualized the geometric distributions of the representations of CT image cubes, with and without added noise perturbations, relative to the decision boundary. As visualizing the 32-dimensional representation space directly is not possible, we conducted a dimension-reduction operation to reduce the number of dimensions to two. For this purpose, two basis vectors in the R^32 space must be selected. The representations of CT image cubes, as well as the decision boundary P, are projected onto the linear space spanned by these two basis vectors to generate visualizations in this 2-dimensional space. In this study, as the focus was to understand how perturbations affect the decision-making of a DNN model, we selected two basis vectors in the R^32 representation space that were generally along the directions perturbing the two classes of samples. Specifically, for each sample in the malignant class, we first performed principal component analysis (PCA) of all the N_attack = 100 noise perturbations and identified the first principal component, which represents the major direction of perturbations to this sample. We then performed PCA one more time over the set of first principal components of all malignant samples, and selected the resulting first principal component as the major direction perturbing samples in this class. The result is denoted as ê_1. Similarly, we obtained ê_2 for the benign class. With these two basis vectors selected, each sample in the representation space with a coordinate s was projected to the 2-dimensional visualization space with a coordinate z = [ê_1, ê_2]^T s. We also computed the projection of the decision boundary P onto this 2-dimensional space to visualize the positions of samples relative to the decision boundary.
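The two-stage PCA used to pick the visualization basis can be sketched with NumPy's SVD. Several details are assumptions made for this illustration: the perturbations are taken to be displacements of the representation in the latent space, the second stage is left uncentered because the sign of each principal component is arbitrary, and the synthetic data and function names are hypothetical.

```python
import numpy as np

def first_direction(X, center=True):
    """Unit vector along the dominant direction of the rows of X (top right singular vector)."""
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=0)
    return np.linalg.svd(X, full_matrices=False)[2][0]

def class_direction(perturbations_per_sample):
    """Stage 1: first PC per sample. Stage 2: dominant direction of those PCs."""
    pcs = np.stack([first_direction(P) for P in perturbations_per_sample])
    # Assumption: stage 2 is uncentered, since each PC is only defined up to sign.
    return first_direction(pcs, center=False)

rng = np.random.default_rng(0)
dim, n_samples, n_attack = 32, 10, 100
u1, u2 = np.eye(dim)[0], np.eye(dim)[1]   # hypothetical true perturbation directions

def fake_perturbations(u):
    """Synthetic latent-space displacements, mostly along u (illustration only)."""
    return [3.0 * rng.standard_normal((n_attack, 1)) * u
            + 0.1 * rng.standard_normal((n_attack, dim)) for _ in range(n_samples)]

e1 = class_direction(fake_perturbations(u1))   # stands in for the malignant class
e2 = class_direction(fake_perturbations(u2))   # stands in for the benign class
E = np.stack([e1, e2], axis=1)                 # columns are the basis vectors, shape (32, 2)
s = rng.standard_normal(dim)                   # one representation vector
z = E.T @ s                                    # 2-D coordinate z = [e1, e2]^T s
```

Note that ê_1 and ê_2 obtained this way are unit vectors but are not guaranteed to be orthogonal, so the resulting 2-dimensional view is a linear projection rather than an orthonormal change of basis.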

2.3. DNN-SVM model to improve robustness

Within the picture of this two-step data processing, we will show in Section 3.2 that one reason for the model not being robust is that the decision boundary plane P determined by the model training process was placed close to one of the two classes. Hence, small perturbations may move a sample representation across the boundary to the incorrect side. This occurred because the DNN training objective specified in the minimization problem of Eq. (1) only enforced agreement between the predicted labels and the ground truth labels, and did not specify any preference about the position of the decision boundary.

Hence, one approach that may contribute to improving robustness of the DNN models is to place the decision boundary relatively far away from both classes, while maintaining the classification performance. In fact, in machine learning, it is well known that a decision plane away from both classes, while still being effective at separating them, is desired (Duda et al., 2006). This may be achieved by many classification approaches, such as the Support Vector Machine (SVM), an algorithm that maximizes the margin between the decision hyper-plane and the samples (Boser et al., 1992; Cortes and Vapnik, 1995). Hence, in this study, we proposed a DNN-SVM model. Specifically, after the DNN training step presented in Section 2.1, we froze the resulting network in all but the last fully-connected layer, and retrained the last layer using a hinge loss. The optimization problem for the retraining of the DNN-SVM models is:

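$$\{W, b\} = \arg\min_{W,b}\ \frac{1}{2} W^{T} W + C \sum_{i=1}^{N_{\mathrm{train}}} \max\left(0,\ 1 - y_i \left(W^{T} s_i + b\right)\right), \qquad (3)$$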

where s_i ∈ R^32 is the representation of the i-th input CT image cube. Since this retraining process only updated the last layer of the DNN, s_i was known: it was the output of the second-to-last layer of the trained DNN. y_i is the i-th sample's corresponding reference label. C is a regularization parameter, which controls the strength of the penalty. The optimal C value was determined in the DNN-SVM training process.
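To make the effect of the objective in Eq. (3) concrete, the sketch below evaluates it for two candidate decision boundaries in a hypothetical one-dimensional latent space. Both boundaries classify all four toy samples correctly, but the one sitting close to the −1 class incurs a hinge penalty and needs a larger weight, so its objective value is higher. All numbers and names are illustrative assumptions.

```python
import numpy as np

def svm_objective(W, b, S, y, C):
    """Value of Eq. (3): 0.5 * W^T W + C * sum_i max(0, 1 - y_i (W^T s_i + b))."""
    margins = y * (S @ W + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * float(W @ W) + C * float(hinge.sum())

# Hypothetical 1-D latent space with two samples per class.
S = np.array([[2.0], [1.5], [-1.0], [-3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Boundary at s = -0.9, close to the -1 class.
near = svm_objective(np.array([2.0]), 1.8, S, y, C=1.0)
# Boundary at s = 0.25, midway between the closest samples of the two classes.
mid = svm_objective(np.array([0.8]), -0.2, S, y, C=1.0)
```

In this toy setting the midway boundary gives the lower objective value, which is the behavior the retraining step relies on.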

To train the last-layer SVM classifier, the machine learning module 'scikit-learn' was used in Python 3.7. Its built-in function GridSearchCV was employed to find the best hyperparameters and avoid overfitting in 5-fold CV. After the retraining process was completed, following the same procedure used in developing the original DNN model, we first evaluated model classification performance in a 5-fold CV study to confirm that introducing SVM to the model did not impede classification performance, as evaluated using the metrics of accuracy, AUC, sensitivity, and specificity. We then evaluated robustness of the DNN-SVM model to demonstrate the improvement in SAR and SAN.
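A minimal sketch of how such a retraining step might look with scikit-learn is given below. It is not the authors' script: the latent representations are replaced by synthetic Gaussian data, and the estimator choice (`SVC` with a linear kernel, whose primal objective has the form of Eq. (3)) and the grid of C values are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for the 32-D latent representations s_i and labels y_i (+1 / -1).
S = np.vstack([rng.normal(+1.0, 1.0, size=(100, 32)),
               rng.normal(-1.0, 1.0, size=(100, 32))])
y = np.array([1] * 100 + [-1] * 100)

# A linear-kernel SVC minimizes 0.5 * ||W||^2 + C * (sum of hinge losses).
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
                      cv=5)
search.fit(S, y)

W = search.best_estimator_.coef_.ravel()          # weights for the retrained last layer
b = float(search.best_estimator_.intercept_[0])   # bias for the retrained last layer
train_accuracy = search.best_estimator_.score(S, y)
```

The fitted `W` and `b` would then replace the weights and bias of the last fully-connected layer, while all earlier layers stay frozen.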

3. Results

3.1. DNN model’s performance

Fig. 4 depicts accuracy, sensitivity, specificity, and AUC of the DNN models established in different scenarios for training, validation, and independent testing. The results are the average performance of the models and uncertainties estimated over the 5-fold CV study. As we can observe from this figure, the networks' classification performance was comparable to that of the Multi-Crop Net in most cases.

Figure 4.

Training, validation, and testing classification performance in 5-fold CV for different scenarios. Error bars indicate uncertainty estimated via the CV study. The dashed lines indicate the testing performance of the baseline algorithm Multi-Crop Net (Shen et al., 2017).

After confirming the classification performance, we proceeded to investigate the models' robustness against randomly sampled noise perturbations at 100 mAs. In Fig. 5, we plot the SSNH for different scenarios. There was no obvious dependence of SSNH on the network widths. The SSNH curves initially decayed quickly as SAR increased. After SAR ~ 10%, the decay became much slower. This trend indicated that a small fraction of samples had relatively large SAR values. Quantitatively, SAR5 ranged from 47.0% to 62.0% across the networks, indicating that for 5% of samples, noise attacks can change DNN prediction results with a chance of 47.0 ~ 62.0% or higher. The results suggested that the issue of robustness against noise perturbations at 100 mAs may be prevalent across all the networks studied here.

Figure 5.

SSNH plot of the four trained networks. The horizontal dashed line indicates SAR5.

3.2. Robustness viewed from the angle of two-step data processing

Using the model trained in the fifth fold of the k = 1 scenario as a representative, we investigated the information processing pipeline of the DNN. In Fig. 6(a), we first plot the histogram of samples as a function of SAR and distance to the decision boundary P. The majority of samples were found at SAR close to zero, indicating that these samples were relatively robust. However, there existed samples with high SAR values and small distances to the decision boundary. This behavior is understandable. For samples far away from the decision boundary, the large distances to the boundary provided room to accommodate the perturbations induced by noise in the input CT image cubes. With noise added, the perturbed samples still resided on the same side of the decision boundary as the unperturbed samples, keeping the model output unchanged. These samples were hence robust with respect to the noise. In contrast, samples close to the decision boundary may be moved across the boundary relatively easily under the noise perturbation, which makes them not robust.

Figure 6.

(a) Histogram of samples as a function of SAR and the distance to DNN decision boundary. Color bar represents the counts of samples in each bin. Arrow 1 points at robust samples, and arrow 2 points at some samples not robust and close to the decision boundary. (b) Visualization of sample distribution and the decision boundary generated by the DNN model. Red dots: predicted malignant samples by the DNN model. Blue dots: predicted benign samples. Darkness of the color of each sample indicates its SAR. (c) Zoomed-in plot for the region highlighted by the rectangle in (b).

Fig. 6(b-c) allows us to visualize the distribution of samples, as well as their locations relative to the decision boundary in the 2-dimensional visualization space. Note that in this figure, the colors of samples correspond to labels predicted by the DNN, not the ground truth labels. Hence, samples with different colors appeared on opposite sides of the decision plane P. We present labels predicted by the DNN because, when studying robustness, we focus on the changes of predicted labels caused by noise perturbations. In Fig. 6(b-c), we also used darkness of the color to indicate the SAR of each sample. Similar to the trend observed in Fig. 6(a), samples close to the decision boundary were relatively less robust. The majority of these samples were from the malignant class, because the decision boundary was positioned close to this class of points.

3.3. DNN-SVM model to improve robustness

Based on the observation, it is expected that having the decision boundary away from samples could give room to accommodate noise perturbations and hence improve model robustness. However, when training the DNN, such a desired property was not explicitly required in the optimization model of Eq. (1). Since there was a large gap between the two classes of samples in the representation space, the training algorithm had a lot of freedom to place the decision boundary without sacrificing the training objective in Eq. (1), i.e. classification accuracy. The training process ended up with a particular decision boundary P that was close to the malignant samples, as shown in Fig. 6(b-c). Hence, introducing a specific objective in the training process that attempts to place the decision boundary away from the data samples could improve robustness. This motivated us to develop the DNN-SVM model.

After training the DNN-SVM model with Eq. (3), we first examined the performance of the model and compared it with that of the DNN model to ensure that introducing SVM to the model did not sacrifice classification accuracy. The results are summarized in Table 1. The classification performance of the DNN and DNN-SVM models was similar. On average over the four scenarios, the differences in accuracy, AUC, sensitivity, and specificity were 0.007, 0.049, 0.007, and 0.011, respectively.

Table 1.

Accuracy, AUC, sensitivity, specificity, and SAR for DNN models and DNN-SVM models in four scenarios. Each number is the averaged result over five-fold CV.

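| Metric | Model | k = 1 | k = 2 | k = 3 | k = 4 |
|---|---|---|---|---|---|
| Accuracy | DNN | 0.842 | 0.84 | 0.847 | 0.852 |
| Accuracy | DNN-SVM | 0.84 | 0.845 | 0.838 | 0.840 |
| AUC | DNN | 0.898 | 0.916 | 0.909 | 0.891 |
| AUC | DNN-SVM | 0.845 | 0.86 | 0.851 | 0.862 |
| Sensitivity | DNN | 0.888 | 0.896 | 0.876 | 0.874 |
| Sensitivity | DNN-SVM | 0.881 | 0.886 | 0.874 | 0.864 |
| Specificity | DNN | 0.848 | 0.842 | 0.864 | 0.844 |
| Specificity | DNN-SVM | 0.856 | 0.841 | 0.874 | 0.852 |
| SAR5 (%) | DNN | 47.0 | 48.0 | 50.3 | 62.0 |
| SAR5 (%) | DNN-SVM | 39.2 | 39.3 | 45.9 | 49.0 |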

We then investigated each model's robustness against randomly sampled noise perturbations. SAR5 values are summarized in Table 1. Similar to the DNN models, no obvious dependence of SAR5 on network size was observed for the DNN-SVM architecture. DNN-SVM models outperformed DNN models in robustness in all four scenarios, as demonstrated by the reduction of SAR5. The relative reductions in SAR5 in these four scenarios were 16.6%, 18.1%, 8.8%, and 21.0%, respectively.
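The relative reductions quoted above follow from the SAR5 rows of Table 1 as (DNN − DNN-SVM) / DNN. The short sketch below recomputes them from the tabulated values; because the table entries are rounded to one decimal place, the recomputed numbers can differ from the reported ones by about 0.1 percentage point.

```python
# SAR5 (%) values copied from Table 1 for k = 1, ..., 4.
sar5_dnn = [47.0, 48.0, 50.3, 62.0]
sar5_dnn_svm = [39.2, 39.3, 45.9, 49.0]

# Relative reduction (%) and absolute reduction (percentage points) per scenario.
relative_reduction = [100.0 * (d - s) / d for d, s in zip(sar5_dnn, sar5_dnn_svm)]
absolute_reduction = [d - s for d, s in zip(sar5_dnn, sar5_dnn_svm)]
```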

In Fig. 7(a), we plot the histogram of samples as a function of the SAR of the DNN-SVM model and the distance to the decision boundary generated by this model. Compared with Fig. 6(a), the counts in the bins with SAR close to unity, as indicated by Arrow 2, were significantly reduced, indicating that the DNN-SVM model improved robustness for those samples close to the decision boundary.

Figure 7.

(a) Histogram of samples as a function of SAR of the DNN-SVM model and the distance to DNN-SVM decision boundary. Color bar represents the counts of samples in each bin. Arrow 1 points at robust samples, and arrow 2 points at some samples not robust and close to the decision boundary. (b) Visualization of sample distribution and the decision boundary generated by the DNN and DNN-SVM models. Red dots: predicted malignant samples by the DNN-SVM model. Blue dots: predicted benign samples. Darkness of the color of each sample indicates SAR of the sample computed with respect to the DNN-SVM decision boundary. (c) Zoomed-in plot for the region highlighted by the rectangle in (b). Red and blue arrows highlight two areas of SAR changes between DNN and DNN-SVM models.

Fig. 7(b-c) allows us to visualize the distribution of samples as well as the corresponding DNN-SVM decision boundary in the 2-dimensional space. We also plotted the DNN decision boundary for comparison. The SAR value indicated by the color of each sample was computed with respect to the new decision boundary. Comparing Fig. 7(b-c) with Fig. 6(b-c), the DNN-SVM model successfully moved the decision boundary to the middle region between the two classes, as required by the optimization model in Eq. (3). This change did not compromise classification accuracy significantly. Moving the boundary away from the malignant group improved robustness of these samples, as indicated by the reduction of SAR values for the samples indicated by the red arrow in Fig. 7(c). The new decision boundary became closer to the benign group, making samples in this group slightly less robust, as seen for the samples indicated by the blue arrow. Overall, however, the model robustness considering both groups was improved.

The improvement in robustness can be further illustrated in Fig. 8, which presents the distribution of the original positions of samples, as well as their corresponding perturbed positions. To simplify the illustration, only one perturbed position per sample is displayed. Note that the perturbations to samples are independent of the decision boundaries. With the DNN decision boundary, the lack of robustness was clearly ascribed to those red perturbed sample positions that crossed the decision boundary. Improved robustness was achieved by moving the decision boundary so that many perturbed samples remained on the same side of the decision boundary as the corresponding unperturbed samples.

Figure 8.


Distribution of samples and their corresponding perturbed samples. The red arrow points to samples that are perturbed by noise by a relatively large distance.
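The boundary-crossing picture in Figs. 7 and 8 can be sketched numerically. The snippet below is only an illustration under stated assumptions: it takes the SAR of a sample to be the fraction of noise realizations that flip its predicted label, and SAR5 to be the level reached or exceeded by 5% of samples, which is our reading of the definition in the abstract. All function names and the toy 2-dimensional data are hypothetical.

```python
import numpy as np

def signed_distance(s, w, b):
    """Signed distance of latent vectors s (..., d) to the plane w.s + b = 0."""
    return (s @ w + b) / np.linalg.norm(w)

def sar_per_sample(s_clean, s_noisy, w, b):
    """Fraction of noise realizations that flip each sample's predicted label.

    s_clean: (n, d) representations of the clean inputs
    s_noisy: (n, k, d) representations of k noisy copies of each input
    """
    clean_side = np.sign(signed_distance(s_clean, w, b))   # shape (n,)
    noisy_side = np.sign(signed_distance(s_noisy, w, b))   # shape (n, k)
    return np.mean(noisy_side != clean_side[:, None], axis=1)

def sar_at_percent(sar, percent=5.0):
    """SAR level reached or exceeded by the top `percent` of samples (assumed definition)."""
    return np.percentile(sar, 100.0 - percent)

# Toy latent space: sample 0 sits near the boundary, sample 1 far from it.
w = np.array([1.0, 0.0])
s_clean = np.array([[0.1, 0.0], [2.0, 0.0]])
s_noisy = np.array([
    [[-0.1, 0.0], [0.3, 0.0], [-0.2, 0.0], [0.2, 0.0]],
    [[1.9, 0.0], [2.2, 0.0], [1.8, 0.0], [2.1, 0.0]],
])
sar_before = sar_per_sample(s_clean, s_noisy, w, b=0.0)  # boundary at x = 0
sar_after = sar_per_sample(s_clean, s_noisy, w, b=0.3)   # boundary moved to x = -0.3
print(sar_before, sar_after)
```

The perturbed positions are identical in both calls; only the offset of the plane changes, which mirrors the observation that perturbations are independent of the decision boundary.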

4. Conclusion and Discussions

Robustness is an important aspect when evaluating methods of medical image analysis. In this paper, we conducted studies on a class of four DNNs with different widths for a lung nodule classification problem based on CT images. These networks were trained to achieve an AUC of 0.891-0.916 in a 5-fold cross-validation study, computed on the test dataset. We then added to the input CT images noise signals generated randomly using a realistic noise model based on a noise power spectrum at 100 mAs, and monitored the DNN’s output. The results showed that the CT noise was able to affect the prediction results of the established DNN models. SAR5 ranged from 47.0% to 62.0% in different scenarios. In an effort to understand the unrobust behavior of DNNs, our analysis viewed the information processing pipeline of the DNN as a two-step process. In the first step, the layers preceding the last layer were trained to generate a vector representation from the input CT image cube by using convolutional layers followed by a series of fully-connected layers. The end product of these operations was a vector representation s ∈ R^32 of an input CT image cube x. In the second step, the last layer used a linear operation to determine the position of the sample representation s relative to a planar decision boundary P, outputting the final classification label y for the input image x. Based on this interpretation, the fact that noise can perturb the representation in the latent space to the incorrect side of the decision boundary is one reason why the DNN is not robust against noise perturbations. This can be ascribed to the DNN training model in Eq. (1), which only enforced classification accuracy but did not have any preference on the position of the decision boundary. In light of this fact, we proposed a DNN-SVM model to retrain the DNN model by modifying its last layer and using an SVM approach to ensure a favorable position of the decision boundary. This approach enlarged the distance of the decision boundary to samples, making samples more tolerant of perturbations. The DNN-SVM model achieved a similar level of classification performance to that of the DNN model, but substantially improved robustness, as indicated by the averaged 16.1% reduction of SAR5 in the four networks studied.
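The retraining step summarized above can be sketched as follows. This is a minimal illustration, assuming a standard L2-regularized soft-margin hinge loss minimized by subgradient descent on frozen latent representations; the exact objective in Eq. (3), the optimizer and the hyperparameters used in the study may differ. The function name, the parameters lam, lr and epochs, and the toy data are all hypothetical.

```python
import numpy as np

def retrain_last_layer_hinge(S, y, lam=1e-2, lr=0.1, epochs=500):
    """Fit a linear last layer (w, b) by subgradient descent on
    lam/2 * ||w||^2 + mean(max(0, 1 - y * (S @ w + b))).

    S: (n, d) frozen latent representations (output of all but the last layer)
    y: (n,) labels encoded as -1 (benign) or +1 (malignant)
    """
    n, d = S.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (S @ w + b)
        active = margins < 1.0  # samples inside the margin or misclassified
        grad_w = lam * w - (y[active][:, None] * S[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy stand-in for the latent space: two well-separated clusters in 2-D.
rng = np.random.default_rng(0)
S = np.vstack([
    rng.normal([-2.0, 0.0], 0.3, size=(50, 2)),
    rng.normal([2.0, 0.0], 0.3, size=(50, 2)),
])
y = np.concatenate([-np.ones(50), np.ones(50)])

w, b = retrain_last_layer_hinge(S, y)
accuracy = np.mean(np.sign(S @ w + b) == y)
print("accuracy:", accuracy, "margin half-width:", 1.0 / np.linalg.norm(w))
```

Because only w and b are updated, the feature extraction part stays untouched, which is the property that makes the effect of the hinge loss on the decision plane easy to isolate.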

The DNN-SVM model only partially addressed the unrobustness problem. For this method to work, one assumption was that the noise added to the input CT image cube perturbs the sample representation in the latent space only locally. However, this assumption may hold only partially. Although the amplitudes of the noise signals added to the original CT images were always small, as controlled by the specific mAs level, the variations of the representation vector may not necessarily be small, due to the highly complex mapping function from the input image to the latent representation. As indicated by the arrow in Fig. 8, there exist samples whose representation vector may be displaced by a large distance by noise perturbations to the input CT image. This phenomenon occurred because the training optimization problem in Eq. (1) did not require that a small noise perturbation translate to a small movement of the representation vector. It is expected that adding a regularization term to the objective function to ensure a smooth mapping from the original CT image space to the latent representation space would help ensure local movements of the representation vectors under noise perturbations, which would therefore be beneficial for model robustness. This approach will be explored in our future work.
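As an illustration of the locality assumption discussed above, the displacement of the latent representation under noise can be measured directly. The sketch below is hypothetical: feature_fn stands for the feature-extraction part of a trained network, and the mean squared displacement is shown only as one possible form of a smoothness penalty, not as the regularization term the authors intend to use.

```python
import numpy as np

def latent_displacement(feature_fn, x, noise):
    """Return ||f(x + n) - f(x)|| for every noise realization n.

    feature_fn: hypothetical callable mapping an image array to a latent vector
    x: clean image array
    noise: array of shape (k, *x.shape) holding k noise realizations
    """
    s0 = feature_fn(x)
    return np.array([np.linalg.norm(feature_fn(x + n) - s0) for n in noise])

def smoothness_penalty(feature_fn, x, noise):
    """Mean squared latent displacement; one candidate regularizer (an assumption)."""
    return float(np.mean(latent_displacement(feature_fn, x, noise) ** 2))

# Toy check with a linear feature map f(x) = 2 * vec(x).
toy_feature_fn = lambda img: 2.0 * img.ravel()
x = np.zeros((2, 2))
noise = np.full((1, 2, 2), 0.5)
print(latent_displacement(toy_feature_fn, x, noise), smoothness_penalty(toy_feature_fn, x, noise))
```

For a real network, unusually large displacements relative to the noise amplitude would flag samples such as those indicated by the arrow in Fig. 8.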

Models combining DNN and SVM have been previously studied (Zhang et al., 2015; Gong et al., 2020). Previous studies mainly focused on the evaluation of classification performance. In this study, we proposed the use of DNN-SVM to address the robustness problem of a DNN model. The inclusion of SVM in the model was motivated by the interpretation of the two-step data processing pipeline of the DNN. We employed the SVM approach as a retraining step to update the last layer of the DNN model, whereas the previous layers trained by the DNN were frozen. This choice was made because it allowed us to isolate the effect of the SVM layer from the previous layers and focus on the effect of SVM in terms of improving robustness. Specifically, by retraining the last layer we could keep the feature extraction part (all but the last layer) unchanged, while only the linear classification part was modified in the retraining process. Analyzing the linear classification behavior at the last layer allowed us to understand some of the reasons for the observed unrobust behavior. In contrast, retraining the DNN-SVM model in an end-to-end fashion would modify both the feature extraction and classification parts, making it difficult to interpret the results.

The current study only evaluated robustness of the DNN models on the training dataset, but not on the testing dataset. Hence, the generalizability of the observed robustness issue could be questioned. Our previous study (Shen et al., 2020c) found, based on empirical observations, that robustness is generalizable from training to testing. The current study provided some insights about this fact. Based on the discussion above, the robustness issue was partially ascribed to the fact that the DNN training placed a decision boundary close to the samples in the representation space and that the noise perturbation may move the samples to the incorrect side. These facts hold regardless of whether a sample belongs to the training or the testing set. Meanwhile, testing samples are expected to follow a similar distribution to the training samples, as guaranteed by the generalizability of the trained model in terms of classification performance. Hence, even for testing samples not observed by the training process, since they follow the same distribution as the training samples, they would still be affected by the issues causing the robustness problem and would benefit from the DNN-SVM model to improve robustness.

We simply used randomly generated noise perturbations in this study to evaluate robustness of the DNNs. In our previous study (Shen et al., 2020c), purposely generated perturbations were obtained by solving an optimization problem to find those perturbations that can alter the DNN’s output for a given sample. This strategy allows finding perturbations causing the robustness concern more effectively, because random sampling in the high-dimensional space of CT image cubes is very inefficient. Hence, the SAR values reported in this study should be regarded as close to the lower end of their possible range.

The current study has several limitations. First, it only evaluated robustness of a class of DNNs with a linear structure defined in Fig. 1. Due to theoretical challenges in analyzing a DNN’s mathematical properties, it is difficult to generalize the observed behaviors to other types of networks. Hence, we stress that the conclusion here is limited to the setups investigated in this study. However, we do want to raise awareness of the robustness concern and suggest that studies developing new DNN models include robustness as one of the aspects to evaluate. Over the years, numerous other network structures and training techniques have been developed for classification problems. For example, a Residual Long Short-Term Memory layered CNN has been shown to achieve better classification performance than the SVM method (Öztürk and Özkaya, 2021). Additionally, as this study only focused on classification problems, we selected a typical network scheme for classification that employs a few layers of convolution operations for feature extraction, followed by fully connected layers to decide the output. Yet, robustness concerns also exist in DL-based non-classification problems, e.g. image processing (Antun et al., 2020). Evaluating robustness of networks for these problems, e.g. 3D U-Net and ResNet, will be necessary. Evaluating robustness of networks with other structures and training techniques is our on-going study.

Second, we only focused on the noise level corresponding to 100 mAs. Our previous work (Shen et al., 2020c) showed that the severity of the robustness problem monotonically decreases with noise amplitude. As the noise amplitude is one critical factor affecting model robustness and is related to the clinical consideration of radiation exposure, it is necessary to study the models’ behavior under other mAs levels in future studies. Additionally, due to computational challenges, we simply generated noise signals via a model and added them to the CT image cubes, instead of simulating the noise generation via a realistic image formation process, e.g. simulating the x-ray projection and image reconstruction process. Our noise model was based on a given noise power spectrum and hence ignored the spatial variance of noise properties in a CT image. Last, the labels of the nodules used in this study were given by radiologists, and were not based on pathological assessments. Strictly speaking, the validity of using them as ground truth for the purpose of developing a classification model may be questioned. Nonetheless, as the major goal of this work is to evaluate robustness of trained DNNs, we expect that our study is valid as long as the classification model is developed with a reasonable performance.
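To make the noise-model limitation concrete, the sketch below shows one common way to draw noise from a given noise power spectrum (NPS): white Gaussian noise is shaped in the Fourier domain by the square root of the NPS. This is an assumption about the general technique, not a description of the exact procedure used in this study. The radial spectrum toy_radial_nps is a hypothetical placeholder rather than the measured 100 mAs NPS, and the amplitude normalization is left arbitrary.

```python
import numpy as np

def noise_from_nps(nps, rng):
    """Draw one noise realization whose power spectrum follows `nps`.

    nps: non-negative array sampled on the FFT frequency grid of the image
    rng: numpy random Generator
    """
    white = rng.standard_normal(nps.shape)
    shaped = np.fft.fftn(white) * np.sqrt(nps)
    # For an NPS that is symmetric in frequency the result is real up to round-off.
    return np.real(np.fft.ifftn(shaped))

def toy_radial_nps(shape, f0=0.25):
    """Hypothetical ramp-then-roll-off spectrum (NOT the measured CT NPS)."""
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing="ij")
    f = np.sqrt(sum(g ** 2 for g in grids))
    return f * np.exp(-(f / f0) ** 2)

rng = np.random.default_rng(0)
nps = toy_radial_nps((32, 32, 32))
noise = noise_from_nps(nps, rng)
clean_cube = np.zeros((32, 32, 32))   # placeholder for a CT image cube
noisy_cube = clean_cube + noise
```

Noise generated this way has the same statistics at every voxel, which is exactly the spatially invariant behavior noted above as a limitation compared with simulating projection and reconstruction.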

Figure 2.


Illustration of robustness analysis. Realistic noise signals at 100 mAs were added to a CT image cube. Robustness was evaluated by comparing the predicted labels with and without adding noise to the CT image.

Acknowledgement

This work was funded in part by the National Institutes of Health under the grants R37CA214639, R01CA227289, R01CA237269.

References

  1. Akhtar Naveed and Mian Ajmal. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6:14410–14430, 2018.
  2. Antun Vegard, Renna Francesco, Poon Clarice, Adcock Ben, and Hansen Anders C. On instabilities of deep learning in image reconstruction and the potential costs of ai. Proceedings of the National Academy of Sciences, 117(48):30088–30095, 2020.
  3. Armato Samuel G III, McLennan Geoffrey, Bidaut Luc, McNitt-Gray Michael F, Meyer Charles R, Reeves Anthony P, Zhao Binsheng, Aberle Denise R, Henschke Claudia I, Hoffman Eric A, et al. The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. Medical physics, 38(2):915–931, 2011.
  4. Boser Bernhard E, Guyon Isabelle M, and Vapnik Vladimir N. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, 1992.
  5. Clark Kenneth, Vendt Bruce, Smith Kirk, Freymann John, Kirby Justin, Koppel Paul, Moore Stephen, Phillips Stanley, Maffitt David, Pringle Michael, et al. The cancer imaging archive (tcia): maintaining and operating a public information repository. Journal of digital imaging, 26(6):1045–1057, 2013.
  6. Cortes Corinna and Vapnik Vladimir. Support-vector networks. Machine learning, 20(3):273–297, 1995.
  7. Cui Sunan, Tseng Huan-Hsin, Pakela Julia, Haken Randall K Ten, and Naqa Issam El. Introduction to machine and deep learning for medical physicists. Medical Physics, 47(5):e127–e147, 2020.
  8. Duda Richard O, Hart Peter E, et al. Pattern classification. John Wiley & Sons, 2006.
  9. Esteva Andre, Kuprel Brett, Novoa Roberto A, Ko Justin, Swetter Susan M, Blau Helen M, and Thrun Sebastian. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118, 2017.
  10. Ewbank Michael P, Cummins Ronan, Tablan Valentin, Bateup Sarah, Catarino Ana, Martin Alan J, and Blackwell Andrew D. Quantifying the association between psychotherapy content and clinical outcomes using deep learning. JAMA psychiatry, 77(1):35–43, 2020.
  11. Eykholt Kevin, Evtimov Ivan, Fernandes Earlence, Li Bo, Rahmati Amir, Xiao Chaowei, Prakash Atul, Kohno Tadayoshi, and Song Dawn. Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1625–1634, 2018.
  12. Finlayson Samuel G, Chung Hyung Won, Kohane Isaac S, and Beam Andrew L. Adversarial attacks against medical deep learning systems. arXiv preprint arXiv:1804.05296, 2018.
  13. Finlayson Samuel G, Bowers John D, Ito Joichi, Zittrain Jonathan L, Beam Andrew L, and Kohane Isaac S. Adversarial attacks on medical machine learning. Science, 363(6433):1287–1289, 2019.
  14. Gong Bangming, Shen Lu, Chang Cai, Zhou Shichong, Zhou Weijun, Li Shuo, and Shi Jun. Bi-modal ultrasound breast cancer diagnosis via multi-view deep neural network svm. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pages 1106–1110. IEEE, 2020.
  15. Han Xiao. Mr-based synthetic ct generation using a deep convolutional neural network method. Medical physics, 44(4):1408–1419, 2017.
  16. Hoefler Torsten, Alistarh Dan, Ben-Nun Tal, Dryden Nikoli, and Peste Alexandra. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. arXiv preprint arXiv:2102.00554, 2021.
  17. Jiang Zhuoran, Chen Yingxuan, Zhang Yawei, Ge Yun, Yin Fang-Fang, and Ren Lei. Augmentation of cbct reconstructed from under-sampled projections using deep learning. IEEE transactions on medical imaging, 38(11):2705–2715, 2019.
  18. Kingma Diederik P and Ba Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  19. Kontaxis C, Bol GH, Lagendijk JJW, and Raaymakers BW. Deepdose: Towards a fast dose calculation engine for radiation therapy using deep learning. Physics in Medicine & Biology, 65(7):075013, 2020.
  20. Lakhani Paras and Sundaram Baskaran. Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology, 284(2):574–582, 2017.
  21. LeCun Yann, Bengio Yoshua, and Hinton Geoffrey. Deep learning. Nature, 521(7553):436–444, 2015.
  22. Madry Aleksander, Makelov Aleksandar, Schmidt Ludwig, Tsipras Dimitris, and Vladu Adrian. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  23. Men Kuo, Geng Huaizhi, Zhong Haoyu, Fan Yong, Lin Alexander, and Xiao Ying. A deep learning model for predicting xerostomia due to radiation therapy for head and neck squamous cell carcinoma in the rtog 0522 clinical trial. International Journal of Radiation Oncology* Biology* Physics, 105(2):440–447, 2019.
  24. Nguyen Dan, Jia Xun, Sher David, Lin Mu-Han, Iqbal Zohaib, Liu Hui, and Jiang Steve. 3d radiotherapy dose prediction on head and neck cancer patients with a hierarchically densely connected u-net deep learning architecture. Physics in medicine & Biology, 64(6):065020, 2019.
  25. Nyflot Matthew J, Thammasorn Phawis, Wootton Landon S, Ford Eric C, and Chaovalitwongse W Art. Deep learning for patient-specific quality assurance: Identifying errors in radiotherapy delivery by radiomic analysis of gamma images with convolutional neural networks. Medical physics, 46(2):456–464, 2019.
  26. Öztürk Şaban. Convolutional neural network based dictionary learning to create hash codes for content-based image retrieval. Procedia Computer Science, 183:624–629, 2021.
  27. Öztürk Şaban and Özkaya Umut. Residual lstm layered cnn for classification of gastrointestinal tract diseases. Journal of Biomedical Informatics, 113:103638, 2021.
  28. Sahiner Berkman, Pezeshk Aria, Hadjiiski Lubomir M, Wang Xiaosong, Drukker Karen, Cha Kenny H, Summers Ronald M, and Giger Maryellen L. Deep learning in medical imaging and radiation therapy. Medical physics, 46(1):e1–e36, 2019.
  29. Shen Chenyang, Gonzalez Yesenia, Klages Peter, Qin Nan, Jung Hyunuk, Chen Liyuan, Nguyen Dan, Jiang Steve B, and Jia Xun. Intelligent inverse treatment planning via deep reinforcement learning, a proof-of-principle study in high dose-rate brachytherapy for cervical cancer. Physics in Medicine & Biology, 64(11):115013, 2019.
  30. Shen Chenyang, Nguyen Dan, Chen Liyuan, Gonzalez Yesenia, McBeth Rafe, Qin Nan, Jiang Steve B, and Jia Xun. Operating a treatment planning system using a deep-reinforcement learning-based virtual treatment planner for prostate cancer intensity-modulated radiation therapy treatment planning. Medical physics, 2020a.
  31. Shen Chenyang, Nguyen Dan, Zhou Zhiguo, Jiang Steve B, Dong Bin, and Jia Xun. An introduction to deep learning in medical physics: advantages, potential, and challenges. Physics in Medicine & Biology, 65(5):05TR01, 2020b.
  32. Shen Chenyang, Tsai Min-Yu, Chen Liyuan, Li Shulong, Nguyen Dan, Wang Jing, Jiang Steve B, and Jia Xun. On the robustness of deep learning based lung nodule classification for ct images with respect to image noise. Physics in Medicine & Biology, 2020c.
  33. Shen Wei, Zhou Mu, Yang Feng, Yu Dongdong, Dong Di, Yang Caiyun, Zang Yali, and Tian Jie. Multi-crop convolutional neural networks for lung nodule malignancy suspiciousness classification. Pattern Recognition, 61:663–673, 2017.
  34. Tomori Seiji, Kadoya Noriyuki, Takayama Yoshiki, Kajikawa Tomohiro, Shima Katsumi, Narazaki Kakutarou, and Jingu Keiichi. A deep learning-based prediction model for gamma evaluation in patient-specific quality assurance. Medical physics, 45(9):4055–4065, 2018.
  35. Wu Weiwen, Hu Dianlin, Cong Wenxiang, Shan Hongming, Wang Shaoyu, Niu Chuang, Yan Pingkun, Yu Hengyong, Vardhanabhuti Varut, and Wang Ge. Stabilizing deep tomographic reconstruction networks. arXiv preprint arXiv:2008.01846, 2020.
  36. Wu Weiwen, Hu Dianlin, Niu Chuang, Yu Hengyong, Vardhanabhuti Varut, and Wang Ge. Drone: Dual-domain residual-based optimization network for sparse-view ct reconstruction. IEEE Transactions on Medical Imaging, 2021.
  37. Yuan Xiaoyong, He Pan, Zhu Qile, and Li Xiaolin. Adversarial examples: Attacks and defenses for deep learning. IEEE transactions on neural networks and learning systems, 30(9):2805–2824, 2019.
  38. Zhang Shi-Xiong, Liu Chaojun, Yao Kaisheng, and Gong Yifan. Deep neural support vector machines for speech recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4275–4279. IEEE, 2015.
