Magnetic Resonance in Medical Sciences. 2019 Jul 26;19(3):184–194. doi: 10.2463/mrms.mp.2019-0063

A Fundamental Study Assessing the Diagnostic Performance of Deep Learning for a Brain Metastasis Detection Task

Tomoyuki Noguchi 1,2,3,4,*, Fumiya Uchiyama 1, Yusuke Kawata 1, Akihiro Machitori 1, Yoshitaka Shida 1, Takashi Okafuji 1, Kota Yokoyama 1, Yosuke Inaba 5, Tsuyoshi Tajima 1
PMCID: PMC7553808  PMID: 31353336

Abstract

Purpose:

Increased use of deep convolutional neural networks (DCNNs) in medical imaging diagnosis requires a well-defined evaluation of diagnostic performance. We performed a fundamental investigation of the diagnostic performance of DCNNs using a brain metastasis detection task.

Methods:

We retrospectively investigated AlexNet and GoogLeNet using 3117 MRI images with brain metastasis (positive) and 37961 images without (negative) regarding (1) diagnostic biases, (2) the optimal K number in K-fold cross-validation (K-CV), (3) the optimal positive versus negative image ratio, (4) the accuracy improvement curves, (5) accuracy range prediction by the bootstrap method, and (6) metastatic lesion detection by regions with CNN (R-CNN).

Results:

Respectively, AlexNet and GoogLeNet had (1) maximal mean ± 95% confidence intervals (95% CIs) of 50 ± 4.6% and 50 ± 4.9% measured with equal-sized negative versus negative and positive versus positive image datasets, (2) K numbers of no less than 10 and 4 in K-CVs whose 95% CIs fell within the respective maximum biases of 4.6% and 4.9%, (3) the highest accuracies of 74% with an equal positive versus negative image ratio and 91% with a fourfold negative-to-positive image ratio, (4) accuracy improvement curves rising from 69% to 74% and from 73% to 88% as the number of positive-negative training image pairs increased from 500 to 2495, (5) at least nine and six out of 10 10-CV result sets needed to predict the accuracy ranges by the bootstrap method, and (6) metastatic lesion detection accuracies of 50% and 45% by R-CNNs.

Conclusions:

Our research presents methodological fundamentals for evaluating the diagnostic performance of DCNNs in visual recognition. Our series will help guide accuracy investigations of computer diagnosis in medical imaging.

Keywords: bias, brain neoplasms, learning curve, magnetic resonance imaging, neural networks (computer)

Introduction

Recent advances in computer diagnosis systems with deep convolutional neural networks (DCNNs) have overcome difficulties that earlier computer identification systems encountered in visual recognition. The field of radiological diagnosis might change greatly with DCNNs, ahead of other medical fields. DCNNs have been used for the differentiation of liver masses at dynamic contrast-enhanced CT, the detection of cerebral aneurysms in MR angiography, pulmonary tuberculosis diagnosis, diagnostic mediastinoscopy, pulmonary nodule diagnosis, and more.1–7 The dramatic development of computer vision using DCNNs suggests that DCNNs will both compensate for the disadvantages of human intelligence, which depends on the operator's mental and physical condition, and consolidate computed diagnosis systems.

However, no standardized methods for evaluating the performance of DCNNs have been established. The diagnostic performance of DCNNs has been evaluated by various methods, which may hinder a detailed understanding of DCNNs. The development of computed diagnosis systems using DCNNs also requires establishing a methodology for evaluating diagnostic performance.

We therefore performed basic verification experiments to investigate the diagnostic features of DCNNs using a brain metastasis detection task on MRI images.

Materials and Methods

Study design

This study was approved by our hospital’s Institutional Review Board, which waived the need for written informed consent from the patients.

Patients

As the brain metastases group, we enrolled 162 of 164 consecutive patients with brain metastases located in the parenchyma or pia mater who underwent contrast-enhanced three-dimensional T1-weighted imaging with fat suppression between April 2014 and November 2017 (male/female, 92/70; age range/average, 37–90/68.5 years). Their primary malignant neoplasms were lung cancer in 134 patients, breast cancer in 11, digestive cancer in 10, renal cancer in two, and one patient each with myeloma, breast lymphoma, mesothelioma, pheochromocytoma, and a primary of unknown origin. The other two patients were excluded because of suboptimal MRI results due to movement artifact.

As the control group, we selected 282 of 301 consecutive patients who underwent head MRI, enrolled retrospectively from November 2017 back to October 2016, and who remained free from brain metastasis for at least the subsequent 6 months as confirmed on the electronic medical chart (male/female, 163/119; age range/average, 20–91/69.4 years). We enrolled them so that the total number of their MRI slices was about 10 times that of the patients with brain metastases, matching the requirement of Test 3 plus roughly a twofold margin to allow flexibility in data analysis. The other 19 patients were excluded because of metal artifacts due to old cerebrovascular disease in eight patients, dental instrumentation in four, and one patient each with suspected convexity meningioma, suspected acoustic schwannoma, suspected trigeminal schwannoma, basilar aneurysm, and parietal venous malformation observed on MRI images.

Magnetic resonance imaging

We investigated MRI examinations performed on the three MRI units at our hospital: a MAGNETOM Avanto (Siemens AG, Erlangen, Germany), an MRT200SP5 (Canon Medical Co. Ltd., Otawara, Japan), and a MAGNETOM Verio (Siemens AG). We allowed images obtained on the three different MRI units to be used because DCNNs should be trained to tolerate differences in image quality among units. We used contrast-enhanced three-dimensional T1-weighted images with fat suppression acquired with parameters optimized for each MRI unit. The detailed parameters were as follows: TR/TE (ms) = 3.92–15/1.44–4.19; flip angle (°) = 10–18; slice thickness/spacing between slices (mm) = 1.0–2.0/−1.0–1.0; matrix = 188–256; pixel spacing (mm) = 0.5–1.0; number of slices = 144–288. The gadolinium (Gd) contrast agents, intravenously injected in volumes optimized for each patient's condition, were 0.1 mmol/kg Gd-DO3A-butrol, 0.1 mmol/kg Gd-DTPA, or 0.1 or 0.2 mmol/kg Gd-HP-DO3A.

Image data processing

We tested two classifiers with DCNN architectures, AlexNet and GoogLeNet,8–10 both of which had already been trained on more than a million images from the ImageNet database (ImageNet. http://www.image-net.org) and were distributed as add-on software for MATLAB (ver. 2017b; MathWorks Inc., Natick, MA, USA). We used a custom-built image processing computer (TEGARA Corporation, Shizuoka, Japan) containing a Quadro P2000 5 GB graphics processing unit (Nvidia Corporation, Santa Clara, CA, USA), an Intel Xeon E5-2680v4 2.40 GHz processor (Intel Corporation, Santa Clara, CA, USA), 1.0 TB of hard disk space, and 64 GB of RAM.

We chose 3117 positive images, each containing a cross-section of an enhanced tumor lesion regardless of its location or the number of lesions, extracted from the brain metastases group (Fig. 1). We allowed multiple slices from the same subject, which may be mutually correlated, for the following reasons: (1) it was difficult to select the single most appropriate slice from the whole-brain images of patients with multiple brain metastases, and (2) we did not adopt data augmentation, in which the original image is rotated, cropped, or deformed to increase the training data size at the cost of high correlation between the original and augmented images. Instead, we expected multiple slices of the same subject to have a favorable effect similar to that of data augmentation.

Fig. 1.

A total of 3117 positive image slices include any cross-sections of enhanced tumor lesions regardless of lesion location or number, such as a ring-like enhanced lesion (a: arrowhead), a minimal dot-like enhanced lesion (b: arrowhead), or multiple enhanced lesions of various cross-sections in the same slice (c: arrowheads).

We extracted 37961 negative images including whole brain parenchyma from the control group, about 12 times the number of positive images, as mentioned above.

The versions of AlexNet and GoogLeNet used in the current study did not support the Digital Imaging and Communications in Medicine (DICOM) format. Therefore, using the free software program Fiji (https://imagej.net/Fiji), we converted the images from DICOM format into the two image formats and sizes supported by the DCNNs: 16-bit Portable Network Graphics (PNG) at 227 × 227 pixels for AlexNet, and Joint Photographic Experts Group (JPEG) format at 224 × 224 pixels for GoogLeNet. During format conversion, we adjusted the brightness and contrast of the original DICOM image using the look-up-table function in Fiji.
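
The conversion step can be sketched in a few lines of code. The study used Fiji; the following is a hypothetical Python equivalent using pydicom and Pillow, in which a simple min-max rescaling stands in for the look-up-table brightness/contrast adjustment, and the file names are placeholders.

```python
import numpy as np
import pydicom
from PIL import Image

def dicom_to_classifier_input(dicom_path, out_path, size, bit16=True, fmt="PNG"):
    """Convert one DICOM slice to a resized PNG/JPEG for AlexNet or GoogLeNet."""
    ds = pydicom.dcmread(dicom_path)
    pixels = ds.pixel_array.astype(np.float32)
    lo, hi = float(pixels.min()), float(pixels.max())
    scaled = (pixels - lo) / max(hi - lo, 1e-6)        # crude stand-in for Fiji's LUT adjustment
    resized = np.array(Image.fromarray(scaled, mode="F").resize((size, size), Image.BILINEAR))
    if bit16:
        img = Image.fromarray((resized * 65535).astype(np.uint16))   # 16-bit PNG
    else:
        img = Image.fromarray((resized * 255).astype(np.uint8))      # 8-bit grayscale JPEG
    img.save(out_path, format=fmt)

# dicom_to_classifier_input("slice.dcm", "slice.png", 227, bit16=True, fmt="PNG")    # AlexNet
# dicom_to_classifier_input("slice.dcm", "slice.jpg", 224, bit16=False, fmt="JPEG")  # GoogLeNet
```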

Two radiologists (F.U. and T.N.) extracted the negative and positive images and drew 9684 free-sized bounding boxes around the metastatic lesions over all 3117 positive images using an annotation application running on MATLAB.

Since CNNs pretrained with non-medical images can serve as an effective image recognition baseline for classification of medical images,4,11,12 we used the transfer learning method as follows. In both AlexNet and GoogLeNet, we separately replaced the last three layers with three new layers consisting of a fully-connected layer, a softmax layer, and a classification output layer. The weight learning rate factor and the bias learning rate factor of the fully-connected layer were both set to 20. The classification output layer had two outputs corresponding to the positive and negative categories. We used 0.0001 for the initial learning rate, 10 for the mini-batch size, and 100 for the maximum epochs as the training options. The image data augmentation method was not used.
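
As a reference point, a rough PyTorch analogue of this transfer-learning setup is sketched below (the study itself used the MATLAB add-on networks); the 2-class output layer, the 20× learning-rate factor on the new layer, and the initial learning rate follow the settings above, while the SGD momentum value is an assumed default not stated in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)  # ImageNet-pretrained
model.classifier[6] = nn.Linear(4096, 2)        # new fully-connected layer: positive vs. negative

base_lr = 1e-4                                  # initial learning rate reported above
new_params = list(model.classifier[6].parameters())
old_params = [p for p in model.parameters() if all(p is not q for q in new_params)]
optimizer = torch.optim.SGD(
    [{"params": old_params, "lr": base_lr},
     {"params": new_params, "lr": base_lr * 20}],  # 20x weight/bias learning-rate factor
    momentum=0.9,                                  # assumed default; not specified in the paper
)
criterion = nn.CrossEntropyLoss()               # plays the role of the softmax + classification layers
# Training would then iterate over mini-batches of 10 images for up to 100 epochs.
```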

Statistical and data analysis

We performed the following tests.

Test (1): Bias in the diagnostic performance of the DCNNs

Before measuring the ability to distinguish the positive images from the negative images, we investigated the potential bias of the diagnostic performance of the DCNNs. We first prepared an image dataset consisting of a pair of equal-sized, randomly assigned subsets of 1558 negative images each; that is, this dataset was composed of two subsets containing only negative images. We then performed a standard K-fold cross-validation test (K-CV) with K numbers ranging from 3 to 20,13–17 as follows: (1) splitting the dataset into K equal parts, (2) setting the ratio of the data as (1/K), (1/K), and (1–2/K) for the testing, validation, and training image numbers, respectively, (3) training the DCNN with the training images, (4) validating the DCNN with the validation images to prevent overtraining, (5) testing the DCNN with the testing images to evaluate the judgment ability, (6) repeating this K times, each time selecting a different testing dataset, and (7) finally determining the mean accuracy by averaging the K test results and the 95% confidence interval (95% CI) from the K test results using the t-distribution. Since the mean accuracy should be 50% in these tests, the deviation of the mean accuracy ± 95% CI from 50% indicates the potential bias of the diagnostic performance of the DCNNs.

In the same way, an image dataset consisting of a pair of equal-sized subsets of 1558 randomly assigned positive images each was subjected to the same investigation.
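
The K-fold procedure and the t-distribution interval can be sketched as follows. This is a minimal illustration using scikit-learn and SciPy (not the authors' MATLAB code): train_and_test is a placeholder for fine-tuning a DCNN on the training/validation folds and returning accuracy on the test fold, and taking the next fold as the validation set is just one way to realize the (1/K), (1/K), (1–2/K) split.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import KFold

def kfold_mean_accuracy(images, labels, K, train_and_test, seed=0):
    """Return the mean accuracy of a K-fold CV and the half-width of its 95% CI."""
    folds = list(KFold(n_splits=K, shuffle=True, random_state=seed).split(images))
    accs = []
    for i in range(K):
        test_idx = folds[i][1]                      # 1/K of the data for testing
        val_idx = folds[(i + 1) % K][1]             # another 1/K for validation (early stopping)
        train_idx = np.setdiff1d(np.arange(len(images)),
                                 np.concatenate([test_idx, val_idx]))
        accs.append(train_and_test(images, labels, train_idx, val_idx, test_idx))
    accs = np.asarray(accs)
    half_width = stats.t.ppf(0.975, K - 1) * accs.std(ddof=1) / np.sqrt(K)
    return accs.mean(), half_width                  # deviation of the mean from 50% = bias
```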

Test (2): Variation of K number in the K-fold cross-validation method

As the K number in K-CV increases, the reliability of the accuracy improves, and the accuracy itself also rises because the number of training images increases. To assess the variance of the diagnostic accuracy, we performed standard K-CVs using 3117 pairs of positive and negative images with K numbers set from 3 to 20. To reduce the effect of K number in K-CV, we also performed modified K-CVs in which the testing and validation images were still assigned (1/K) of the data each but the training images were fixed to 1000 pairs of positive and negative images. We measured the mean accuracy ± 95% CI for each K-CV and determined the optimal K value in light of the bias obtained in Test 1.
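
A sketch of this modified K-CV, assuming the same placeholder train_and_test as above: the test and validation folds still take 1/K of the data each, but the training pool is subsampled to a fixed 1000 pairs (2000 images) so that the effect of K is decoupled from training-set growth; class balancing of the subsample is omitted here for brevity.

```python
import numpy as np

def modified_kfold(images, labels, K, train_and_test, fixed_train_pairs=1000, seed=0):
    """K-CV with a fixed-size training set to isolate the effect of K itself."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    fold = len(images) // K
    accs = []
    for i in range(K):
        test_idx = order[i * fold:(i + 1) * fold]
        j = (i + 1) % K
        val_idx = order[j * fold:(j + 1) * fold]
        rest = np.setdiff1d(order, np.concatenate([test_idx, val_idx]))
        train_idx = rng.choice(rest, size=min(2 * fixed_train_pairs, len(rest)), replace=False)
        accs.append(train_and_test(images, labels, train_idx, val_idx, test_idx))
    return float(np.mean(accs))
```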

Test (3): The negative-to-positive image data ratio

A difference in the negative-to-positive image data ratio can affect the diagnostic performance.18,19 We prepared 10 image datasets, each consisting of a constant 3000 positive images and 750, 1000, 1500, 2000, 3000, 6000, 12000, 18000, 24000, or 30000 randomly extracted negative images. That is, the numbers of negative images were one-fourth, one-third, one-half, two-thirds, one, two, four, six, eight, and 10 times the number of positive images. We tested these 10 datasets by 10-CV with a modified procedure in which 300 pairs of negative and positive images were constantly assigned to each of the validation and testing sets, instead of using the ratio of (1/K), (1/K), and (1–2/K) for the testing, validation, and training image numbers, to avoid inconsistency in the number of significant figures in the results. We adopted a K number of 10 based on the optimal K number determined in Test 2. We measured mean accuracies, areas under the curve (AUCs), sensitivities, and specificities.
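
The per-fold statistics can be computed as in the short sketch below, where y_true holds ground-truth labels (1 = metastasis) and y_score the DCNN's positive-class score for each test image; both names and the 0.5 threshold are illustrative placeholders, not the authors' implementation.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def fold_statistics(y_true, y_score, threshold=0.5):
    """Accuracy, sensitivity, specificity, and AUC for one test fold."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),      # positive (metastasis) images correctly flagged
        "specificity": tn / (tn + fp),      # negative images correctly passed
        "auc": roc_auc_score(y_true, y_score),
    }
```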

Test (4): The accuracy improvement curves

As the number of training images increases, the mean accuracy obtained with 10-CV increases. Although it is difficult to estimate how many training images are needed to obtain the best performance of a DCNN, we can judge whether the current number of training images is sufficient to fully demonstrate its ability. We prepared five datasets of 624, 1250, 1832, 2362, and 3117 pairs of randomly extracted positive and negative images, adjusted to yield 500, 1000, 1500, 2000, and 2495 pairs of training images under the ratio of (1/K), (1/K), and (1–2/K) for the testing, validation, and training image numbers. We then performed 10-CVs to obtain the mean accuracies and plotted them to obtain the accuracy improvement curves. Based on the result of Test 2, we adopted the number of training images as the index of dataset size and used a K number of 10. Here, the results for 1000 and 2495 training pairs were reused from the modified and standard 10-CVs in Test 2, respectively.
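
The curve and its regression line reduce to a one-line fit; the minimal sketch below uses the five AlexNet accuracies later reported in the Results as placeholder data and should roughly reproduce the quoted regression coefficients (small differences can arise from rounding of the reported accuracies).

```python
import numpy as np

train_pairs = np.array([500, 1000, 1500, 2000, 2495])
mean_accuracy = np.array([0.69, 0.72, 0.70, 0.78, 0.74])    # AlexNet 10-CV results (illustrative)

slope, intercept = np.polyfit(train_pairs, mean_accuracy, 1)
print(f"y = {slope:.2e} x + {intercept:.2f}")                # slope on the order of 3e-05 per pair
print(f"accuracy gain per 1000 pairs: {slope * 1000:.1%}")   # roughly 3% per 1000 training pairs
```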

Test (5): Accuracy prediction by the bootstrap method

K-CV requires training and testing DCNNs K times, which ties up computational resources for a long time. If some of those operations can be skipped, the resource-consuming process of K-CV can be shortened. Here, we explored a time- and resource-saving validation method for 10-CV.

We created nine combinational groups comprising all 10C1, 10C2, 10C3, …, and 10C9 combinations (i.e., 10, 45, 120, …, and 10 datasets) drawn from the 10 result sets obtained in Test 2; these simple combinational groups are hereinafter referred to as the 10C1, 10C2, 10C3, …, and 10C9 groups, respectively. We then calculated the 95% CI of each subset in each group by the bootstrap method.20 When a given 10CN group had a rate of no less than 95% of its 95% CI ranges including the mean accuracy determined by 10-CV in Test 2, N result sets out of the 10 were judged able to predict the accuracy range of the 10-CV.

Meanwhile, a given set of N result sets out of the 10 can be propagated into N subgroups comprising the NC1, NC2, …, and NCN combinations. We first calculated the accuracy of each of those subgroups by the bootstrap method, next sorted the subgroups by accuracy, and then extracted the median values from the subgroups. We finally replaced the initial 10C1, 10C2, 10C3, …, and 10C9 datasets with those median values, respectively. These propagated combinational groups are hereinafter called the median 10C1, median 10C1–2, …, and median 10C1–9 groups, respectively. We then calculated the 95% CIs of those groups by the bootstrap method and scored their number rates as mentioned above.

We also compared the mean processing time per operation of 10-CV with that per 95% CI range obtained from the simple and propagated combinational groups by the bootstrap method for AlexNet and GoogLeNet, respectively, to confirm whether the bootstrap method could reduce the processing time.

In the calculation of the accuracies and 95% CIs by the bootstrap method in the current test, the bias-corrected and accelerated bootstrap interval method was adopted.20
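
In code, the simple combinational groups and their BCa intervals might look like the sketch below; fold_accuracies stands for the 10 test accuracies from one 10-CV, scipy.stats.bootstrap supplies the BCa interval, and the function returns how often the CI of an N-sized subset contains the full 10-CV mean. Singleton subsets are degenerate and would need special handling; this is an illustration under those assumptions, not the authors' implementation.

```python
from itertools import combinations

import numpy as np
from scipy.stats import bootstrap

def subset_coverage_rate(fold_accuracies, n):
    """Rate at which BCa 95% CIs of n-result subsets contain the 10-CV mean accuracy."""
    fold_accuracies = np.asarray(fold_accuracies)
    target_mean = fold_accuracies.mean()                 # mean accuracy of the full 10-CV
    hits, total = 0, 0
    for subset in combinations(fold_accuracies, n):      # one member of the 10CN group
        res = bootstrap((np.asarray(subset),), np.mean,
                        confidence_level=0.95, method="BCa", random_state=0)
        low, high = res.confidence_interval
        hits += int(low <= target_mean <= high)
        total += 1
    return hits / total        # judged sufficient when this rate is no less than 0.95
```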

Test (6): Metastatic lesion detection by R-CNN

Regions with CNN (R-CNN) is an application derived from DCNNs that proposes regions of interest.21 We assessed the metastatic lesion detection abilities of AlexNet- and GoogLeNet-mounted R-CNNs using the 10 result sets determined in Test 2 by 10-CV.

Because we could not define the ratio of validation to training image numbers, which the R-CNNs handled automatically, the positive training and validation images of each dataset were used together for training the R-CNNs, in conjunction with the coordinate data of the true bounding boxes mentioned above. The positive and negative testing images were used to evaluate the metastatic lesion detection ability of the R-CNNs. That is, a ratio of (1/K) and (1–1/K) for the testing and training image numbers was adopted for each dataset.

We performed two types of tests: metastatic lesion detection by the R-CNNs alone and metastatic lesion detection in conjunction with a primary selection test. In the latter, according to the judgment of the primary selection test adopted from the results obtained in Test 2, the testing images judged positive were further evaluated by the R-CNNs, and the testing images judged negative were determined to have no bounding box predicted by the R-CNNs.

For the positive testing images, we scored the proportion of true bounding boxes covered entirely or in part by the predicted bounding boxes, so as to avoid missing brain metastases even at the cost of a considerable increase in false positives. For each negative testing image, one point was scored if no predictive bounding box was proposed by the R-CNN and zero points if any box was proposed. We adopted this scoring system, in which placing more bounding boxes tends to yield higher scores on positive images, because we gave greater importance to sensitivity than to specificity for metastatic lesion detection and the R-CNNs proposed bounding boxes according to their own computation. Representative judgment examples are shown in Fig. 2.
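
The scoring rule can be made concrete with a small sketch; boxes are (x1, y1, x2, y2) tuples, and the overlap test is an illustrative reading of "covered entirely or in part" rather than the authors' exact implementation.

```python
def boxes_overlap(a, b):
    """True if two (x1, y1, x2, y2) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def score_positive_image(true_boxes, predicted_boxes):
    """Count true boxes covered entirely or in part by any predicted box."""
    covered = sum(any(boxes_overlap(t, p) for p in predicted_boxes) for t in true_boxes)
    return covered, len(true_boxes)        # e.g. 1 of 2 points in Fig. 2a/2b

def score_negative_image(predicted_boxes):
    """A negative image scores 1 only when no box at all is proposed."""
    return 0 if predicted_boxes else 1
```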

Fig. 2.

Representative examples of the judgment of lesion detection by regions with convolutional neural network (R-CNN) in Test (6). The two yellow bounding boxes (1 and 2) in a positive image indicate two brain metastases (a), and the two green areas in the same positive image reflect the lesions predicted by the R-CNN (b). Bounding box (1) is covered by one of the two green areas, which scores 1 of 1 point. Bounding box (2) is not covered by either green area, which scores 0 of 1 point. In the negative testing image (c), 0 out of 1 point is scored because two predictive green areas are proposed by the R-CNN (d).

Results

Diagnostic bias in the DCNNs

Figure 3 illustrates the results of diagnostic bias in AlexNet and GoogLeNet. The mean accuracies ± 95% CIs in classifying the equal-sized negative image dataset remained within 50 ± 3.9% in AlexNet and 50 ± 4.9% in GoogLeNet for all K numbers of K-CVs (Fig. 3a). The mean accuracies ± 95% CIs in classifying the equal-sized positive image dataset remained within 50 ± 4.6% in AlexNet and 50 ± 4.4% in GoogLeNet (Fig. 3b).

Fig. 3.

The variation of mean accuracy in classification of the equal-sized negative (a) and positive image datasets (b) judged by AlexNet (Δ) and GoogLeNet (•). The maximal deviations from the expected accuracy of 50% are 4.6% in AlexNet with the positive image dataset and 4.9% in GoogLeNet with the negative image dataset. CI, confidence interval.

The optimal K number in K-fold cross-validation method

Figure 4 illustrates the mean accuracies and 95% CIs of K-CVs in AlexNet and GoogLeNet. When the K number was increased from 3 to 20, the mean accuracy of standard K-CVs in AlexNet increased from 69% to 75% (y = 1.15 × 10−3x + 0.73 in single regression analysis), while that of modified K-CVs with 1000 fixed training pairs changed little, from 73% to 72% (y = −3.60 × 10−3x + 0.72). In GoogLeNet, the mean accuracy increased from 78% to 88% in standard K-CVs (y = −5.97 × 10−3x + 0.83) and remained at 80% in modified K-CVs (y = −0.170 × 10−3x + 0.80) (Fig. 4a).

Fig. 4.

The mean accuracies (a) and 95% confidence intervals (95% CIs) (b) of K-fold cross-validation tests (K-CVs) in AlexNet and GoogLeNet. For AlexNet (a: Δ with a black line) and GoogLeNet (a: • with a black line), the mean accuracy of standard K-CVs increases as the K number increases from 3 to 20, while that of modified K-CVs with 1000 fixed training pairs changes little (a: Δ with a gray line for AlexNet and • with a gray line for GoogLeNet). Meanwhile, the 95% CIs of standard and modified K-CVs for AlexNet (b: Δ with black and gray lines, respectively) and GoogLeNet (b: • with black and gray lines, respectively) tend to diminish as the K number increases. For AlexNet and GoogLeNet, K numbers of no less than 10 and 4, respectively, give 95% CIs that fall within the respective maximum biases of 4.6% and 4.9%.

The 95% CIs of standard and modified K-CVs for AlexNet and GoogLeNet tended to diminish as the K number increased (Fig. 4b). K values of 10 or more in AlexNet and four or more in GoogLeNet gave 95% CIs of K-CVs that fell within the respective maximum biases of 4.6% and 4.9% determined in Test 1.

Ratio of the negative-to-positive image dataset

The variation of the statistical values with the negative-to-positive image ratio in AlexNet and GoogLeNet is provided in Fig. 5. As the negative image data increased, the specificity increased but the sensitivity decreased, and the two intersected around the point at which the negative and positive images were equal in number, where the mean accuracy, AUC, sensitivity, and specificity were 74%, 0.841, 70%, and 78% in AlexNet, and 87%, 0.961, 90%, and 84% in GoogLeNet, respectively. Beyond this point, AlexNet showed decreases in mean accuracy, sensitivity, and AUC, but not in specificity (Fig. 5a). GoogLeNet demonstrated 91%, 0.983, 82%, and 99%, respectively, at the fourfold negative-to-positive image ratio, which gave the highest mean accuracy and AUC (Fig. 5b).

Fig. 5.

The variation of the statistical values across 10 image datasets, in which the number of negative images was randomly assigned at one-fourth, one-third, one-half, two-thirds, one, two, four, six, eight, and 10 times the 3000 positive images, judged by AlexNet (a) and GoogLeNet (b). As the negative image data increases, the specificity increases but the sensitivity decreases, and the two intersect around the point at which the negative and positive images are equal in number. Beyond this point, GoogLeNet demonstrates the highest mean accuracy and AUC at the fourfold negative-to-positive image ratio. AUC, area under the curve.

The accuracy improvement curve

Figure 6 shows the accuracy improvement curves obtained with 10-CV in AlexNet and GoogLeNet. When the number of training pairs was changed to 500, 1000, 1500, 2000, and 2495, AlexNet showed mean accuracies of 69%, 72%, 70%, 78%, and 74%, with an increasing rate of 3.3% per 1000 positive-negative training image pairs (y = 3.33 × 10−5x + 0.67). GoogLeNet demonstrated higher mean accuracies of 73%, 80%, 83%, 86%, and 88%, with a higher increasing rate of 8.3% per 1000 training image pairs (y = 8.31 × 10−5x + 0.69) compared with AlexNet.

Fig. 6.

The accuracy improvement curves obtained with 10-fold cross-validation (10-CV) in AlexNet and GoogLeNet. AlexNet and GoogLeNet show increasing rates of 3.3% per 1000 training pairs (y = 3.33 × 10−5x + 0.67; Δ) and 8.3% per 1000 training pairs (y = 8.31 × 10−5x + 0.69; •), respectively.

Accuracy prediction by the bootstrap method

Table 1 summarizes the results of predicting the range of the mean accuracy of the 10-CV determined in Test 2 for AlexNet and GoogLeNet using the bootstrap method. None of the simple combinational groups in AlexNet or GoogLeNet reached 95% or more of the number rate of 95% CI ranges including the mean accuracy of the 10-CV in Test 2. Among the propagated combinational groups, however, the median 10C1–9 group in AlexNet and the median 10C1–6, 10C1–7, 10C1–8, and 10C1–9 groups in GoogLeNet exceeded a 95% number rate.

Table 1.

The range of mean accuracy prediction by the bootstrap method

Type of group | Group | AlexNet: number rate of 95% CI ranges including the mean accuracy of 10-CV (%) | AlexNet: mean processing time to obtain one 95% CI (s) | GoogLeNet: number rate of 95% CI ranges including the mean accuracy of 10-CV (%) | GoogLeNet: mean processing time to obtain one 95% CI (s)
Simple combinational group | 10C1 | 30 (3/10) | 0.30 | 70 (7/10) | 0.40
 | 10C2 | 44 (20/45) | 0.21 | 67 (30/45) | 0.28
 | 10C3 | 44 (53/120) | 0.18 | 70 (84/120) | 0.25
 | 10C4 | 49 (103/210) | 0.17 | 71 (148/210) | 0.25
 | 10C5 | 50 (125/252) | 0.18 | 77 (193/252) | 0.25
 | 10C6 | 59 (123/210) | 0.19 | 77 (162/210) | 0.26
 | 10C7 | 67 (80/120) | 0.21 | 88 (105/120) | 0.29
 | 10C8 | 67 (30/45) | 0.23 | 82 (37/45) | 0.31
 | 10C9 | 60 (6/10) | 0.25 | 90 (9/10) | 0.32
Propagated combinational group | Median 10C1 | 30 (3/10) | 0.30 | 70 (7/10) | 0.40
 | Median 10C1–2 | 44 (20/45) | 0.64 | 67 (30/45) | 0.84
 | Median 10C1–3 | 58 (69/120) | 1.27 | 79 (95/120) | 1.75
 | Median 10C1–4 | 64 (134/210) | 2.59 | 88 (185/210) | 3.69
 | Median 10C1–5 | 69 (174/252) | 5.51 | 94 (237/252) | 7.81
 | Median 10C1–6 | 73 (153/210) | 12.07 | 99 (207/210) | 16.64
 | Median 10C1–7 | 83 (100/120) | 26.77 | 98 (118/120) | 36.44
 | Median 10C1–8 | 87 (39/45) | 58.82 | 100 (45/45) | 79.02
 | Median 10C1–9 | 100 (10/10) | 125.30 | 100 (10/10) | 165.80

CV, cross validation; CI, confidence interval.

The mean processing times per operation of 10-CV, about 393 s for AlexNet and 3502 s for GoogLeNet, were longer than the maximum mean processing times per 95% CI range obtained by the bootstrap method, namely 0.3 and 0.4 s for the simple combinational groups and 125 and 166 s for the propagated combinational groups in AlexNet and GoogLeNet, respectively.

Metastatic lesion detection by R-CNN

Table 2 provides the results of metastatic lesion detection by R-CNN in AlexNet and GoogLeNet. The best metastatic lesion detection ability was observed for the GoogLeNet-mounted R-CNN with the primary selection test by GoogLeNet, with a mean accuracy, sensitivity, specificity, positive predictive value, and negative predictive value of 50%, 28%, 95%, 92%, and 39%, respectively. The second-best was the AlexNet-mounted R-CNN with the primary selection test by AlexNet, with 45%, 27%, 83%, 77%, and 35%, respectively. The detection performance of the AlexNet-mounted R-CNN without the primary selection test, at 31%, 33%, 29%, 49%, and 17%, respectively, was similar to that of the GoogLeNet-mounted R-CNN without the primary selection test, at 31%, 29%, 35%, 48%, and 19%, respectively.

Table 2.

Lesion detection by R-CNNs

Statistical parameter | R-CNN: AlexNet (%) | R-CNN: GoogLeNet (%) | R-CNN with primary screening test by DCNN: AlexNet (%) | R-CNN with primary screening test by DCNN: GoogLeNet (%)
Accuracy | 31 | 31 | 45 | 50
Sensitivity | 33 | 29 | 27 | 28
Specificity | 29 | 35 | 83 | 95
PPV | 49 | 48 | 77 | 92
NPV | 17 | 19 | 35 | 39

PPV, positive predictive value; NPV, negative predictive value; R-CNN, regions with convolutional neural network; DCNN, deep convolutional neural network.

Discussion

We performed the fundamental investigations of DCNNs regarding diagnostic biases, the optimal negative-to-positive image ratio, the optimal K number of K-fold cross validation, the accuracy improvement curve, the accuracy range prediction by the bootstrap method, and lesion detection by R-CNNs using the detection task of brain metastasis on MRI images.

Diagnostic bias in the DCNNs

Ideally, when testing for diagnostic bias in DCNNs using equal-sized positive versus positive or negative versus negative image datasets, an accuracy of 50% should be obtained. In practice, however, maximum deviations of 4.6% in AlexNet and 4.9% in GoogLeNet were observed in our datasets, regardless of the K number in K-CV. Although these deviations might depend on the features of the image datasets, such potential biases should be taken into account when the diagnostic performance of DCNNs is investigated.22

The optimal K number in K-fold cross-validation method

Both AlexNet and GoogLeNet showed increased mean accuracies with standard K-CVs as the K number increased from 3 to 20. This is because the number of training images, assigned a share of (1–2/K), increases as the K number increases. In practice, the number of training images increased from 1039 to 2807 positive-negative pairs as K increased from 3 to 20. In addition, the modified K-CVs with 1000 fixed training pairs showed no pronounced fluctuation in accuracy. These results suggest that the number of training images, as well as the sample images themselves, should be taken into account when the diagnostic performances of different DCNNs are compared.
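
As a quick check of the numbers quoted above, the (1 − 2/K) training share of the 3117 pairs can be tabulated directly; small discrepancies with the quoted 1039–2807 range would come from how the folds are rounded.

```python
for K in (3, 5, 10, 20):
    print(K, round((1 - 2 / K) * 3117))   # 3 -> 1039, 10 -> 2494, 20 -> 2805 pairs
```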

The optimal K number for K-CV has been discussed for a long time.13 In most cases, 5- or 10-fold CV has been used to validate machine learning simply because these values work well in practice, although the choice of K is arbitrary. When we applied the maximum biases of 4.6% in AlexNet and 4.9% in GoogLeNet revealed by Test (1) as the quality limit, K numbers of 10 or more in AlexNet and four or more in GoogLeNet were acceptable in our study.

Ratio of the negative-to-positive image dataset

Imbalanced datasets are a common but essential problem in real-world settings.18,19 Although there are many ways to address imbalanced datasets,23–25 the effects and tendencies of dataset balance should be investigated. In our test of the effect of the negative-to-positive image ratio on diagnostic performance, the specificity and sensitivity curves intersected around the point of equal numbers of negative and positive images in both AlexNet and GoogLeNet as the negative image data increased. Interestingly, GoogLeNet demonstrated the highest mean accuracy at the fourfold negative-to-positive image ratio. GoogLeNet might gain diagnostic power from the sample number rather than from the balance of the samples under our experimental conditions. GoogLeNet might have unknown mechanisms for extracting abnormal features from images that we consider normal, which could make it less susceptible to the modulation of diagnostic performance by imbalanced datasets. In any case, our results suggest that equal-sized positive versus negative image data would be desirable for obtaining well-balanced statistical results, although there may be some variation in statistical values depending on the characteristics of the DCNNs, the datasets, or both.

The accuracy improvement curve

In the investigation of the accuracy improvement curves, AlexNet and GoogLeNet demonstrated increasing rates of 3.3% and 8.3% per 1000 positive-negative training image pairs, respectively, calculated from single regression analyses. When an accuracy of 98% or more is needed, more than 9309 and 3490 pairs of training images are required for AlexNet and GoogLeNet, respectively, according to the regression equations. Since a hypothetical accuracy improvement curve does not necessarily show a linear relationship,13 these increasing rates cannot be directly applied to the prediction of sample sizes. However, our testing procedure could be used generally to judge whether the current sample size is sufficient to fully demonstrate the ability of a DCNN.
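
The extrapolation above is simply the regression equation solved for the training-pair count; a minimal check using the reported coefficients:

```python
def pairs_needed(target_accuracy, slope, intercept):
    """Solve target_accuracy = slope * x + intercept for the number of training pairs x."""
    return (target_accuracy - intercept) / slope

print(pairs_needed(0.98, 3.33e-5, 0.67))   # ~9310 pairs for AlexNet
print(pairs_needed(0.98, 8.31e-5, 0.69))   # ~3490 pairs for GoogLeNet
```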

Accuracy prediction by the bootstrap method

Although K-CV is the most widely used method for estimating prediction error,26 it consumes a considerable amount of time and computational resources. The bootstrap method can be used instead of estimation based on parametric assumptions in cases where the hypothesized distribution is suspect or a parametric assumption is impossible or requires very complicated calculation.20,27–29 Our use of the bootstrap method in conjunction with the data extraction and propagation procedure might suggest a way to mitigate these disadvantages of K-CV. That is, the range of the mean accuracy calculated by 10-CV could be predicted using nine (AlexNet) or six (GoogLeNet) out of the 10 result sets of 10-CV with a probability of 95% or more in the current study. We are planning further investigations to explore its theoretical background.

Metastatic lesion detection by R-CNN

Although an R-CNN with an additional auxiliary system, such as the primary selection test, might be effective for improving metastatic lesion detection, with a best accuracy of 50%, the R-CNNs used in our series showed room for improvement. Even 9684 bounding boxes might be insufficient for training an R-CNN to perform effectively. In recent years, many studies have addressed automatic brain metastasis detection using 2D plane data or 3D volume data with DCNNs or non-DCNN methods, reporting sensitivities of up to 98% for lesions no less than 1 mm in diameter.30,31 The evolution of artificial intelligence (AI) algorithms is very rapid,32–35 and we can expect the future development of novel lesion detection systems. In particular, as shown in Fig. 2d, some improper bounding boxes were placed on extra-axial structures such as vessels and ocular globes. Appropriate preprocessing, such as skull-stripping, might improve the performance of DCNNs.

Limitations

Our study has several limitations. We did not use data augmentation methods, and therefore the amount of image data may have been insufficient to derive the full performance of both DCNNs. It is unconfirmed whether the transfer learning method used in the current study is applicable to other pretrained models or superior to full training. AlexNet and GoogLeNet are relatively established DCNNs, and there might be some discrepancy if our results are applied to newly developed DCNNs. In addition, AlexNet and GoogLeNet required conversion of the DICOM format to PNG and JPEG formats, respectively, which might leave room for improvement in how effectively the image information is convolved. There might be some bias in the datasets because the positive and negative images were 'optimally' positive and negative, respectively; that is, we excluded patients with non-metastatic brain tumors, cerebrovascular diseases, or examination artifacts from the control group and removed the two positive cases with movement artifacts from the brain metastases group. We obtained the current results from a single dataset rather than multiple datasets under a retrospective study design, which might carry some inherent bias, such as selection bias, observation bias, or the multiple testing problem. To accurately clarify and confirm our present findings, prospective studies with larger amounts of data must be conducted.

Conclusion

In conclusion, our investigation revealed that, respectively, AlexNet and GoogLeNet had (1) maximal biases of 4.6% and 4.9%, (2) 95% CIs of K-CVs within the respective maximum biases of 4.6% and 4.9% when the K value was no less than 10 or 4, (3) the highest accuracies of 74% with an equal positive versus negative ratio and 91% with a fourfold negative-to-positive image ratio, (4) accuracy improvement curves ranging from 69% to 74% and from 73% to 88% as the number of positive-negative training image pairs increased from 500 to 2495, (5) the range of the mean accuracy predictable by the bootstrap method using nine or six result sets out of the 10 result sets of 10-CV, and (6) metastatic lesion detection accuracies of 50% and 45% by R-CNN in conjunction with the primary selection test. Our series will help to conduct the accuracy investigation of computer diagnosis in medical imaging.

Footnotes

Funding

This work was supported in part by Grants-in-Aid for Scientific Research from the Japan Society for the Promotion of Science under Grant Number 16K10333 and by the Japan Agency for Medical Research and Development (AMED) under Grant Number JP18lk1010028.

Conflicts of Interest

No potential conflicts of interest were reported by the authors.

References

1. Yasaka K, Akai H, Abe O, Kiryu S. Deep learning with convolutional neural network for differentiation of liver masses at dynamic contrast-enhanced CT: a preliminary study. Radiology 2018; 286:887–896.
2. Nakao T, Hanaoka S, Nomura Y, et al. Deep neural network-based computer-assisted detection of cerebral aneurysms in MR angiography. J Magn Reson Imaging 2018; 47:948–953.
3. Lakhani P, Sundaram B. Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology 2017; 284:574–582.
4. Shin HC, Roth HR, Gao M, et al. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans Med Imaging 2016; 35:1285–1298.
5. Hua KL, Hsu CH, Hidayati SC, Cheng WH, Chen YJ. Computer-aided classification of lung nodules on computed tomography images via deep learning technique. Onco Targets Ther 2015; 8:2015–2022.
6. Noguchi T, Higa D, Asada T, et al. Artificial intelligence using neural network architecture for radiology (AINNAR): classification of MR imaging sequences. Jpn J Radiol 2018; 36:691–697.
7. Ueda D, Yamamoto A, Nishimori M, et al. Deep learning for MR angiography: automated detection of cerebral aneurysms. Radiology 2019; 290:187–194.
8. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Proceedings of the 25th International Conference on Neural Information Processing Systems, Volume 1, Curran Associates Inc., Nevada, 2012; 1097–1105.
9. Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Boston, 2015; 1–9. doi: 10.1109/CVPR.2015.7298594
10. Ueda D, Shimazaki A, Miki Y. Technical and clinical overview of deep learning in radiology. Jpn J Radiol 2019; 37:15–33.
11. Tajbakhsh N, Shin JY, Gurudu SR, et al. Convolutional neural networks for medical image analysis: full training or fine tuning? IEEE Trans Med Imaging 2016; 35:1299–1312.
12. Cheng PM, Malhi HS. Transfer learning with convolutional neural networks for classification of abdominal ultrasound images. J Digit Imaging 2017; 30:234–243.
13. Hastie T, Tibshirani R, Friedman J. 7.10.1 K-fold cross validation, In: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer, 2009; 241–245.
14. Stone M. Cross-validatory choice and assessment of statistical predictions. J Royal Stat Soc 1974; 36:111–133.
15. Breiman L, Spector P. Submodel selection and evaluation in regression. The X-random case. Int Stat Rev 1992; 60:291–319.
16. Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the 14th International Joint Conference on Artificial Intelligence, Volume 2, Morgan Kaufmann Publishers Inc., Quebec, 1995; 1137–1143.
17. Geisser S. The predictive sample reuse method with applications. J Am Stat Assoc 1975; 70:320–328.
18. Kotsiantis S, Kanellopoulos D, Pintelas P. Handling imbalanced datasets: a review. GESTS Int Trans Comput Sci Eng 2006; 30:25–36.
19. Guo X, Yin Y, Dong C, Yang G, Zhou G. On the class imbalance problem. 2008 Fourth International Conference on Natural Computation, IEEE Computer Society, Jinan, 2008; 192–201.
20. Efron B. Better bootstrap confidence intervals. J Am Stat Assoc 1987; 82:171–185.
21. Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. Computer Vision and Pattern Recognition 2014, IEEE Computer Society, Columbus, 2014; 580–587.
22. Hastie T, Tibshirani R, Friedman J. 7.11.1 Example (continued), In: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer, 2009; 252–254.
23. Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation: rejoinder. J Am Stat Assoc 1987; 82:548–550.
24. Hussain Z, Gimenez F, Yi D, Rubin D. Differential data augmentation techniques for medical imaging classification tasks. AMIA Annu Symp Proc 2017; 2017:979–984.
25. Goodfellow IJ, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2, MIT Press, Montreal, 2014; 2672–2680.
26. Sakai K, Yamada K. Machine learning studies on major brain diseases: 5-year trends of 2014–2018. Jpn J Radiol 2019; 37:34–72.
27. Efron B, Tibshirani R. Improvements on cross-validation: the .632+ bootstrap method. J Am Stat Assoc 1997; 92:548–560.
28. Hastie T, Tibshirani R, Friedman J. 7.11 Bootstrap methods, In: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer, 2009; 249–252.
29. Efron B. Bootstrap methods: another look at the jackknife. Ann Stat 1979; 7:1–26.
30. Charron O, Lallement A, Jarnet D, Noblet V, Clavier JB, Meyer P. Automatic detection and segmentation of brain metastases on multimodal MR images with a deep convolutional neural network. Comput Biol Med 2018; 95:43–54.
31. Perez-Ramirez U, Arana E, Moratal D. Computer-aided detection of brain metastases using a three-dimensional template-based matching algorithm. Conf Proc IEEE Eng Med Biol Soc 2014; 2014:2384–2387.
32. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation, In: Navab N, Hornegger J, Wells W, Frangi A, eds. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Lecture Notes in Computer Science. Cham, Switzerland: Springer, 2015; 9351:234–241.
33. Girshick R. Fast R-CNN. Computer Vision (ICCV), IEEE Computer Society, Santiago, 2015; 1440–1448.
34. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. Computer Vision and Pattern Recognition (CVPR) 2016, IEEE Computer Society, Las Vegas, 2016; 779–788.
35. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 2017; 39:1137–1149.
