PLOS ONE. 2020 Apr 29;15(4):e0232376. doi: 10.1371/journal.pone.0232376

Deep learning in rare disease. Detection of tubers in tuberous sclerosis complex

Iván Sánchez Fernández 1,*, Edward Yang 1, Paola Calvachi 2, Marta Amengual-Gual 1, Joyce Y Wu 3, Darcy Krueger 4, Hope Northrup 5, Martina E Bebin 6, Mustafa Sahin 1, Kun-Hsing Yu 7, Jurriaan M Peters 1; on behalf of the TACERN Study Group
Editor: Kaiming Li
PMCID: PMC7190137  PMID: 32348367

Abstract

Objective

To develop and test a deep learning algorithm to automatically detect cortical tubers in magnetic resonance imaging (MRI), to explore the utility of deep learning in rare disorders with limited data, and to generate an open-access deep learning standalone application.

Methods

T2 and FLAIR axial images with and without tubers were extracted from MRIs of patients with tuberous sclerosis complex (TSC) and controls, respectively. We trained three different convolutional neural network (CNN) architectures on a training dataset and selected the one with the lowest binary cross-entropy loss in the validation dataset, which was evaluated on the testing dataset. We visualized image regions most relevant for classification with gradient-weighted class activation maps (Grad-CAM) and saliency maps.

Results

114 patients with TSC and 114 controls were divided into a training set, a validation set, and a testing set. The InceptionV3 CNN architecture performed best in the validation set and was evaluated in the testing set with the following results: sensitivity: 0.95, specificity: 0.95, positive predictive value: 0.94, negative predictive value: 0.95, F1-score: 0.95, accuracy: 0.95, and area under the curve: 0.99. Grad-CAM and saliency maps showed that tubers resided in regions most relevant for image classification within each image. A stand-alone trained deep learning App was able to classify images using local computers with various operating systems.

Conclusion

This study shows that deep learning algorithms are able to detect tubers in selected MRI images, and deep learning can be prudently applied clinically to manually selected data in a rare neurological disorder.

Introduction

Tuberous sclerosis complex (TSC) is a genetic neurocutaneous syndrome with an incidence of 1/6,000 to 1/10,000 live births and a population prevalence of 1/12,000 to 1/25,000 [1–3]. The number, size, and morphology of tubers vary widely between individuals [4, 5] and, rarely, TSC can be clinically paucisymptomatic or present only subtle findings on MRI, such as a single lesion [6]. Automating tuber detection in brain MRIs can enhance diagnostic certainty in resource-rich areas and facilitate diagnosis in areas where medical specialists are not readily available.

Convolutional neural networks (CNNs) automatically detect patterns of interest in images and have demonstrated image-classification performance at or above the level of humans [7], including detection of diabetic retinopathy [8], skin cancer [9], echocardiography findings [10], and acute neuroimaging findings [11] at the level of specialist physicians. These studies required many thousands of images to train the CNNs, which are challenging to obtain in rare neurological disorders like TSC and make computerized support of rare disorders difficult to develop. We could not find studies using deep learning to detect tubers in TSC. The application of CNNs in clinical practice is also frequently limited because of privacy concerns.

This study aims to demonstrate that CNNs can be successfully developed for detection of rare brain anomalies on MRI such as tubers using a relatively small number of training images. It also aims to demonstrate how CNNs can be implemented using a thin-client model such that a clinician can use advanced classifier tools at the point of care without the need to transfer patient sensitive data to a third party computer system.

Patients and methods

Ethical approval

The Institutional Review Board at Boston Children’s Hospital approved this study (IRB-P00029015) and determined that it met the regulatory requirements for a waiver of informed consent/authorization from research subjects, as this study was a secondary use of already existing data collected primarily for clinical reasons.

Study design

This study applies CNNs, a type of neural network developed for computer vision, to detect tubers in brain MRI images. For a more in-depth overview of CNNs, transfer learning, data augmentation, and visualization techniques relevant to this article, please see the Supplementary Methods at: https://ivansanchezfernandez.github.io/TSC_supplementary_methods/. All supplementary files are available at dx.doi.org/10.17504/protocols.io.bdt3i6qn.

Patients

Our population of interest consists of children and adolescents with TSC and tubers visible on their MRI. Our representative sample consisted of patients with TSC followed at the Multidisciplinary Tuberous Sclerosis Clinic and controls with normal MRI from Boston Children’s Hospital, a tertiary pediatric center with a heterogeneous population of patients with TSC of a wide variety of ages, severities, and comorbidities. The inclusion criteria for patients with TSC were: 1) a confirmed diagnosis of TSC following the 2012 International Tuberous Sclerosis Complex Consensus Conference Diagnostic Criteria [12], 2) at least one brain MRI at Boston Children’s Hospital, 3) both T2-axial and FLAIR-axial sequences available, and 4) tubers clearly detected on MRI as per the radiology report during routine clinical care. The inclusion criteria for the control group were: 1) at least one brain MRI at Boston Children’s Hospital, and 2) an MRI interpreted by the neuroradiologist as normal or with non-specific findings during routine clinical care. Common indications for obtaining an MRI in controls were headache, concussion, non-syndromic mild developmental delay, and treatment-responsive epilepsy. When several brain MRIs were available for a patient, we selected the most recent MRI to minimize the decreased lesion contrast associated with immature myelination. When a patient developed hydrocephalus or underwent shunt placement or epilepsy surgery, we selected the last brain MRI before such interventions.

MRI findings in TSC

Cortical tubers are not the only findings of TSC in brain MRI [5]. Subependymal nodules and white matter radial migration lines were also present in some of our MRI images. However, since tubers were present in all images labeled as TSC and not present in any of the images labeled as controls, we expected the CNN to detect tubers as the pattern that differentiated TSC patients from controls. On brain MRI, tubers are moderately well-circumscribed areas of increased signal intensity on T2-weighted and fluid-attenuated inversion recovery (FLAIR) images. The cortex overlying the tuber may have features of malformation, and the gray-white matter differentiation is reduced [5].

MRI sequences, image labeling, and division into training, validation, and testing

We only selected two-dimensional axial T2 and FLAIR sequences for both patients with TSC and controls. MRIs were collected for clinical reasons. The objective of the study was to develop a deep learning algorithm to detect tubers in MRI images (not to identify patients with TSC from any image, as not all individual slices from patients with TSC contain lesions). For the MRI of each patient with TSC, a pediatric neurologist (ISF) selected several MRI slices in axial planes with obvious tubers in them. For the controls, the same pediatric neurologist selected MRI slices at approximately the same level in the brain as for patients with TSC. Images from 138 patients (69 TSC and 69 controls) were used to train the model, images from 40 patients (20 TSC and 20 controls) to validate the model, and images from 50 patients (25 TSC and 25 controls) to test the model. There was no patient overlap between the training, validation, and test sets. We developed and validated the model based only on labels derived from routine clinical radiology reports to demonstrate that these models can be developed with data generated from routine clinical care. The labels (TSC or control) for the images in the testing set were independently confirmed by a clinical neuroradiologist (EY).
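As a minimal sketch of how a patient-level split with no subject overlap can be enforced programmatically (the actual selection in this study was manual; the patient counts below come from the text, while the identifier lists and random seed are hypothetical):

```python
import random

# Hypothetical anonymized patient identifiers, one entry per subject.
tsc_patients = ["TSC_%03d" % i for i in range(114)]
control_patients = ["CTRL_%03d" % i for i in range(114)]

def split_patients(patients, n_train=69, n_val=20, n_test=25, seed=42):
    """Shuffle patient IDs and split them so no patient appears in two sets."""
    rng = random.Random(seed)
    shuffled = patients[:]
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

tsc_train, tsc_val, tsc_test = split_patients(tsc_patients)
ctrl_train, ctrl_val, ctrl_test = split_patients(control_patients)
# All slices from a given patient are then assigned only to that patient's set.
```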

Minimizing overfitting

CNNs are complex mathematical functions with a huge number of parameters, which allows them to fit complex datasets well but, at the same time, makes them prone to fitting the training data too closely and generalizing poorly. We used several techniques to minimize overfitting: 1) keeping the training, validation, and test sets completely independent of each other (held-out cross-validation, as is the standard approach in CNNs); the test set consists of data the final deep learning model was never exposed to before and therefore provides a good evaluation of how well the final deep learning model will perform on new data, 2) using random noise, batch normalization, dropout, and global average pooling in the CNN architectures, and 3) using data augmentation (https://ivansanchezfernandez.github.io/TSC_supplementary_methods/).
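As an illustration only (not the authors' exact TSCCNN), the sketch below shows how the regularization layers named above (random noise, batch normalization, dropout, global average pooling) can be combined in a small Keras model; layer sizes and rates are assumptions:

```python
from keras.models import Sequential
from keras.layers import (Conv2D, MaxPooling2D, BatchNormalization, Dropout,
                          GaussianNoise, GlobalAveragePooling2D, Dense)

model = Sequential([
    GaussianNoise(0.1, input_shape=(224, 224, 3)),       # random noise on the input
    Conv2D(32, (3, 3), activation='relu', padding='same'),
    BatchNormalization(),                                 # stabilizes and regularizes training
    MaxPooling2D((2, 2)),
    Dropout(0.25),                                        # randomly drops activations
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    BatchNormalization(),
    GlobalAveragePooling2D(),                             # replaces large dense layers
    Dense(1, activation='sigmoid')                        # binary output: tubers vs. control
])
```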

Data augmentation

We augmented the training set by creating approximately 4 copies of each original image through random shifts, horizontal flips, and rotations. Data augmentation of the training set is a standard training approach in CNNs that encourages the network to learn the essential features of each class and makes it less sensitive to particularities of the individual images used for training, such as laterality (left or right), location of the tuber in the image (for example, upper or lower part of the image), rotation, etc. [13]. In addition, data augmentation of the training set has a crucial role in obtaining good CNN performance with relatively small datasets like the present one, effectively increasing the size of the training set for the CNN to learn from without increasing the number of subjects [13]. We augmented our training data with the ImageDataGenerator class from Keras [14] using the following parameters: rotation range: 30 degrees, width and height shift range: 0.1, shear range: 0.2, zoom range: 0.25, horizontal flips allowed, and nearest neighbors as the fill mode. The validation and test sets were not augmented.
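A minimal sketch of this augmentation using the Keras ImageDataGenerator parameters stated above; the directory layout and batch handling are assumptions:

```python
from keras.preprocessing.image import ImageDataGenerator

# Augmentation applied to the training set only, with the parameters reported above.
train_datagen = ImageDataGenerator(
    rescale=1. / 255,          # normalize pixel values to [0, 1]
    rotation_range=30,         # rotations up to 30 degrees
    width_shift_range=0.1,
    height_shift_range=0.1,
    shear_range=0.2,
    zoom_range=0.25,
    horizontal_flip=True,
    fill_mode='nearest')

# Validation and test images are only rescaled, never augmented.
val_datagen = ImageDataGenerator(rescale=1. / 255)

train_generator = train_datagen.flow_from_directory(
    'train/',                  # hypothetical folder with TSC/ and control/ subfolders
    target_size=(224, 224), batch_size=64, class_mode='binary')
val_generator = val_datagen.flow_from_directory(
    'validation/', target_size=(224, 224), batch_size=64, class_mode='binary')
```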

Variables

The primary outcome was the accuracy of the deep learning algorithm to correctly classify the MRI image as TSC or control in the test set (using the best performing CNN model in the validation set, as explained below). The secondary outcome was the ability of the deep learning algorithm to detect tubers within each image in the test set.

Model development

CNNs process data in the form of arrays. In the case of two-dimensional images, each array is three-dimensional, with two dimensions for width and height and one dimension for the three color channels (red, green, and blue), even when the MRI image is grayscale. Each numerical value in the three-dimensional array represents the pixel intensity in the three color channels used to construct the grayscale MRI image. We downscaled the input size of the images to 224x224x3 and normalized pixel values between 0 and 1, which are standard transformations for input data used in CNNs.

A CNN has an architecture with several layers, so that the input image is transformed in several steps to eventually yield a class prediction [15]. The main types of layers are convolutional layers, pooling layers, and dense (fully-connected) layers. Convolutional layers “scan” different areas of the image trying to find spatial patterns, for example, edges. Pooling layers ensure that local conjunctions of features from the previous layers are detected regardless of their location in the image (translation and rotation invariance). The convolutional and pooling layers automatically perform most of the feature extraction and feature transformation and, towards the end of the CNN architecture, the image is flattened to a one-dimensional vector. Afterward, fully-connected layers translate the feature vector into probabilities of the original image belonging to one of the classes [15].

Each CNN has thousands to millions of parameters to be tuned iteratively. These parameters have no direct interpretability; they are just weights in a complex mathematical function. During the forward pass, the initial weights yield predictions. These predictions are compared with the ground-truth image labels and an error measure is calculated. The error is “backpropagated” with partial derivatives, so that the “responsibility” of each parameter for the error is calculated and the parameter is slightly changed in the direction that reduces that error. Through multiple forward passes and backpropagation steps, the parameters are tuned to optimize classification [15].

We tried three different CNN architectures: 1) tuberous sclerosis complex convolutional neural network (TSCCNN), a relatively simple architecture that we developed with 4 blocks, each consisting of several convolutional layers followed by a pooling layer, and a final block of fully-connected layers (https://ivansanchezfernandez.github.io/TSC_supplementary_methods/), 2) InceptionV3, a popular architecture within the family of CNNs that parallelize computations in a split-transform-merge approach to increase depth and improve accuracy while keeping computations efficient [16, 17], and 3) ResNet50, a popular architecture within the family of residual CNNs, which allow very deep networks by using blocks of layers that behave like relatively shallow classifiers and work together as an ensemble to produce a very good classifier [18–20]. The initial weights for TSCCNN were random, while the initial weights for InceptionV3 and ResNet50 were the transfer learning weights used during the ImageNet competition [21]. Transfer learning is the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned, and it can potentially help performance in relatively small datasets [21]. Initialization weights are tuned in each step of backpropagation during training. However, initialization weights already trained to identify unrelated items in different images are easier to train than random initial weights, because image recognition largely relies on the detection of edges and combinations of edges [21]. For all CNNs, we used Adam [22] as the optimizer with a learning rate of 0.00025, binary cross-entropy as the loss function, and a batch size of 64 with 100 epochs.
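A minimal sketch of one of the three candidate models (InceptionV3 initialized with the ImageNet transfer learning weights) with the optimizer, learning rate, loss, batch size, and epochs stated above; the exact classification head (pooling, dropout rate) is an assumption:

```python
from keras.applications import InceptionV3
from keras.layers import GlobalAveragePooling2D, Dropout, Dense
from keras.models import Model
from keras.optimizers import Adam

# InceptionV3 backbone initialized with ImageNet (transfer learning) weights.
base = InceptionV3(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

x = GlobalAveragePooling2D()(base.output)    # collapse spatial maps to a feature vector
x = Dropout(0.5)(x)                          # assumed dropout rate
output = Dense(1, activation='sigmoid')(x)   # probability of the slice containing tubers
model = Model(inputs=base.input, outputs=output)

model.compile(optimizer=Adam(lr=0.00025),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Generators as defined in the data augmentation sketch; batch size 64, 100 epochs.
model.fit_generator(train_generator, epochs=100, validation_data=val_generator)
```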

Model validation

The validation set evaluates how well the CNNs generalize to images they were not trained on. We selected the most generalizable model as the one with the lowest binary cross-entropy loss in the validation set. Binary cross-entropy loss is a more granular and more stable measure of generalizability than accuracy, since it is a continuous measure, as opposed to the dichotomous correct/incorrect classification underlying accuracy.
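A sketch of this selection step, assuming three trained candidate models (the variable names are hypothetical) and the validation generator from the earlier sketch:

```python
candidates = {'TSCCNN': tsccnn_model,
              'InceptionV3': inceptionv3_model,
              'ResNet50': resnet50_model}   # hypothetical variable names

val_losses = {}
for name, m in candidates.items():
    # Returns [binary cross-entropy loss, accuracy] because metrics=['accuracy'].
    loss, acc = m.evaluate_generator(val_generator)
    val_losses[name] = loss

best_name = min(val_losses, key=val_losses.get)   # lowest validation loss wins
best_model = candidates[best_name]
```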

Model testing

While the purpose of model validation is to select the potentially most generalizable CNN, the purpose of model testing is to actually assess generalizability on a completely new set of images: the test set. The primary outcome of classification accuracy was calculated as the proportion of images in the test set labeled correctly by the model. There was no patient overlap between the training, validation, and test sets; therefore, the images in the test set came from patients who were not used in the training or validation sets and to which the CNNs had never been exposed.

Model development versus clinical practice: Clinical cases

The development of a CNN entails obtaining observations (in our study, MRI images) and their labels (in our study, images containing tubers or images derived from a normal MRI) by a human to first train, then validate, and finally test the model. However, once the model has been trained, validated, and tested, its use in clinical practice is to assist a clinician in recognizing new observations as belonging to one of the labels (diagnoses). In our study, that would mean recognizing whether an MRI image has tubers in it or not. As an illustration of how we envision the clinical use of our deep learning algorithm, we further evaluated the selected deep learning architecture on a series of 259 consecutive MRI T2 and FLAIR axial images from 6 additional patients with TSC. These images were not used in training, validation, or testing because of challenging characteristics, such as extremely subtle tubers or limited myelination.

Model visualization

Machine learning can sometimes operate as a “black box” where it is not possible to know which features are used in the classification process. However, model visualization in CNNs helps clarify the most relevant features used for classification. For example, a deep learning model used to diagnose disease from chest X-rays classified images as having pneumothorax because of the presence of a chest tube in the image [23]. Similarly, some deep learning models classified chest X-rays as normal or pathological by focusing on areas in the image that indicated whether the image was obtained in the hospital (portable X-ray) or in the doctor’s office [24]. Intuitive and graphical indications of how the CNN is making decisions help users gain confidence in the model. Thus, to identify potential correct classifications based on incorrect features and to gain some intuition into the classification process, we identified the image pixels most relevant for classification with gradient-weighted class activation maps (Grad-CAM) and saliency maps. Grad-CAM uses the gradient of the output category (TSC or control) with respect to the last convolutional layer (the last layer with spatial information) to provide a coarse localization of the areas of most interest for classification (where the CNN is “preferentially looking” when making decisions) [25]. The objective of Grad-CAM and saliency maps is to make sure the CNN is classifying MRI slices based on what it is supposed to. The objective of this project was not to segment the patterns of interest within each image [26]. Saliency maps compute the gradient of the output category (TSC or control) with respect to the original image, that is, they identify the pixels in the original image that, if changed, would most modify the probability of the image belonging to the class [27]. Roughly, class activation maps and saliency maps can be interpreted as “where the CNN is looking” or, more precisely, which areas in the image are most important for classification into TSC versus control. Map visualizations are heatmaps of the gradients, with “hotter” colors representing the regions of most importance for classification.
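A minimal sketch of these two visualizations using the keras-vis library [34]; the image variable, the swap of the final activation, and the indices used are assumptions:

```python
import numpy as np
from keras import activations
from vis.utils import utils
from vis.visualization import visualize_cam, visualize_saliency

# Replace the final sigmoid by a linear activation, as commonly recommended for keras-vis.
layer_idx = utils.find_layer_idx(model, model.layers[-1].name)
model.layers[layer_idx].activation = activations.linear
model = utils.apply_modifications(model)

img = np.random.rand(224, 224, 3)   # placeholder for a preprocessed MRI slice

# Grad-CAM: gradient of the output with respect to the last convolutional layer
# (coarse localization of the regions driving the classification).
cam = visualize_cam(model, layer_idx, filter_indices=0, seed_input=img)

# Saliency map: gradient of the output with respect to the input image
# (pixel-level attribution in the original image).
saliency = visualize_saliency(model, layer_idx, filter_indices=0, seed_input=img)
```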

Statistical analysis and software

We summarized demographic and basic clinical data with descriptive statistics and calculated basic comparisons with the Wilcoxon rank-sum test for continuous variables and Fisher’s exact test for categorical variables. We used Python version 3.6 [28] as the programming language and, within this language, its modules: Pandas [29], NumPy [30], and SciPy.Stats [31] to perform descriptive statistics, Keras [14] and scikit-learn [32] to build and analyze the CNNs, TensorFlow [33] as the backend for CNNs, and Keras-vis for visualization of class activation maps and saliency maps [34]. We used Jupyter notebooks to run and save the code and results [35]. We trained and validated our models with anonymized MRI images in a cloud computing system and tested the selected model on our local computer. We used a default threshold of 0.5: if the estimated probability was greater than 0.5, the MRI slice was classified as having tubers; otherwise, it was classified as not having tubers.
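A sketch of the test-set evaluation with the 0.5 threshold described above, using scikit-learn; the test generator variable is an assumption and must be created with shuffling disabled so that labels align with predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

# Predicted probabilities for the test images (1 = slice with tubers, 0 = control).
# test_generator is hypothetical and assumed to have been built with shuffle=False.
probs = best_model.predict_generator(test_generator).ravel()
y_true = test_generator.classes                  # ground-truth labels from the folders
y_pred = (probs > 0.5).astype(int)               # default 0.5 decision threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
accuracy = (tp + tn) / (tp + tn + fp + fn)
auc = roc_auc_score(y_true, probs)
```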

Results

Demographic and clinical features

The TSC patients and controls included in the study were similar, although TSC patients were slightly younger than controls at the time of imaging (median 9.5 versus 12.4 years), a difference that was not statistically significant (Table 1). The data were divided into 566 images for TSC training (69 patients), 130 images for TSC validation (20 patients), 210 images for TSC testing (25 patients), 561 images for control training (69 patients), 118 images for control validation (20 patients), and 226 images for control testing (25 patients). The training file, which merged the TSC and control training images after data augmentation, contained 5,634 images. The validation file, which merged the TSC and control validation images, contained 248 images. The testing file, which merged the TSC and control testing images, contained 436 images.

Table 1. Demographic features in our population.

                                                   TSC              Control          Test statistic and p-value
Age in years [median (p25-p75)]                    9.5 (5–15.3)     12.4 (6.9–15.7)  Wilcoxon rank-sum test: -1.51; p-value: 0.13
Sex (male:female)                                  64:50            61:53            Fisher’s exact test odds ratio: 1.11; p-value: 0.79
Number of images per patient [median (p25-p75)]    8 (5–10)         8 (7–8)          Wilcoxon rank-sum test: 0.27; p-value: 0.79

p25-p75: 25th and 75th percentiles.

Validation set

The class activation maps in the validation set showed that all the models were classifying images based on the presence of tubers, rather than on other features. The best performing CNN in the validation set was InceptionV3 (loss: 0.5325) followed by ResNet50 (loss: 0.5400), and TSCCNN (loss: 1.0416). Therefore, we used InceptionV3 in the final model. Other performance values are summarized in S1 Table at https://ivansanchezfernandez.github.io/TSC_Supplementary_Table_S1 and also show that InceptionV3 performed better than the other CNN architectures in the validation set.

Evaluation in the test set

InceptionV3 had an accuracy of 0.95 and an area under the receiver operating characteristic curve of 0.99 in the test set with a sensitivity of 0.95, specificity of 0.95, positive predictive value of 0.94, negative predictive value of 0.95, and F1 score (the harmonic mean of positive predictive value and sensitivity) of 0.95 (Table 2).

Table 2. Performance of InceptionV3 in the test set.

InceptionV3 (accuracy: 0.95; AUC: 0.99)            Real classification
                                                   TSC        Control
Predicted classification     TSC                   199        12          PPV: 0.94
                             Control                11        214         NPV: 0.95
                                                   Sen: 0.95  Spec: 0.95  F1: 0.95

AUC: Area under the receiver operator characteristic curve. F1: F1-score. NPV: Negative predictive value. PPV: Positive predictive value. Sen: Sensitivity. Spec: Specificity.
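For reference, the summary metrics in Table 2 follow directly from the confusion-matrix counts; a quick arithmetic check in Python:

```python
tp, fp, fn, tn = 199, 12, 11, 214          # counts from Table 2 (TSC = positive class)

sensitivity = tp / (tp + fn)               # 199 / 210 ~ 0.95
specificity = tn / (tn + fp)               # 214 / 226 ~ 0.95
ppv = tp / (tp + fp)                       # 199 / 211 ~ 0.94
npv = tn / (tn + fn)                       # 214 / 225 ~ 0.95
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)   # ~ 0.95
accuracy = (tp + tn) / (tp + tn + fp + fn)         # 413 / 436 ~ 0.95
```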

Model visualization

Although the purpose of this study was not segmentation, which requires a different CNN architecture, the class activation maps and saliency maps showed that the deep learning algorithm was focusing on tubers to classify the images (Figs 1 and 2). All the figures with the Grad-CAM and saliency maps for the 436 testing images can be found at: https://ivansanchezfernandez.github.io/TSC_VisualizationI/#images, https://ivansanchezfernandez.github.io/TSC_VisualizationII/#images, and https://ivansanchezfernandez.github.io/TSC_VisualizationIII/#images. A graph with the colorbar can be found at https://ivansanchezfernandez.github.io/TSC_heatmap_colorbar/.

Fig 1. Correctly classified images.


A. InceptionV3 was able to localize all or most tubers in this image with scattered and sometimes subtle tubers. B. InceptionV3 was able to localize the three relatively well-defined tubers in this image. C. InceptionV3 was able to localize the relatively well-defined tuber in this image. Although the image was classified as having tuber(s), the estimated probability was 0.71, as opposed to >0.99 for A and B. The first column represents the original image, the second column the map, and the third column the map superimposed on the original image. The first row represents the gradient-weighted class activation map, and the second row represents the saliency map. Both gradient-weighted class activation map and saliency map visualizations are based on gradients. The gradient is the partial derivative of the loss function for each pixel in the image of reference (the last convolutional layer for gradient-weighted class activation maps and the original image for saliency maps). Gradient-weighted class activation maps use the gradient of the output category with respect to the last convolutional layer (the last layer with spatial information). Saliency maps use the gradient of the output category with respect to the original image. Both methods identify the pixels (in the last convolutional layer for gradient-weighted class activation maps and in the original image for saliency maps) that, if changed, would most modify the probability of the image belonging to the specific class (TSC or control). The resulting visualization is a heat map with values normalized between -1 (purple) and 1 (yellow), with hotter colors representing areas of greater importance for classification (see color bar at https://ivansanchezfernandez.github.io/TSC_heatmap_colorbar/). If you are not familiar with tubers, good examples can be found in Fig 1 of the Peters et al article summarizing neuroimaging in TSC [5]. A version of the images with arrows pointing to the tubers is available as S1 Fig at https://ivansanchezfernandez.github.io/TSC_Supplementary_Figures/.

Fig 2. Incorrectly classified images.


We would like to emphasize that incorrectly classified images represented only approximately 5% of the test set, but they sometimes provide insights into the reasons for misclassification. A. InceptionV3 classified this image as having tuber(s) with an estimated probability of 0.82, although it belonged to a control patient. The maps suggest a focus on prominent vascular spaces in the white matter suggestive of radial migration lines. B. InceptionV3 classified this image as having no tuber(s) despite the radiologist-confirmed subtle tuber in the right occipital region. The maps show a focus in that region, but the model estimated a probability of having tuber(s) of only 4%. C. Although this occurred in a tiny minority of images, this image shows that sometimes the tuber is completely missed and the focus of the maps is not necessarily informative. The estimated probability of having tuber(s) was less than 1%. The first column represents the original image, the second column the map, and the third column the map superimposed on the original image. The first row represents the gradient-weighted class activation map, and the second row represents the saliency map. Both gradient-weighted class activation map and saliency map visualizations are based on gradients. The gradient is the partial derivative of the loss function for each pixel in the image of reference (the last convolutional layer for gradient-weighted class activation maps and the original image for saliency maps). Gradient-weighted class activation maps use the gradient of the output category with respect to the last convolutional layer (the last layer with spatial information). Saliency maps use the gradient of the output category with respect to the original image. Both methods identify the pixels (in the last convolutional layer for gradient-weighted class activation maps and in the original image for saliency maps) that, if changed, would most modify the probability of the image belonging to the specific class (TSC or control). The resulting visualization is a heat map with values normalized between -1 (purple) and 1 (yellow), with hotter colors representing areas of greater importance for classification (see color bar at https://ivansanchezfernandez.github.io/TSC_heatmap_colorbar/). If you are not familiar with tubers, good examples can be found in Fig 1 of the Peters et al article summarizing neuroimaging in TSC [5]. A version of the images with arrows pointing to the tubers (except for 2A, which had no tubers) is available as S2 Fig at https://ivansanchezfernandez.github.io/TSC_Supplementary_Figures/.

Evaluation in clinical cases

To illustrate how readers can apply this deep learning model, we applied it to 6 new patients with TSC and particularly challenging radiological features such as very subtle tubers or age-related incomplete myelination. The deep learning model differentiated MRI slices with tubers from those without tubers with a sensitivity of 0.67, specificity of 0.68, positive predictive value of 0.84, negative predictive value of 0.46, and F1 score of 0.75. See the results and images at: https://ivansanchezfernandez.github.io/TSC_TestCases/.

Interactive model

The same trained deep learning model used in this study is packaged in an App, so it will produce the same results when exposed to the same images. This App avoids privacy challenges because it contains no images, only the final model, which users can test with their own images. We encourage readers to download and use the App on their own set of TSC test cases. Step-by-step use instructions are available for Windows at: https://ivansanchezfernandez.github.io/TSC_TuberFinder_Windows/ and for Apple at: https://ivansanchezfernandez.github.io/TSC_TuberFinder_Apple/. The App does not train the model, it only tests it with new images; therefore, it does not require a computer with a graphics processing unit (GPU) and takes approximately one minute per image.
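A sketch of how such local, offline inference on a single image can work, assuming the trained model has been exported to an HDF5 file; the file and image paths are hypothetical, and the packaged App wraps equivalent logic:

```python
import numpy as np
from keras.models import load_model
from keras.preprocessing.image import load_img, img_to_array

# Load the trained model from disk; no training and no GPU are required for inference.
model = load_model('tuber_finder_inceptionv3.h5')   # hypothetical file name

# Preprocess a single axial T2 or FLAIR slice the same way as during training.
img = load_img('patient_slice.jpg', target_size=(224, 224))
x = img_to_array(img) / 255.0                       # normalize pixel values to [0, 1]
x = np.expand_dims(x, axis=0)                       # add the batch dimension

prob = float(model.predict(x)[0][0])
print('Estimated probability of tuber(s): %.2f' % prob)
print('Classified as TSC' if prob > 0.5 else 'Classified as control')
```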

Discussion

This study shows that deep learning algorithms for recognizing patterns in brain MRIs can perform very well even when the number of patients and images is relatively small, as is the case for rare neurological conditions. Further, this study shows that deep learning models trained in a cloud computing environment can be made portable to deploy on local computers, avoiding many hurdles related to privacy. Our best performing deep learning algorithm (InceptionV3) identified MRI slices with tubers and the areas of interest within each slice at a level similar to that of a clinical radiologist. This system could provide diagnostic support in cases of suspected TSC, in non-academic settings where exposure to rare diseases is limited, and in low-resource areas with a limited number of expert medical specialists. At this time, however, clinical applicability is limited to discrimination of images with tubers from those without such lesions.

One of the major advantages of CNNs is that there is no need to specify the features that allow tuber recognition: CNNs automatically learn by examining multiple examples with the pattern and multiple examples without the pattern [15]. CNNs have revolutionized image classification and pattern detection within images in the last few years [15]. The ImageNet Large Scale Visual Recognition Challenge, popularly known as ImageNet, represents the benchmark for large-scale object recognition in computer vision and has allowed enormous progress in computer vision since 2012 [7]. In 2010 and 2011, the error rate of computer vision classification systems in ImageNet was above 25%, but the use of CNNs and enormous computing power with graphics processing units (GPUs) reduced that error below 10% in the period 2012–2014 [7]. Successive improvements in CNN architecture and optimization techniques reduced the error rate of CNNs at or below the error rate of human classification starting in 2015 [7].

Advances within the computer vision field to detect and localize common patterns in images such as dogs, cats, boats, cars, etc. were quickly translated to image recognition in medicine. Using more than 1 million retinal fundoscopy images, a team at Google developed and trained a CNN that was able to classify diabetic retinopathy at or above the level of ophthalmologists in two different testing sets [8]. Similarly, a team of researchers at Stanford University used 127,463 images to train and validate the InceptionV3 architecture (pre-trained with the ImageNet weights) and was able to detect and classify skin cancers with accuracy similar to that of dermatologists on a test set of 1,942 biopsy-labeled images [9]. In the field of neurology, using a dataset of hundreds of thousands of head CT scans, a multicenter study from India trained a CNN to detect specific critical findings (intracranial hemorrhage and its types, fractures, midline shift, and mass effect) with an area under the receiver operating characteristic curve above 0.9 in most categories [11].

Unfortunately, these numbers of training images are not feasible in rare neurological conditions. For example, the largest multicenter TSC registry (TOSCA) contains clinical data for 2,093 patients but no neuroimaging data [37], and the largest TSC multicenter study with neuroimaging data, from the TSC Autism Centers of Excellence Research Network (TACERN), contains only 390 MRIs from 143 subjects [38]. Our image set, which originated from a single hospital, contained 1,127 original images for training (566 in the TSC training set and 561 in the control training set), much smaller than in previous deep learning studies. However, data augmentation generated approximately 4 artificial new images per original image, and the resulting training set of 5,634 images was enough to successfully train our CNNs, as shown by the network convergence plots, and to achieve very good performance measures and localization of the tubers, as shown by the Grad-CAM and saliency maps. We compared three commonly used CNN architectures: TSCCNN (a CNN architecture we developed based on the popular approach of several blocks of convolutional layers followed by a max-pooling layer, finished by a set of fully-connected layers, and initialized with random weights), InceptionV3 (a CNN architecture developed by Google, initialized with the ImageNet weights), and ResNet50 (a CNN developed by Microsoft, initialized with the ImageNet weights, which won the 2015 ImageNet challenge). Although these architectures are complex and the number of weights is very large, these CNNs are relatively easy to train and apply to new image recognition tasks.

A major limitation of prior clinical applications of deep learning algorithms is that they are seldom applied outside the original study. The enormous computing resources needed to train a CNN require a cloud computing environment, with resultant complications in maintaining privacy and confidentiality. Recently, a deep learning system to detect mammographic breast density was introduced into a clinical workflow, but still within the same hospital system that developed the CNN [39]. To overcome the portability limitation, we have developed an interactive standalone application with the best performing model (InceptionV3) to let readers test this method on their local images. Our model was developed in a cloud environment, but with anonymized images, not with identifiable full MRIs. A novel aspect of our study is that our application contains the trained model, but not the original images with which it was trained, overcoming the privacy issues that result from sharing identifiable patient imaging data. This application will also allow clinicians and radiologists to easily apply a deep learning algorithm to their own patients’ images inside their own hospital’s computing environment, without the need to share patient data with a third party. We hope this standalone application will fuel interest in this approach and allow first-hand experience with how CNN technology may potentially impact future clinical practice. We believe that projects applying deep learning to medicine with a practical application bridge the world of clinical medicine and the world of computer science and may fuel interdisciplinary collaboration.

Neural networks are extremely complex mathematical functions that map raw inputs (images in this case) to outputs (image classification into TSC or control in this case) with minimal to no human guidance [40]. Their complexity also makes them one of the least interpretable techniques among machine learning algorithms [40]. Neural networks may make errors with the same or lower probability than human-driven medicine [8, 9, 11, 41], but their implementation remains slow because the rationale for incorrect classifications cannot be explained, raising complex liability issues [42]. Fortunately, important features used by CNNs for image classification can be depicted with Grad-CAM and saliency maps. These techniques can be loosely interpreted as “attention maps”: where within the image the CNN is paying attention in order to classify the image into a certain category. Of note, these maps are not to be interpreted as lesion segmentation maps. Lesion segmentation requires training images with manually marked lesion edges and a different CNN architecture. In this particular study, Grad-CAM and saliency maps clearly show that the CNNs classify images based on the detection of tubers rather than on other spurious information that may coexist with tubers.

Strengths and weaknesses

The present model achieved a classification performance close to that of the gold standard (a clinical neuroradiologist (EY) with formal training in radiology, neuroradiology, and pediatric neuroradiology and part of the epilepsy center at our institution) despite being subject to multiple constraints, which shows its robustness: 1) the CNNs were trained on a relatively small dataset, 2) the original images were two-dimensional .jpg images whose resolution was reduced to 224x224 pixels, a much lower resolution and image quality than that used by radiologists to interpret MRIs, 3) the CNNs were standard architectures developed for multipurpose large-scale object recognition, that is, no specific architectural changes were made to fit the CNNs to the task of identifying patterns in MRI images, and 4) the TSC patients were heterogeneous in the number and appearance of tubers on MRI. Despite these challenges, our work shows that CNNs are able to achieve a classification performance close to the gold standard, which may make them able to guide attention to the MRI slices and the areas within each slice where tubers are more likely. CNNs offer the advantages of objectivity (the deep learning model outputs an estimated probability of an image having tubers) and consistency (the deep learning model always returns the same result when presented with the same image). Large repositories of MRI images analyzed with more refined CNNs may further increase sensitivity and specificity to a level where this approach can be used in routine clinical practice. It was not the objective of this study to perform image segmentation. In the future, if a set of MRI images with tuber borders delineated by a radiologist becomes available, other architectures may be used to actually segment tubers. CNN architectures specifically designed for automatic segmentation, such as “U”-shaped architectures or fully convolutional neural networks, may further improve the localization of individual tubers within an MRI slice at the pixel level. Similar to some prior deep learning studies [8, 9, 43, 44], we performed a binary classification task to differentiate between MRI images with and without tubers. The performance of this deep learning model when presented with confounding pathologies (tumors, white matter lesions in multiple sclerosis, etc.) is unknown. In the future, once we gather enough MRI images with other pathologies, we aim to develop multiclass classifiers that recognize multiple different pathologies on brain MRI. The performance of this deep learning model (trained only on subjects 5 years old or older) was very good when presented with tubers in the context of a more immature myelination status, as shown by the very challenging test cases (https://ivansanchezfernandez.github.io/TSC_TestCases/).

Deep learning algorithms have been implemented in medicine mostly within the limits of research studies. Their application in routine clinical practice has been limited by huge computational requirements and privacy concerns. By integrating a trained deep learning algorithm within a standalone application, we demonstrate that application in routine clinical practice and portability of trained CNNs are feasible. Further, as the deep learning algorithm within the application processes images on the local computer, it can be used with protected patient information as long as the local computer is within a HIPAA-compliant healthcare environment.

Conclusion

This study shows that CNNs trained on a relatively small dataset of manually selected low-resolution images are able to detect and localize tubers with a performance close to the gold standard of the neuroradiologist in clinical cases. These results, obtained through data augmentation, serve as a model of how to prudently apply deep learning research algorithms, even when neurological conditions and images for training are rare. This study also shows a model for disseminating deep learning models trained locally to a global audience while overcoming privacy hurdles.

Acknowledgments

We are sincerely indebted to the generosity of the families and patients in TSC clinics across the United States who contributed their time and effort to this study. We would also like to thank the Tuberous Sclerosis Alliance for their continued support in TSC research.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

We would like to acknowledge the TACERN co-investigators.

Name Affiliation
Mustafa Sahin, MD, PhD Boston Children’s Hospital, Harvard Medical School, Boston, MA
Jurriaan M. Peters, MD, PhD Boston Children’s Hospital, Harvard Medical School, Boston, MA
Simon K. Warfield, PhD Computational Radiology Laboratory, Department of Radiology, Boston Children’s Hospital & Harvard Medical School, Boston, MA
Monisha Goyal, MD Department of Neurology, University of Alabama at Birmingham, Birmingham, AL
Deborah A. Pearson, PhD Department of Psychiatry and Behavioral Sciences, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX
Marian E. Williams, PhD Keck School of Medicine of USC, University of Southern California, Los Angeles, California
Darcy Krueger, MD, PhD Cincinnati Children's Hospital Medical Center, Cincinnati, OH
Ellen Hanson, PhD Department of Developmental Medicine, Boston Children’s Hospital, Boston, MA
Nicole Bing, PsyD Department of Developmental and Behavioral Pediatrics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
Hope Northrup, MD The University of Texas Health Science Center at Houston, TX
Bridget Kent, MA, CCC-SLP Department of Developmental and Behavioral Pediatrics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
Sarah O’Kelley, PhD University of Alabama at Birmingham, Birmingham, AL
Martina E. Bebin, MD, MPA University of Alabama at Birmingham, AL
Rajna Filip-Dhima, MS F.M. Kirby Neurobiology Center, Boston Children’s Hospital, Harvard Medical School, Boston, MA
Kira Dies, ScM, CGC F.M. Kirby Neurobiology Center, Boston Children’s Hospital, Harvard Medical School, Boston, MA
Joyce Y. Wu, MD Mattel Children's Hospital, David Geffen School of Medicine at University of California Los Angeles, CA
Stephanie Bruns Cincinnati Children’s Hospital Medical Center, Cincinnati, OH
Benoit Scherrer, PhD Computational Radiology Laboratory, Department of Radiology, Boston Children’s Hospital & Harvard Medical School, Boston, MA
Gary Cutter, PhD University of Alabama at Birmingham, Data Coordinating Center, Birmingham, AL
Donna S. Murray, PhD Autism Speaks
Steven L. Roberds, PhD Tuberous Sclerosis Alliance

Mustafa Sahin, MD, PhD (mustafa.sahin@childrens.harvard.edu) and Darcy Krueger, MD, PhD (darcy.krueger@cchmc.org) are the lead authors at TACERN.

Data Availability

All code and results are available. The original neuroimages are not directly available because public distribution of patient data must be approved by the Institutional Review Board. There are legal and ethical restrictions on how to share clinical data. Medical researchers with proper training in the ethical management of de-identified clinical data should contact the Institutional Review Board at Boston Children’s Hospital (IRB@childrens.harvard.edu) to request data access. All code with results and full models can be found at: https://ivansanchezfernandez.github.io/TSC_code_results_models/. The code on training and validation includes a network convergence plot for each model showing the decrease in loss and increase in accuracy with plateauing, indicating that the number of training examples was sufficient. In addition, we have released our best CNN model as an easy-to-use App built with Kivy, version 1.10.1 [36], so that readers can test the model on their own MRI images. The App and instructions for Windows can be found at: https://ivansanchezfernandez.github.io/TSC_TuberFinder_Windows/. The App and instructions for Apple can be found at: https://ivansanchezfernandez.github.io/TSC_TuberFinder_Apple/.

Funding Statement

JMP, MS, HN, JYW, DK and MEB were supported by the National Institute of Neurological Disorders And Stroke of the National Institutes of Health (NINDS) and Eunice Kennedy Shriver National Institute of Child Health & Human Development (NICHD) under Award Number U01NS082320. ISF has received an Amazon Web Services Cloud Credits for Research support in the form of computational credits for his project on “Identification and localization of tubers in Tuberous Sclerosis Complex with deep learning convolutional neural networks”. JYW, DK, HN, MEB, MS, and JP received funding to collect the data as a part of the TACERN collaborative. The funders had no additional role in study design, data analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.

References

  • 1.O'Callaghan FJ, Shiell AW, Osborne JP, Martyn CN. Prevalence of tuberous sclerosis estimated by capture-recapture analysis. Lancet. 1998;351(9114):1490. [DOI] [PubMed] [Google Scholar]
  • 2.Osborne JP, Fryer A, Webb D. Epidemiology of tuberous sclerosis. Annals of the New York Academy of Sciences. 1991;615:125–7. 10.1111/j.1749-6632.1991.tb37754.x [DOI] [PubMed] [Google Scholar]
  • 3.Sampson JR, Scahill SJ, Stephenson JB, Mann L, Connor JM. Genetic aspects of tuberous sclerosis in the west of Scotland. Journal of medical genetics. 1989;26(1):28–31. 10.1136/jmg.26.1.28 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Crino PB, Henske EP. New developments in the neurobiology of the tuberous sclerosis complex. Neurology. 1999;53(7):1384–90. 10.1212/wnl.53.7.1384 [DOI] [PubMed] [Google Scholar]
  • 5.Peters JM, Taquet M, Prohl AK, Scherrer B, van Eeghen AM, Prabhu SP, et al. Diffusion tensor imaging and related techniques in tuberous sclerosis complex: review and future directions. Future neurology. 2013;8(5):583–97. 10.2217/fnl.13.37 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Kassiri J, Snyder TJ, Bhargava R, Wheatley BM, Sinclair DB. Cortical tubers, cognition, and epilepsy in tuberous sclerosis. Pediatric neurology. 2011;44(5):328–32. 10.1016/j.pediatrneurol.2011.01.001 [DOI] [PubMed] [Google Scholar]
  • 7.Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. 2015. Available from: https://arxiv.org/pdf/1409.0575.pdf.
  • 8.Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. Jama. 2016;316(22):2402–10. 10.1001/jama.2016.17216 [DOI] [PubMed] [Google Scholar]
  • 9.Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–8. 10.1038/nature21056 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Zhang J, Gajjala S, Agrawal P, Tison GH, Hallock LA, Beussink-Nelson L, et al. Fully Automated Echocardiogram Interpretation in Clinical Practice. Circulation. 2018;138(16):1623–35. 10.1161/CIRCULATIONAHA.118.034338 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Chilamkurthy S, Ghosh R, Tanamala S, Biviji M, Campeau NG, Venugopal VK, et al. Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study. Lancet. 2018;392(10162):2388–96. 10.1016/S0140-6736(18)31645-3 [DOI] [PubMed] [Google Scholar]
  • 12.Northrup H, Krueger DA, International Tuberous Sclerosis Complex Consensus Group. Tuberous sclerosis complex diagnostic criteria update: recommendations of the 2012 International Tuberous Sclerosis Complex Consensus Conference. Pediatric neurology. 2013;49(4):243–54. 10.1016/j.pediatrneurol.2013.08.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Wang J, Perez L. The effectiveness of data augmentation in image classification using deep learning. ArXiv. 2017.
  • 14.Chollet F. keras. GitHub; 2015.
  • 15.LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44. 10.1038/nature14539 [DOI] [PubMed] [Google Scholar]
  • 16.Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. 2014. Available from: https://arxiv.org/pdf/1409.4842.pdf.
  • 17.Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. arXiv [Internet]. 2015. Available from: https://arxiv.org/pdf/1512.00567.pdf.
  • 18.He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. arXiv [Internet]. 2015. Available from: https://arxiv.org/pdf/1512.03385.pdf.
  • 19.Veit A, Wilber M, Belongie S. Residual networks behave like ensembles of relatively shallow networks. URL: https://arxiv.org/pdf/1605.06431.pdf. ArXiv. 2016.
  • 20.Xie S, Girshick R, Dollár P, Tu Z, He K. Aggregated residual transformations for deep neural networks. arXiv [Internet]. 2017. Available from: https://arxiv.org/pdf/1611.05431.pdf.
  • 21.Torrey L, Shavlik J. Transfer Learning. In: Soria Olivas E, Martin Guerrero JD, Martinez Sober M, Magdalena Benedito JR, Serrano Lopez AJ, editors. Handbook of research on machine learning applications and trends. Hershey, PA: IGI Global; 2010. [Google Scholar]
  • 22.Kingma DP, Ba JL. Adam: a method for stochastic optimization. 2017. Available from: https://arxiv.org/pdf/1412.6980.pdf.
  • 23.Rajpurkar P, Irvin J, Ball RL, Zhu K, Yang B, Mehta H, et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS medicine. 2018;15(11):e1002686 10.1371/journal.pmed.1002686 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Zech J. What are radiological deep learning models actually learning? 2018. Available from: https://medium.com/@jrzech/what-are-radiological-deep-learning-models-actually-learning-f97a546c5b98.
  • 25.Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. arXiv. 2017.
  • 26.Long J, Shelhamer E, Darrell T. Fully Convolutional Networks for Semantic Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017;39(4):640–51. 10.1109/TPAMI.2016.2572683 [DOI] [PubMed] [Google Scholar]
  • 27.Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualizing image classification models and saliency maps. arXiv. 2014.
  • 28.Python Software Foundation. Python Language Reference, version 3.6. Available at http://www.python.org. 2018.
  • 29.McKinney W. Data Structures for Statistical Computing in Python. In: Proceedings of the 9th Python in Science Conference; 2010.
  • 30.Oliphant TE. A guide to NumPy. Trelgol Publishing; 2006.
  • 31.SciPy. SciPy.Stats. URL: https://docs.scipy.org/doc/scipy/reference/stats.html. 2019.
  • 32.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 12: 2825–2830. 2011. [Google Scholar]
  • 33.Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv; 2016.
  • 34.Kotikalapudi R. keras-vis. URL: https://github.com/raghakot/keras-vis. GitHub; 2017.
  • 35.Kluyver T, Ragan-Kelley B, Pérez F, Granger B, Bussonnier M, Frederic J, et al. Jupyter Notebooks–a publishing format for reproducible computational workflows. 2014.
  • 36.Virbel M, Pettier G, Arora A, Kovac J, Taylor A, Einhorn M, et al. Kivy—Open source Python library for rapid development of applications, version 1.10.1. 2019.
  • 37.Kingswood JC, d'Augeres GB, Belousova E, Ferreira JC, Carter T, Castellana R, et al. TuberOus SClerosis registry to increase disease Awareness (TOSCA)—baseline data on 2093 patients. Orphanet journal of rare diseases. 2017;12(1):2 10.1186/s13023-016-0553-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Tuberous Sclerosis Complex Autism Center of Excellence Research Network. TACERN 2019 [March, 21, 2019]. Available from: http://tscstudy.com/.
  • 39.Lehman CD, Yala A, Schuster T, Dontchos B, Bahl M, Swanson K, et al. Mammographic Breast Density Assessment Using Deep Learning: Clinical Implementation. Radiology. 2019;290(1):52–8. 10.1148/radiol.2018180694 [DOI] [PubMed] [Google Scholar]
  • 40.Beam AL, Kohane IS. Big Data and Machine Learning in Health Care. Jama. 2018;319(13):1317–8. 10.1001/jama.2017.18391 [DOI] [PubMed] [Google Scholar]
  • 41.Rajpurkar P, Irvin J, Zhu K, Yang B, Mehta H, Duan T, et al. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. arXiv [Internet]. 2017; 1711.05225.
  • 42.Deo RC. Machine Learning in Medicine. Circulation. 2015;132(20):1920–30. 10.1161/CIRCULATIONAHA.115.001593 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Ehteshami Bejnordi B, Veta M, Johannes van Diest P, van Ginneken B, Karssemeijer N, Litjens G, et al. Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer. Jama. 2017;318(22):2199–210. 10.1001/jama.2017.14585 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Lindsey R, Daluiski A, Chopra S, Lachapelle A, Mozer M, Sicular S, et al. Deep neural network improves fracture detection by clinicians. Proc Natl Acad Sci U S A. 2018;115(45):11591–6. 10.1073/pnas.1806905115 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Kaiming Li

11 Mar 2020

PONE-D-20-02069

Deep learning in rare disease. Detection of tubers in tuberous sclerosis complex

PLOS ONE

Dear Dr. Sanchez Fernandez,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by Apr 25 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Kaiming Li

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements:

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.plosone.org/attachments/PLOSOne_formatting_sample_main_body.pdf and http://www.plosone.org/attachments/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Our internal editors have looked over your manuscript and determined that it is within the scope of our Digital Health Technology Call for Papers. This collection of papers is headed by a team of Guest Editors for PLOS ONE: Eun Kyoung Choe (University of Maryland, College Park), Chelsea Dobbins (University of Queensland), Sunghoon Ivan Lee (University of Massachusetts, Amherst), and Claudia Pagliari (University of Edinburgh).

The Collection will encompass a diverse range of research articles on digital health technologies ranging from technology design to patient care and health systems management.  Additional information can be found on our announcement page: https://collections.plos.org/s/digital-health-tech.

If you would like your manuscript to be considered for this collection, please let us know in your cover letter and we will ensure that your paper is treated as if you were responding to this call. If you would prefer to remove your manuscript from collection consideration, please specify this in the cover letter.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript:

"Research reported in this publication was supported by the National Institute of Neurological Disorders And Stroke of the National Institutes of Health (NINDS) and Eunice Kennedy Shriver National Institute of Child Health & Human Development (NICHD) under Award Number U01NS082320. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health."

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

"The author(s) received no specific funding for this work."

4. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

5. One of the noted authors is a group or consortium, the TACERN Study Group. In addition to naming the author group, please list the individual authors and affiliations within this group in the acknowledgments section of your manuscript. Please also indicate clearly a lead author for this group along with a contact email address.

6. Please upload a copy of Supplementary Table S1, D2, Figure S1, and S2, which you refer to in your text on pages 12, 14, and 15.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Artificial intelligence and deep learning have accelerated studies of automated diagnosis for human diseases. Some published studies have focused on static pictures (retinopathy, skin cancer, echocardiography, etc.), and a few studies involve three-dimensional CT images. This submitted study goes further than the above ones: it analyzes three-dimensional and more detailed MR images. The authors chose a rare disease with limited data to verify the feasibility of a deep learning method. It is a challenging and innovative study indeed, but I have several suggestions for the authors.

1, In real clinical practice, the logical flow of a diagnostic procedure is: (1) screening abnormalities from normal, then (2) performing differential diagnosis (tumor, FCD, etc.) from abnormalities to make a final diagnosis. In this manuscript, however, the logical flow is: (1) a specialist defines a specific disease (TSC) and completely normal controls, respectively, then (2) a deep learning algorithm attempts to distinguish the two groups. As the gap between the two different logical flows could possibly hinder clinical application, I suggest the authors further clarify the rationale of their research method.

2, In general, deep learning uses large-scale data, and the data are raw and unpicked, as in the head CT study (S Chilamkurthy, Lancet 2018). In this manuscript, however, the authors themselves picked specific slices (5-10 slices) from each patient's MR scans (Line 117, &256) to bring into the CNN procedures. Is this manual selection the reason why deep learning could be done with a small number of cases? Deep learning after manual screening may be a feasible way to produce results in the short term, but it might deliver misleading information and have limited prospects for clinical application. The authors should clarify this further.

3, The Conclusion (Line 54, &459) should be refined by adding restrictive qualifiers to deliver precise information. For example, it may be concluded that deep learning algorithms can distinguish previously manually screened TSC MR scans from normal MR scans, or that deep learning can be prudently applied to a small but highly selected dataset in a rare neurological disorder.

To sum up, the authors suggest a new idea for a deep learning application, although some points still need further clarification. I would like to recommend the manuscript for publication after some revisions.

Reviewer #2: Introduction:

- The motivation behind this work was clearly stated: tuberous sclerosis complex (TSC) has diverse characteristics between individuals, and automatic tuber detection on brain MRI can improve diagnostic certainty given the limited number of medical specialists. Deep learning approaches have shown promising performance on image classification tasks in many medical applications with a huge number of training images, but this remains challenging for the diagnosis of rare neurological disorders.

- This paper aims to demonstrate that convolutional neural networks (CNNs) can be developed for the detection of rare brain anomalies with a relatively small dataset, which points to the applicability of deep learning with transfer learning to new medical imaging tasks with good performance and to a good solution for privacy problems.

- The problems related to TSC diagnosis were clearly stated, but there are no data about the diagnostic performance of specialists. Such information and comparison might be useful for a better understanding of the difficulty of this classification task for humans, and the significance of deep learning for medical applications could be highlighted.

Methods:

- The dataset was explained in detail, from the inclusion criteria for TSC patients and the criteria for the selection of images to the MRI sequences and the training/validation/test datasets. This is important for re-implementing the experiment.

- The data augmentation was introduced with detailed parameters, which is important to avoid overfitting. Three different CNN architectures were deployed: one model was trained from scratch and the other two were fine-tuned from pre-trained models. All training configurations were provided for reference. The best model was selected on the validation dataset and further verified on the test dataset; no patient's data existed in both the validation and test datasets, and extra data were used for further verification. The model visualization part mainly focuses on the application of Grad-CAM and saliency map methods, which interpret the classification procedure of CNN models vividly and intuitively.

- All data were divided into three parts for training, validation, and testing, with the same number of patients and normal controls. The results would be more convincing if k-fold cross-validation were used to verify the model performance. Especially for small datasets, the variance can be high across different subsets of the data, and such performance variance would also be interesting for evaluating the robustness of the models.

Results and Discussion:

- The performance of InceptionV3 was provided. Based on the model visualization, the salient regions were highlighted, and incorrectly classified samples provided some insight into the misclassifications. Therefore, the feasibility of deep learning models for MRI diagnosis tasks has been demonstrated. With cloud-based training and local inference, the privacy problem can also be avoided.

- InceptionV3 was selected as the final model because it had the lowest validation loss. I think ResNet50 had comparable performance and should also be included in further experiments and comparison. The results would be more convincing if similar conclusions could also be drawn from it.

- The potential of image segmentation was also discussed. As an extension of the current classification task, segmentation is a promising approach for better understanding beyond existing clinical knowledge and experience, but the availability of datasets, the large computational requirements, and privacy concerns should always be noted.

Conclusion:

- The feasibility and good performance of deep learning models for TSC classification have been discussed. With a limited amount of data, competitive performance was achieved compared with the neuroradiologist. Cloud-based computation and local inference are promising for further applications.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Apr 29;15(4):e0232376. doi: 10.1371/journal.pone.0232376.r002

Author response to Decision Letter 0


24 Mar 2020

Dear Dr. Li, dear reviewers,

Thank you for your kind consideration of our manuscript entitled “Deep learning in rare disease. Detection of tubers in tuberous sclerosis complex” (PONE-D-20-02069). We are grateful for the helpful comments and have modified the manuscript to implement the reviewers’ constructive suggestions. Below, please find a detailed response to the comments.

All changes are shown in “tracked changes” mode and highlighted in a version of the manuscript uploaded as a supplementary file.

Thank you for reconsidering our manuscript for publication.

Sincerely,

Iván Sánchez Fernández & Jurriaan Peters

RESPONSE TO EDITORIAL COMMENTS

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

Thank you for the reminder. Some of the authors, as part of the TACERN collaborative, received funding to collect the data, although this analysis of the data was not specifically funded. To err on the side of caution, we identified these authors by their initials and mentioned the funding support in the Funding Section on the title page. Iván Sánchez Fernández has recently received an Amazon Web Services Cloud Credits for Research award in the amount of $9,700 in computational credits for his project on “Identification and localization of tubers in Tuberous Sclerosis Complex with deep learning convolutional neural networks”. We have added this new computational credit support to the cover letter and to the study funding section of the title page.

Response to this comment in the text: Cover letter. Study funding section in the title page.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We appreciate the opportunity to share our protocols, analyses, and code. We made our code and results publicly available in full at GitHub and Zenodo. In addition, we deposited our code and results at protocols.io at dx.doi.org/10.17504/protocols.io.bdt3i6qn.

Response to this comment in the text: links to github.com, zenodo.org, and protocols.io

When submitting your revision, we need you to address these additional requirements:

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.plosone.org/attachments/PLOSOne_formatting_sample_main_body.pdf and http://www.plosone.org/attachments/PLOSOne_formatting_sample_title_authors_affiliations.pdf

We reviewed and made sure that the formatting met the PLOS ONE formatting requirements.

Response to this comment in the text: All files.

2. Our internal editors have looked over your manuscript and determined that it is within the scope of our Digital Health Technology Call for Papers. This collection of papers is headed by a team of Guest Editors for PLOS ONE: Eun Kyoung Choe (University of Maryland, College Park), Chelsea Dobbins (University of Queensland), Sunghoon Ivan Lee (University of Massachusetts, Amherst), and Claudia Pagliari (University of Edinburgh).

The Collection will encompass a diverse range of research articles on digital health technologies ranging from technology design to patient care and health systems management. Additional information can be found on our announcement page: https://collections.plos.org/s/digital-health-tech.

If you would like your manuscript to be considered for this collection, please let us know in your cover letter and we will ensure that your paper is treated as if you were responding to this call. If you would prefer to remove your manuscript from collection consideration, please specify this in the cover letter.

We thank the editors for this consideration, but would prefer our manuscript to be considered as a regular article. We specified this in the cover letter.

Response to this comment in the text: Cover letter.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript:

"Research reported in this publication was supported by the National Institute of Neurological Disorders And Stroke of the National Institutes of Health (NINDS) and Eunice Kennedy Shriver National Institute of Child Health & Human Development (NICHD) under Award Number U01NS082320. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health."

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

"The author(s) received no specific funding for this work."

We appreciate the opportunity to clarify this relevant aspect. Some of the authors as part of the TACERN collaborative received funding to collect this data, although this analysis of the data was not specifically funded. We identified these authors by their initials, and mentioned the funding support in the Funding Section on the title page.

Response to this comment in the text: Title page.

4. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

Although we would like to share the magnetic resonance imaging (MRI) data publicly, these are clinical data from patients and there are legal and ethical restrictions on how they can be shared. The Institutional Review Board at Boston Children’s Hospital allowed us to use de-identified MRI data for this study because we are medical researchers who have had specific training on the ethical use of medical data and will use these data only for the purposes of this research. The possibility of re-identification of subjects when clinical data are used in this manner by trained medical researchers is remote. However, making the data publicly available to any user on the internet is not allowed by Institutional Review Boards. Users with no ethical training in the management of clinical data might use the MRI images for non-research purposes or might try to re-identify patients; that possibility would violate the patients’ rights. Interested researchers may request permission from the Institutional Review Board at Boston Children’s Hospital to obtain these data and, after appropriate verification of training in the ethical management of medical data and an appropriate agreement on how the data will be used, the authors will make the data available.

Following the editorial comment, we have expanded the “Data availability” subsection to specify the legal and ethical restrictions of clinical data and the Institutional Review Board responsible for these data.

Response to this comment in the text: “Patients and methods” section, “Data availability” subsection (lines 258-260 in the track changes manuscript).

5. One of the noted authors is a group or consortium TACERN Study Group. In addition to naming the author group, please list the individual authors and affiliations within this group in the acknowledgments section of your manuscript. Please also indicate clearly a lead author for this group along with a contact email address.

Thank you for the opportunity to further detail the participants in the TACERN consortium. We have provided a list of the TACERN co-investigators in the acknowledgments section of the manuscript and have identified Dr. Mustafa Sahin, MD, PhD and Dr. Darcy Krueger, MD, PhD as the lead authors for this group.

Response to this comment in the text: “Acknowledgements” section (lines 602-603 in the track changes manuscript).

6. Please upload a copy of Supplementary Table S1, D2; Figure S1; S2 which you refer to in your text on page 12, 14, 15.

We have now uploaded a copy of Supplementary Table S1, D2, Figure S1, and S2 to the protocols.io files. These supplementary data can also be found on the GitHub and Zenodo repositories.

RESPONSE TO REVIEWER’S QUESTIONS

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

________________________________________

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

________________________________________

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

________________________________________

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

________________________________________

RESPONSE TO REVIEWER 1

Artificial intelligence and deep learning have accelerated studies of automated diagnosis for human diseases. Some published studies have focused on static pictures (retinopathy, skin cancer, echocardiography, etc.), and a few studies involve three-dimensional CT images. This submitted study goes further than the above ones: it analyzes three-dimensional and more detailed MR images. The authors chose a rare disease with limited data to verify the feasibility of a deep learning method. It is a challenging and innovative study indeed, but I have several suggestions for the authors.

1, In real clinical practice, the logical flow of a diagnostic procedure is: (1) screening abnormalities from normal, then (2) performing differential diagnosis (tumor, FCD, etc.) from abnormalities to make a final diagnosis. In this manuscript, however, the logical flow is: (1) a specialist defines a specific disease (TSC) and completely normal controls, respectively, then (2) a deep learning algorithm attempts to distinguish the two groups. As the gap between the two different logical flows could possibly hinder clinical application, I suggest the authors further clarify the rationale of their research method.

We appreciate the opportunity to clarify this important point. Training a convolutional neural network (and other deep learning and machine learning models) requires labelled data. Labels classify observations (for example, MRI images) as belonging to some category (for example, an MRI with tubers or a normal MRI) and allow the model to “learn” during training the main features that make observations belong to a given category.

Once trained, the convolutional neural network is presented with new observations (for example, MRI images) and is tested on its ability to predict their label (for example, an MRI with tubers or a normal MRI) on these previously unseen observations.

Therefore, extracting observations (MRI images in our case) and their labels (having tubers or belonging to normal MRIs) is a prerequisite to develop a neural network. However, once the model has been trained, validated, and tested, its clinical use follows the typical clinical flow of showing the model new data and expecting the model to provide a label (diagnosis) for it.
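For illustration only, a minimal sketch of this deployment phase with a trained Keras model (the model file and image names are hypothetical placeholders, not the study’s actual files):

```python
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing import image

# Hypothetical file names, purely to illustrate the deployment phase described above.
model = load_model("trained_tuber_classifier.h5")        # CNN trained on labelled slices

img = image.load_img("new_patient_slice.png", target_size=(224, 224))
x = image.img_to_array(img)[np.newaxis, ...] / 255.0     # add batch dimension, scale to [0, 1]

prob_tuber = float(model.predict(x)[0, 0])               # probability the slice contains tubers
print("tubers" if prob_tuber > 0.5 else "no tubers")     # predicted label (diagnosis)
```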

Following the reviewer’s thoughtful comment, we have expanded on this relevant aspect in the newly created subsection “Model development versus clinical practice: clinical cases”.

Response to this comment in the text: “Patients and methods” section, “Model development versus clinical practice: clinical cases” subsection (lines 214-219 in the track changes manuscript).

2, In general, deep learning uses large-scale data, and the data are raw and unpicked, as in the head CT study (S Chilamkurthy, Lancet 2018). In this manuscript, however, the authors themselves picked specific slices (5-10 slices) from each patient's MR scans (Line 117, &256) to bring into the CNN procedures. Is this manual selection the reason why deep learning could be done with a small number of cases? Deep learning after manual screening may be a feasible way to produce results in the short term, but it might deliver misleading information and have limited prospects for clinical application. The authors should clarify this further.

We appreciate the reviewer’s feedback. The main reason for manually selecting slices is that selecting normal MRI slices (without tubers) from a patient with tuberous sclerosis complex would misinform the CNNs. Such images would hinder training because the label for these slices would be “TSC+” (they are derived from patients with TSC), yet there would not be any features in the slices that allow them to be classified as belonging to the TSC group. In other words, the objective of the study was to develop a deep learning method to detect tubers in MRI slices, not to detect MRI slices (with or without tubers in them) from patients with tuberous sclerosis complex.

Other reasons include privacy and confidentiality (3-dimensional brain MRIs are more easily re-identifiable than slices) and the structure of convolutional neural networks. Most convolutional neural networks have been developed to take two-dimensional images as input. Although there are convolutional neural networks for three-dimensional inputs, they are not as advanced as convolutional neural networks for two-dimensional data.
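As an aside, extracting 2D axial slices from a 3D volume so they can be fed to a 2D CNN is straightforward; a minimal sketch using nibabel (the file name is a hypothetical placeholder):

```python
import nibabel as nib
import numpy as np

# Load a 3D MRI volume and extract its 2D axial slices (file name is hypothetical).
volume = nib.load("flair_volume.nii.gz").get_fdata()    # shape: (x, y, number_of_axial_slices)
axial_slices = [volume[:, :, k] for k in range(volume.shape[2])]

# Each selected slice can then be intensity-normalized before being passed to the classifier.
mid_slice = axial_slices[len(axial_slices) // 2]
mid_slice = (mid_slice - mid_slice.min()) / (mid_slice.max() - mid_slice.min() + 1e-8)
print(mid_slice.shape)
```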

The final deep learning model can be applied to new MRI images, as we showed with the test cases in our manuscript. This model is readily available in our standalone app, which any user can try on their own 2D imaging data.

We recognize, however, the reviewer’s concern that the clinical applicability is limited by the requirement that the user provide slices containing an abnormality. We have clarified this in the manuscript at two locations, in the “Patients and methods” section and in the “Discussion” section, as outlined below.

Response to this comment in the text:

1 - “Patients and methods” section, “MRI sequences, image labeling, and division into training, validation, and testing” subsection (lines 127-129 in the track changes manuscript).

2 – “Discussion” section, lines 384-386

3, The Conclusion (Line 54, &459) should be refined by adding restrictive qualifiers to deliver precise information. For example, it may be concluded that deep learning algorithms can distinguish previously manually screened TSC MR scans from normal MR scans, or that deep learning can be prudently applied to a small but highly selected dataset in a rare neurological disorder.

We agree that the conclusion should not overstate the clinical applicability and that we should be precise about the limitations of our trained CNN. Thus, following the reviewer’s thoughtful suggestion, we modified our conclusions to reflect that deep learning algorithms, once trained on manually screened MRI scans, are able to differentiate normal from abnormal MRI slices, and that this can be prudently applied clinically to identify whether there are tubers in an MRI. (We would like to emphasize that the manual selection of images is performed only to provide representative examples during the training process; once the algorithm is trained, the CNN model can recognize tubers in a set of unselected images, as we showed in the clinical test examples.) We agree, however, with the reviewer that the conclusion should better reflect this approach and modified the conclusions accordingly.

Response to this comment in the text: “Abstract” section (lines 63-65 in the track changes version of the manuscript), “Conclusion” subsection (lines 491-494 in the track changes version of the manuscript).

To sum up, the authors suggest a new idea for a deep learning application, although some points still need further clarification. I would like to recommend the manuscript for publication after some revisions.

We appreciate the helpful comments which helped us improve the quality of the manuscript.

RESPONSE TO REVIEWER 2

- The motivation behind this work was clearly stated: tuberous sclerosis complex (TSC) has diverse characteristics between individuals, and automatic tuber detection on brain MRI can improve diagnostic certainty given the limited number of medical specialists. Deep learning approaches have shown promising performance on image classification tasks in many medical applications with a huge number of training images, but this remains challenging for the diagnosis of rare neurological disorders.

We are glad the reviewer appreciates and recognizes the challenge of deep learning in rare disorders. Deep learning approaches, although developed and used in many medical applications where a huge number of images are available, can be challenging to develop in rare neurological conditions where the number of available images is relatively small. Our approach is novel in that it shows that data augmentation can help train convolutional neural networks so that their performance is still very good even when the number of images available is not huge. We also appreciate the opportunity to emphasize that these models may be useful in areas of the world with a limited number of expert medical specialists. Following the reviewer’s thoughtful comment, we further emphasized these relevant aspects in the manuscript.

Response to this comment in the text: “Introduction” section, second paragraph (line 82 in the track changes version of the manuscript). “Discussion” section, first paragraph (line 384-386 in the track changes version of the manuscript).

- This paper aims to demonstrate that convolutional neural networks (CNNs) can be developed for the detection of rare brain anomalies with a relatively small dataset, which points to the applicability of deep learning with transfer learning to new medical imaging tasks with good performance and to a good solution for privacy problems.

We are grateful for the chance to further emphasize these relevant points. As the reviewer points out, our approach is novel because it shows that CNNs can be developed for detecting rare brain anomalies, even when the number of available images is relatively small. We emphasized deep learning with transfer learning and the ability to pack models in apps to overcome the privacy problems.

Response to this comment in the text: “Patients and methods” section, “Data augmentation” (lines 153-154 in the track changes version of the manuscript), and “Model development” subsections (lines 196-197 in the track changes version of the manuscript). “Discussion” section, fifth paragraph (lines 432-433 in the track changes version of the manuscript).

- The problems related to TSC diagnosis were clearly stated, but there are no data about the diagnostic performance of specialists. Such information and comparison might be useful for a better understanding of the difficulty of this classification task for humans, and the significance of deep learning for medical applications could be highlighted.

As the reviewer correctly points out, there are no formal data on the diagnostic performance of specialists. From our experience in clinical practice, detecting tubers and quantifying tuber burden is not difficult for trained neuroradiologists and pediatric neurologists. However, CNN models to identify and quantify tuber burden would be most helpful in non-academic settings and, especially, in low-resource areas where the number of, and access to, expert medical specialists is limited.

Response to this comment in the text: “Discussion” section, first paragraph (lines 384 in the track changes version of the manuscript).

Methods:

- The dataset was explained in detail, from the inclusion criteria for TSC patients and the criteria for the selection of images to the MRI sequences and the training/validation/test datasets. This is important for re-implementing the experiment.

We agree with the reviewer that a rigorous approach to deep learning involves re-evaluating the best model on a test set. We trained the models on the training set, validated their performance on the validation set, and tested the best-performing model on the test set. Because the training, validation, and test sets were independent (no patients in common between the three sets), the performance on the test set is a rigorous re-evaluation of the performance of the final model on data it was never exposed to previously.

Response to this comment in the text: “Patients and methods” section, “Minimizing overfitting” subsection (line 142-144 in the track changes version of the manuscript).

- The data augmentation was introduced with detailed parameters, which is important to avoid overfitting. Three different CNN architectures were deployed: one model was trained from scratch and the other two were fine-tuned from pre-trained models. All training configurations were provided for reference. The best model was selected on the validation dataset and further verified on the test dataset; no patient's data existed in both the validation and test datasets, and extra data were used for further verification. The model visualization part mainly focuses on the application of Grad-CAM and saliency map methods, which interpret the classification procedure of CNN models vividly and intuitively.

The reviewer correctly summarized the main steps taken to rigorously minimize overfitting. The visualization of the attention maps helps to understand how the CNN model makes decisions. We further emphasized these aspects in the manuscript.

Response to this comment in the text: “Patients and methods” section, “Model visualization” subsection (lines 230-231 in the track changes version of the manuscript).
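For the interested reader, a minimal sketch of how a Grad-CAM heatmap can be computed for a Keras CNN (the layer name and class index below are illustrative assumptions; the study’s actual visualization code is in the public repository):

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, img, last_conv_layer_name, class_index):
    """Return a Grad-CAM heatmap in [0, 1] for one image of shape (H, W, 3)."""
    # Model that maps the input to (last conv block activations, predictions).
    grad_model = tf.keras.models.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(img[np.newaxis, ...])
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)         # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))      # average gradient per channel
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)  # weighted sum of feature maps
    cam = tf.nn.relu(cam)                                # keep only positive evidence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()

# Example call for an InceptionV3-based classifier; "mixed10" is the name of the
# final mixed block in the Keras InceptionV3 implementation, and class_index 1 is
# assumed here to be the "tubers" class.
# heatmap = grad_cam(model, preprocessed_slice, "mixed10", class_index=1)
```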

- All data were divided into three parts for training, validation, and testing, with the same number of patients and normal controls. The results would be more convincing if k-fold cross-validation were used to verify the model performance. Especially for small datasets, the variance can be high across different subsets of the data, and such performance variance would also be interesting for evaluating the robustness of the models.

Our approach of keeping independent training, validation, and test sets is also termed “held-out cross-validation”. Therefore, we have already performed cross-validation.

We minimized overfitting and made our deep learning method robust by keeping sets of images (training, validation, and test) completely independent of each other, by making the convolutional neural network not prone to overfitting using addition of random noise, batch normalization, dropout, and global average pooling, and also by using data augmentation. These are the ways convolutional neural networks reduce overfitting and show robustness.

Cross-validation is a machine learning technique typically used with other methods (such as random forests or support vector machines) that have a relatively small number of parameters. Convolutional neural networks have thousands to millions of parameters, and other methods are better suited to minimizing overfitting and demonstrating robustness. Cross-validation is rarely used with convolutional neural networks, for several reasons:

(1) Minimizing overfitting and showing robustness of convolutional neural networks is achieved with all the techniques shown above. For example, the dropout method or the ResNet50 architecture are conceptually similar to cross-validation in that the final model is an ensemble of different versions of the model trained with different data to maximize generalization.

(2) The way to evaluate the robustness of a deep learning model to new data is using a held out test set that is actually new data (the deep learning model never saw it). That is the approach used in convolutional neural networks applied to medical problems [Gulshan, V. et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. Jama 316, 2402-2410, doi:10.1001/jama.2016.17216 (2016). Chilamkurthy, S. et al. Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study. Lancet 392, 2388-2396, doi:10.1016/S0140-6736(18)31645-3 (2018).]. As long as the held out test set is a random sample of the original database, it is a perfectly valid way to evaluate performance in new data because the test set is actually new data.

(3) In our particular database, cross-validation would have to be performed within the training set. With 69 TSC patients and 69 control patients, a 10-fold cross-validation would test results on fewer than 7 TSC patients and fewer than 7 control patients in each fold, which would probably be associated with large random variability between the folds, reflecting not model fit but small-sample variability. Even when averaged, this method does not appear to provide a clear advantage over the more standard split into training, validation, and test sets.

Response to this comment in the text: “Patients and methods” section, “Minimizing overfitting” subsection (lines 142-144 in the track changes version of the manuscript).

For the interested reader, we would like to take this opportunity to clarify in more detail how this method is a held-out cross-validation, how overfitting is minimized and how the robustness of a convolutional neural network is evaluated in our paper.

Convolutional neural networks are complex mathematical functions with a huge number of parameters, which allows them to fit complex datasets well but, at the same time, makes them prone to fitting the data too well: overfitting. Overfitting occurs when the model recognizes patterns that exist in the specific dataset but are not generalizable. There are several ways to avoid overfitting, some common to all convolutional neural networks and some specific to situations where there is a relatively small number of training examples.

(1) Among the methods to minimize overfitting in convolutional neural networks in general, the most commonly used and powerful one is to completely separate the training, validation, and test sets. With this method we actually evaluate generalizability, because the convolutional neural network’s ability to detect patterns is tested on data it never saw before: the test set. Our convolutional neural networks were trained using only the images from the 69 TSC patients and 69 control patients in the training set. Each of the three convolutional neural networks learned only on the training set. The validation set, with the images of 20 TSC patients and 20 control patients, was used to track the progress of the convolutional neural networks: to see how well they were predicting on data they did not train on. That is, the validation set does not contribute to training any of the convolutional neural networks, but it allows estimating how well they will generalize, because the convolutional neural networks are tested on these validation data (data they never trained on). Further, the validation set serves another purpose when several convolutional neural networks are being compared: once the convolutional neural networks have been completely trained and their parameters are final, the convolutional neural network with the best performance on the validation set is selected as the final model because it is the one with the best expected generalizability (it performs best on data it never trained on). Once the best of the three convolutional neural networks is selected, it is tested on data it never saw before: the test set, with 25 TSC patients and 25 control patients. Therefore, the performance of the selected convolutional neural network on the test set is a good estimate of how well it will generalize to new datasets, because the test set is actually a new dataset it never saw before. There are no shared patients or images between the training, validation, and test datasets. The training, validation, and test datasets are completely separate from each other, and that is one of the methods by which overfitting is avoided: we demonstrate generalizability of the final model by testing it on data it never saw: the test set.

Explanations of the rationale for training, validation, and test sets can be found here https://deeplizard.com/learn/video/Zi-0rlM4RDs and here https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7.
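As a purely illustrative sketch of such a patient-level split (the patient identifiers below are hypothetical; the actual assignment used in the study is described in the manuscript):

```python
import random

def split_patients(patient_ids, n_train=69, n_val=20, n_test=25, seed=0):
    """Assign whole patients (not individual images) to exactly one set each."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:n_train + n_val + n_test])

# 114 TSC patients and 114 controls, split 69/20/25 per group as described above.
tsc_train, tsc_val, tsc_test = split_patients([f"TSC_{i:03d}" for i in range(114)])
ctl_train, ctl_val, ctl_test = split_patients([f"CTRL_{i:03d}" for i in range(114)])

# Because the split is done on patient identifiers, no patient can contribute
# images to more than one of the training, validation, and test sets.
```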

(2) Other general methods to minimize overfitting and make convolutional neural network models robust consist of preventing the convolutional neural network from fitting the training dataset too well. These include the addition of random noise, batch normalization, dropout, and global average pooling. The general idea behind all these methods is to make the training data “fuzzier” (note that this is a simplification of reality; the methods are more complicated) so that the convolutional neural network is able to fit the features that are most characteristic of the pattern of interest, but not the noise that randomly appears in the training dataset and is not generalizable to other datasets. In particular, the addition of random noise adds a layer to the initial architecture of the convolutional neural network so that the training input gets slightly modified; therefore, it is more difficult for the convolutional neural network to memorize the training examples, which makes it more robust and generalizable (see the Gaussian noise layers “model.add(GaussianNoise(0.2))” in our code https://ivansanchezfernandez.github.io/TSC_training_validation/). Batch normalization works by normalizing (subtracting the mean and dividing by the standard deviation) the outputs of a hidden layer so that the next hidden layer receives inputs without extreme values, which helps training and, through regularization, avoids overfitting (see the batch normalization layers “model.add(BatchNormalization())” and “BatchNormalization” in our code https://ivansanchezfernandez.github.io/TSC_training_validation/). Dropout works by randomly dropping some of the hidden units in the convolutional neural network at each training step so that the final convolutional neural network is an ensemble of different convolutional neural networks, which makes it more robust to noise and minimizes overfitting (see the dropout layers “model.add(Dropout(rate = 0.5))” in our code https://ivansanchezfernandez.github.io/TSC_training_validation/). Global average pooling minimizes overfitting by reducing the total number of parameters in the model: it reduces dimensions in the final layers by taking the average across some dimensions (see the global average pooling layers “GlobalAveragePooling2D()(x)” in our code https://ivansanchezfernandez.github.io/TSC_training_validation/). Explanations of these methods can be found at https://machinelearningmastery.com/train-neural-networks-with-noise-to-reduce-overfitting/, https://towardsdatascience.com/batch-normalization-in-neural-networks-1ac91516821c, https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/, and https://alexisbcook.github.io/2017/global-average-pooling-layers-for-object-localization/.
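For illustration only, a minimal Keras sketch (not the authors’ exact architecture) of how the regularization layers quoted above can be combined in a small binary classifier:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.GaussianNoise(0.2, input_shape=(224, 224, 3)),  # slightly perturb the inputs
    layers.Conv2D(32, 3, activation="relu"),
    layers.BatchNormalization(),                            # normalize hidden activations
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.BatchNormalization(),
    layers.GlobalAveragePooling2D(),                        # shrink parameters before the classifier
    layers.Dropout(rate=0.5),                               # randomly drop units during training
    layers.Dense(1, activation="sigmoid"),                  # tuber slice vs. normal slice
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```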

(3) Finally, a method that is typically used when the number of training examples is relatively small is data augmentation. Data augmentation consists of synthetically creating artificial examples from the original training examples by randomly rotating, shifting, and zooming the original images. This method provides more examples for learning, but it also has a regularization effect because the convolutional neural network cannot overfit noisy features (like the location or size of the pattern of interest) and instead has to recognize the pattern of interest by its most characteristic features.
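A minimal sketch of this kind of augmentation with Keras (the parameter values and folder layout are illustrative assumptions; the study’s actual settings are in the public code repository):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # scale pixel intensities to [0, 1]
    rotation_range=15,       # random rotations, in degrees
    width_shift_range=0.1,   # random horizontal shifts (fraction of width)
    height_shift_range=0.1,  # random vertical shifts (fraction of height)
    zoom_range=0.1,          # random zoom in/out
    horizontal_flip=True,    # random left-right flips
)

# Hypothetical folder with "tsc/" and "control/" subfolders of 2D slices;
# every epoch then sees slightly different versions of the same images.
train_generator = train_datagen.flow_from_directory(
    "training/",
    target_size=(224, 224),
    batch_size=32,
    class_mode="binary",
)
```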

In summary, we minimized overfitting through the above methods, and made our deep learning method robust by keeping sets of images (training, validation, and test) completely independent of each other, by making the convolutional neural network not prone to overfitting using addition of random noise, batch normalization, dropout, and global average pooling, and also by using data augmentation.

Results and Discussion:

- The performance of InceptionV3 was provided. Based on the model visualization, the salient regions were highlighted, and incorrectly classified samples provided some insight into the misclassifications. Therefore, the feasibility of deep learning models for MRI diagnosis tasks has been demonstrated. With cloud-based training and local inference, the privacy problem can also be avoided.

Thank you for summarizing the main features of the “Results” section in our manuscript. We further emphasized that this App avoids privacy challenges because it has no images on it, just the final model.

Response to this comment in the text: “Results” section, “Interactive model” subsection (lines 367-368 in the track changes version of the manuscript).

- InceptionV3 was selected as the final model because it had the lowest validation loss. I think ResNet50 had comparable performance and should also be included in further experiments and comparison. The results would be more convincing if similar conclusions could also be drawn from it.

Thank you for bringing up this important point. The usual approach with deep learning CNNs is to select only one model from the validation step and test it on the test set. In this way, improvements in performance from validation to testing cannot be attributed to “cherry picking” the model. However, the reviewer makes a very relevant point and, following the reviewer’s thoughtful comment, we tested the performance of ResNet50 on the test set, with very good results as shown below.

ResNet50

Accuracy: 0.94; AUC: 0.99

                                     Real classification
                                     TSC         Control
Predicted classification: TSC        182         2           PPV: 0.99
Predicted classification: Control     28         287         NPV: 0.91
                                     Sen: 0.87   Spec: 0.99  F1: 0.92

Legend: AUC: Area under the receiver operating characteristic curve. F1: F1-score. NPV: Negative predictive value. PPV: Positive predictive value. Sen: Sensitivity. Spec: Specificity.
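For illustration, the summary metrics above can be recomputed directly from the confusion matrix (values transcribed from the table; rounding to two decimals gives the reported figures):

```python
tp, fp = 182, 2        # predicted TSC: truly TSC, truly control
fn, tn = 28, 287       # predicted control: truly TSC, truly control

sensitivity = tp / (tp + fn)                      # 0.87
specificity = tn / (tn + fp)                      # 0.99
ppv = tp / (tp + fp)                              # 0.99
npv = tn / (tn + fn)                              # 0.91
accuracy = (tp + tn) / (tp + fp + fn + tn)        # 0.94
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # 0.92

print(sensitivity, specificity, ppv, npv, accuracy, f1)
```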

The full code and results are present at https://ivansanchezfernandez.github.io/TSC_ResNet50/

Response to this comment in the text: As calculating the performance of a non-selected model on the test set is not standard in the deep learning CNN approach, we present these results only in this response to the peer-review comments, which will be publicly available to the interested reader under PLOS ONE's open-access standards. These results show the robustness of the methods.

- The potential of image segmentation was also discussed. As an extension of the current classification task, segmentation is a promising approach for better understanding beyond existing clinical knowledge and experience, but the availability of datasets, the large computational requirements, and privacy concerns should always be noted.

We appreciate the opportunity to clarify this relevant aspect. We emphasized that image segmentation would require training images with manually marked edges of the lesion and a different convolutional neural network architecture. We aim to develop a study on lesion segmentation in the future and we are working on slowly building the appropriate dataset of manually marked images, but we do not presently have the data for that approach.

Response to this comment in the text: “Results” section, “Model visualization” subsection (lines 302-303 in the track changes version of the manuscript). “Discussion” section, fifth paragraph (lines 432-433 in the track changes version of the manuscript).

Conclusion:

- The feasibility and good performance of deep learning models for TSC classification have been discussed. With a limited amount of data, competitive performance was achieved compared with the neuroradiologist. Cloud-based computation and local inference are promising for further applications.

We appreciate the helpful comments and have emphasized all aspects discussed by the reviewer, which helped us improve the quality of the manuscript through critical peer review.

________________________________________

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

Thank you, we used PACE and the images met PLOS specifications.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Kaiming Li

14 Apr 2020

Deep learning in rare disease. Detection of tubers in tuberous sclerosis complex

PONE-D-20-02069R1

Dear Dr. Sanchez Fernandez,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Kaiming Li

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The author clarified the necessity of labeled data and the difference between deep learning and clinical practice by adding the “Model development versus clinical practice” paragraph. The author also addressed the concern that clinical applicability is limited to cases in which the user provides slices containing an abnormality. The conclusion has been rewritten to reflect precisely the mechanism of deep learning and its future application. Therefore, I recommend the manuscript for publication in PLOS ONE.

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Acceptance letter

Kaiming Li

17 Apr 2020

PONE-D-20-02069R1

Deep learning in rare disease. Detection of tubers in tuberous sclerosis complex

Dear Dr. Sanchez Fernandez:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Kaiming Li

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    All code and results are available. The original neuroimages are not directly available because public distribution of patient data must be requested from the Institutional Review Board. There are legal and ethical restrictions on sharing clinical data. Medical researchers with proper training in the ethical management of de-identified clinical data should contact the Institutional Review Board at Boston Children’s Hospital (IRB@childrens.harvard.edu) to request data access. All code, results, and full models can be found at: https://ivansanchezfernandez.github.io/TSC_code_results_models/. The training and validation code includes a network convergence plot for each model, in which the loss decreases and the accuracy increases until both plateau, indicating that the number of training examples was sufficient. In addition, we have released our best CNN model as an easy-to-use App built with Kivy, version 1.10.1 (36), so that readers can test the model on their own MRI images. The App and instructions for Windows can be found at: https://ivansanchezfernandez.github.io/TSC_TuberFinder_Windows/. The App and instructions for Apple can be found at: https://ivansanchezfernandez.github.io/TSC_TuberFinder_Apple/.
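
    For readers who prefer to call the trained network programmatically rather than through the App, the following minimal Python sketch illustrates one way a saved Keras model could be loaded and applied to a single exported MRI slice. The file name tsc_inceptionv3.h5, the slice file name, the 299x299 input size, and the [0, 1] scaling are illustrative assumptions; the released code and models at the URLs above define the actual file names, preprocessing, and interface.

    # Minimal sketch (assumptions noted above): load a saved Keras model and
    # classify one axial slice exported as a PNG.
    import numpy as np
    from tensorflow.keras.models import load_model
    from tensorflow.keras.preprocessing import image

    model = load_model("tsc_inceptionv3.h5")  # hypothetical file name

    # Load and preprocess a single axial T2/FLAIR slice.
    img = image.load_img("axial_slice.png", target_size=(299, 299))
    x = image.img_to_array(img) / 255.0      # scale pixel values to [0, 1]
    x = np.expand_dims(x, axis=0)            # add the batch dimension

    # The network outputs a probability that the slice contains a tuber.
    prob_tuber = float(model.predict(x)[0][0])
    print("Probability of tuber: {:.3f}".format(prob_tuber))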

