Author manuscript; available in PMC: 2021 May 1.
Published in final edited form as: J Magn Reson Imaging. 2019 Oct 18;51(5):1487–1496. doi: 10.1002/jmri.26959

Deep-Learning-Based Neural Tissue Segmentation of MRI in Multiple Sclerosis: Effect of Training Set Size

Ponnada A Narayana 1,*, Ivan Coronado 1, Sheeba J Sujit 1, Jerry S Wolinsky 2, Fred D Lublin 3, Refaat E Gabr 1
PMCID: PMC7165037  NIHMSID: NIHMS1056144  PMID: 31625650

Abstract

Background:

The dependence of deep-learning (DL)-based segmentation accuracy of brain MRI on the training size is not known.

Purpose:

To determine the required training size for a desired accuracy in brain MRI segmentation in multiple sclerosis (MS) using DL.

Study Type:

Retrospective analysis of MRI data acquired as part of a multicenter clinical trial.

Study Population:

In all, 1008 patients with clinically definite MS.

Field Strength/Sequence:

MRI was acquired on 1.5T and 3T scanners manufactured by GE, Philips, and Siemens, using dual-echo turbo spin echo, FLAIR, and T1-weighted turbo spin echo sequences.

Assessment:

Segmentation results generated by an automated analysis pipeline and validated by two neuroimaging experts served as the ground truth. A DL model based on a fully convolutional neural network was trained separately with 16 different training set sizes, and segmentation accuracy was determined as a function of training size. These data were fitted to a learning curve to estimate the training size required for a desired accuracy.

Statistical Tests:

The performance of the network was evaluated by calculating the Dice similarity coefficient (DSC), and lesion true-positive and false-positive rates.

Results:

The DSC for lesions showed a much stronger dependence on the sample size than that for gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF). When the training size was increased from 10 to 800, the DSC values ranged from 0.00 to 0.86 ± 0.016 for T2 lesions, 0.87 ± 0.009 to 0.94 ± 0.004 for GM, 0.86 ± 0.008 to 0.94 ± 0.005 for WM, and 0.91 ± 0.009 to 0.96 ± 0.003 for CSF.

Data Conclusion:

Excellent segmentation was achieved with a training size as small as 10 image volumes for GM, WM, and CSF. In contrast, a training size of at least 50 image volumes was necessary for adequate lesion segmentation.


Whole-brain, tissue-specific, and regional brain atrophy, in addition to white matter (WM) lesions, are major findings in multiple sclerosis (MS), the most common demyelinating disease in young adults.1 Atrophy and lesion load in MS are most commonly estimated from structural magnetic resonance imaging (MRI) and are increasingly used for managing patients and evaluating newer treatments in multicenter trials.2–5 Robust and automatic segmentation techniques are critical for estimating lesion load and brain atrophy without operator bias.

Various published automatic techniques for segmenting MS lesions have been reviewed.6,7 These two reviews clearly articulate the strengths, weaknesses, and general limitations of the various automatic segmentation methods. As they point out, major challenges in automatic segmentation of MS lesions arise from lesion heterogeneity in location, shape, and intensity. In addition, most automatic segmentation methods misclassify the diffuse portion of lesions as gray matter (GM).

Thus, the performance of these automatic segmentation techniques appears suboptimal, especially when data are acquired at different centers and on different scanners.6 There is therefore a need for automatic segmentation techniques that provide accurate results on data acquired at multiple centers using different scanner platforms. Deep learning (DL) appears to hold great promise in realizing this objective. A major advantage of DL is that image features can be extracted automatically, without human intervention.8 Convolutional neural networks (CNNs) are a class of machine-learning methods inspired by the organization of neurons in the visual cortex.8 DL models based on CNN architectures are thus particularly well suited for image analysis and have been used extensively for image segmentation.9,10 U-net, a fully convolutional neural network (FCNN) consisting of encoding and decoding stages, is popular in medical image processing.11 Application of DL to medical image analysis is an active area of research.12,13

We recently applied DL to brain tissue segmentation on MRI acquired from the same cohort used in the current study, using a fixed training size.14 In the current study, we evaluate the effect of training size on segmentation accuracy using the same network architecture described earlier.14

DL requires a large number of annotated (labeled) images for training. Access to large annotated medical image datasets is a major problem because of patient confidentiality concerns, the lack of standardized protocols, and the expense of annotation. A number of strategies have been devised to reduce training size requirements, including fine-tuning a network pretrained on large data in a different domain (ie, transfer learning),9,15–17 using a pretrained network in the same domain,18 and data augmentation.19 For now, however, DL-based image segmentation relies on the availability of large annotated datasets, and there is thus a need to determine the minimum sample size required to train deep networks: specifically, what is the minimum training size required to realize a desired segmentation accuracy? We are aware of only two publications that investigated the dependence of segmentation accuracy on training size in medical images.18,20 However, studies that systematically explored this dependence using a large number of training sets drawn from a large labeled database have not been reported so far. As far as we are aware, this is the first such study for segmenting brain MRI.

In segmentation, DL acts as a classifier, and the dependence of classifier performance on training size has been addressed in a number of publications.21–26 The learning curve approach has been proposed for evaluating classifier performance as a function of the training size,22,27 and these learning curves follow an inverse power law.26

The main purpose of this study was to investigate the effect of training size on the segmentation accuracy using the learning curve approach. A secondary objective was to investigate the dependence of lesion segmentation accuracy as a function of lesion size.

Materials and Methods

Ethics Statement

Local Institutional Review Board (IRB) approvals for scanning the patients were obtained by the participating centers. All patients signed informed consent. Our IRB approved the analysis of the MRI data. This study is fully HIPAA-compliant.

Image Dataset

The MRI data used in this study were acquired as part of CombiRx, a multicenter, double-blinded, randomized clinical trial (clinical trial identifier: NCT00211887) supported by the National Institutes of Health (NIH). Patients were recruited consecutively between 2005 and 2009; the demographic and clinical information is summarized in Table 1. In all, 1008 patients from 68 centers were recruited.28,29 MRI data were acquired on multiple platforms at 1.5T (85%) and 3T (15%) field strengths (Philips, Best, Netherlands; GE, Milwaukee, WI; or Siemens, Erlangen, Germany). The MRI protocol included 2D FLAIR (echo time/repetition time/inversion time (TE/TR/TI) = 80–100/10,000/2500–2700 msec), 2D dual-echo turbo spin echo (TSE) images (TE1/TE2/TR = 12–18/80–110/6800 msec; echo train length 8–16), and pre- and postcontrast T1-weighted (T1w) images (TE/TR = 12–18/700–800 msec) with identical geometry. The voxel dimensions were 0.94 × 0.94 × 3 mm. Except for slight variations in TE and TI, the same MRI protocol was used across all centers. As part of the CombiRx trial, all images were evaluated for quality and segmented by semiautomatic software.30–32 The segmentation results were validated by two experts: one (P.A.N.) is an imaging scientist with 30+ years of experience in MRI of MS and other neurological disorders, and the other (J.S.W.) is an MS neurologist with 30+ years of experience in MRI of MS. These validated segmentation results served as the ground truth for network training and evaluation.

TABLE 1.

Demographic and Clinical Data on the CombiRx Cohort (Adapted from Ref. 29)

Age (yrs)                                 37.7 ± 9.7
Female/Male (%)                           72.4/27.6
Race (%)
  Caucasian                               87.6
  African American                        7.2
  Other                                   5.2
Ethnicity (%)
  Hispanic                                6.3
  Non-Hispanic                            89.5
  Other                                   4.3
Symptom duration (yrs)                    4.8 ± 5.6
EDSS score at screening, median (range)   2 (0–6.5)

Data Preprocessing

The FLAIR and T1w images were aligned with the T2w images using rigid-body registration. All images were preprocessed using the MRI automated processing (MRIAP) pipeline, a validated semiautomatic software package that combines parametric and nonparametric techniques.30 The preprocessing steps included skull stripping, bias field correction, intensity normalization, and anisotropic diffusion filtering for noise reduction, as described elsewhere.30–32
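As a rough illustration of two of these steps (bias field correction and edge-preserving noise filtering), the sketch below uses SimpleITK. MRIAP is the authors' own software, so this is a generic stand-in under assumed parameters, not the pipeline's actual implementation; the file name is hypothetical.

```python
# Hedged sketch of bias field correction and anisotropic diffusion
# filtering with SimpleITK; parameters and the file name are assumed,
# not taken from the MRIAP pipeline.
import SimpleITK as sitk

img = sitk.ReadImage("t2w.nii.gz", sitk.sitkFloat32)  # hypothetical input
mask = sitk.OtsuThreshold(img, 0, 1)                  # rough foreground mask
corrected = sitk.N4BiasFieldCorrection(img, mask)     # bias field correction
denoised = sitk.CurvatureAnisotropicDiffusion(        # edge-preserving smoothing
    corrected, timeStep=0.0625, numberOfIterations=5)
```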

Network Description

Figure 1 shows the multiclass U-net used in this study. U-net consists of encoding and decoding paths composed of convolutional blocks, each with multiple layers. The convolutional blocks in the encoding path were connected to the corresponding blocks in the decoding path, allowing the retention of features from higher resolutions. Additionally, blocks in the encoding path were followed by a max pooling layer, while blocks in the decoding path were followed by an upsampling layer. Max pooling layers were assigned a window size of 2 × 2, reducing resolution by a factor of 2 along both image dimensions. Upsampling layers were used to match the learned feature dimensions with the block resolution before max pooling. Each processing layer in a block was assigned a number of filters (also called feature maps): the first convolutional block had 64 filters, and the number of filters was doubled for each succeeding convolutional block in the encoding path. This was done for five stages, with the last convolutional block having 1024 filters in each of its convolutional layers. In the decoding path, a similar approach was used, but the number of filters was halved, so that the last deconvolutional block had the same number of filters as the first convolutional block. All convolution/deconvolution layers were followed by rectified linear unit (ReLU) activation. For the last layer of the network, softmax activation was used to assign probability scores to each of the segmented tissue classes.
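As a concrete reading of this description, the following minimal Keras sketch builds such a U-net. The 256 × 256 input matrix size and the number of output classes are illustrative assumptions; Ref. 14 describes the actual implementation.

```python
# Minimal Keras sketch of the multiclass U-net described above: five
# encoding stages (64 -> 1024 filters), 2x2 max pooling, upsampling in
# the decoder, ReLU throughout, softmax output. The 256x256 input and
# five output classes are illustrative assumptions.
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    """A convolutional block: two 3x3 convolutions with ReLU activation."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(256, 256, 4), n_classes=5):
    inputs = layers.Input(input_shape)   # 4 channels: FLAIR, T2w, PDw, T1w
    skips, x = [], inputs
    for filters in (64, 128, 256, 512):  # encoding path: filters double
        x = conv_block(x, filters)
        skips.append(x)                  # kept for the skip connections
        x = layers.MaxPooling2D(2)(x)    # halve the spatial resolution
    x = conv_block(x, 1024)              # fifth (bottleneck) stage
    for filters, skip in zip((512, 256, 128, 64), reversed(skips)):
        x = layers.UpSampling2D(2)(x)    # decoding path: filters halve
        x = layers.Conv2D(filters, 2, padding="same", activation="relu")(x)
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, filters)
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(x)
    return Model(inputs, outputs)

model = build_unet()
```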

FIGURE 1:

Architecture of the U-net used for brain segmentation. The image dimension is denoted next to each layer, and the number of channels (features) is listed at the top.

Network Training

Preprocessed CombiRx images (FLAIR, T2w, PDw, and precontrast T1w) served as input to the U-net. The validated segmentation was used as the ground truth for training. Of the 1008 image volumes, three were discarded because of poor signal-to-noise ratio and five were excluded because of motion artifacts. Analysis was performed on the remaining 1000 baseline image sets, which were randomly partitioned into training (80%), validation (10%), and test (10%) sets. The training data were subsequently partitioned into 16 subsets to investigate the effect of training size on segmentation accuracy: 800 (80% of the scans), 500 (50%), 400 (40%), 250 (25%), 150 (15%), 120 (12%), 100 (10%), 90 (9%), 80 (8%), 70 (7%), 60 (6%), 50 (5%), 40 (4%), 30 (3%), 25 (2.5%), and 10 (1%). The network was trained separately on each of these subsets, and the corresponding accuracies (Dice similarity coefficients; DSC) were determined on the test set.
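This partitioning scheme can be sketched as follows; the random seed and index bookkeeping are assumptions for illustration.

```python
# Sketch of the 80/10/10 split and the 16 nested training subsets;
# the random seed and index bookkeeping are assumptions.
import numpy as np

rng = np.random.default_rng(seed=0)
indices = rng.permutation(1000)      # 1000 usable baseline image sets
train_idx = indices[:800]            # training pool (80%)
val_idx = indices[800:900]           # validation (10%)
test_idx = indices[900:]             # test (10%)

SUBSET_SIZES = [800, 500, 400, 250, 150, 120, 100, 90,
                80, 70, 60, 50, 40, 30, 25, 10]
# The network is retrained from scratch on each subset and all runs are
# scored against the same fixed test set.
training_subsets = {n: train_idx[:n] for n in SUBSET_SIZES}
```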

Training was performed under similar conditions for all the training sets. For each set, the network was trained for 500 epochs (one epoch = one pass through the training data) using stochastic gradient descent (SGD)33 as the optimizer. The batch size was fixed at eight for all sample sizes. We used an initial learning rate of 10⁻² and Nesterov momentum.34 Network weights were initialized using the Xavier algorithm.35 A balanced version of the Dice similarity coefficient was selected as the loss function36 to help alleviate imbalance between tissue classes. The learning rate was reduced by a factor of 0.4 every 15 epochs if the training loss did not decrease. To prevent overfitting and improve generalization, a "best model" strategy was adopted. For robustness against varying conditions (learning rate reduction, initial network weights, network optimization), training for each dataset size was conducted four times. The rationale for the choice of network architecture and parameters is described in detail in our recent publication.14

Training was implemented on the Maverick2 cluster at Texas Advanced Computing Center (TACC) with four NVIDIA GTX 1080 Ti graphics processing unit (GPU) cards using the Python Keras library37 and TensorFlow.38
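Continuing the U-net sketch above, the stated hyperparameters map onto standard Keras components roughly as follows. The momentum value, the checkpoint criterion, and the exact form of the balanced Dice loss (here, the generalized Dice loss of Ref. 36) are assumptions.

```python
# Hedged sketch of the training configuration in Keras; `model` is the
# U-net from the earlier sketch. The momentum value (0.9) and checkpoint
# criterion are assumptions; Keras' default glorot_uniform kernel
# initializer already corresponds to the Xavier scheme.
import tensorflow as tf
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint

def generalized_dice_loss(y_true, y_pred, eps=1e-6):
    """Class-balanced Dice loss (Ref. 36): classes weighted by the
    inverse of their squared volume to offset class imbalance."""
    axes = (0, 1, 2)  # sum over batch and both spatial dimensions
    w = 1.0 / (tf.reduce_sum(y_true, axis=axes) ** 2 + eps)
    num = tf.reduce_sum(w * tf.reduce_sum(y_true * y_pred, axis=axes))
    den = tf.reduce_sum(w * tf.reduce_sum(y_true + y_pred, axis=axes))
    return 1.0 - 2.0 * num / (den + eps)

model.compile(
    optimizer=SGD(learning_rate=1e-2, momentum=0.9, nesterov=True),
    loss=generalized_dice_loss)

callbacks = [
    # Reduce the learning rate by a factor of 0.4 when the training
    # loss plateaus; patience=15 approximates "every 15 epochs".
    ReduceLROnPlateau(monitor="loss", factor=0.4, patience=15),
    # "Best model" strategy: keep the weights with the lowest validation loss.
    ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
]
# model.fit(x_train, y_train, batch_size=8, epochs=500,
#           validation_data=(x_val, y_val), callbacks=callbacks)
```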

Inverse Power Law Fitting

Following Refs. 22, 26, we used an inverse power law (Eq. [1]) to fit the accuracy as a function of the training size:

\[ Y = (1 - a) - b_1 X^{b_2} \tag{1} \]

In the above equation, Y represents the accuracy, X represents the training size, (1 − a) is the highest accuracy obtainable with a large sample, and b1 and b2 represent the learning and decay rates, respectively. The learning curve parameters were determined by fitting the accuracy for each training size. Ideally, the parameter a should be close to 0 and b2 should be negative. We used the nonlinear weighted least-squares method to fit the curve, as described by Figueroa et al.26 The 95% confidence intervals (CIs) were calculated from the Hessian matrix and second-order derivatives of the inverse power law function used for curve fitting.26 This CI is referred to as the prediction CI.
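A minimal SciPy sketch of this weighted least-squares fit, using the T2 lesion DSC values from Table 2 and an assumed per-point SD as weights, is shown below; the CI computation of Figueroa et al.26 is omitted.

```python
# Sketch of the nonlinear weighted least-squares fit of Eq. [1] with
# SciPy, in the spirit of Figueroa et al. (Ref. 26). DSC values are the
# T2 lesion entries of Table 2; the per-point SD used as weights and the
# starting guess p0 are assumptions.
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(x, a, b1, b2):
    """Inverse power law: Y = (1 - a) - b1 * X**b2, with b2 < 0."""
    return (1.0 - a) - b1 * np.power(x, b2)

sizes = np.array([30, 40, 50, 60, 70, 80, 90, 100,
                  120, 150, 250, 400, 500, 800], dtype=float)
dsc = np.array([0.60, 0.68, 0.75, 0.77, 0.78, 0.79, 0.79, 0.79,
                0.78, 0.80, 0.84, 0.85, 0.85, 0.86])
sd = np.full_like(dsc, 0.03)  # assumed per-point SD

params, cov = curve_fit(learning_curve, sizes, dsc,
                        p0=(0.1, 1.0, -0.5), sigma=sd, absolute_sigma=True)
a, b1, b2 = params
print(f"plateau accuracy (1 - a) = {1 - a:.3f}, decay rate b2 = {b2:.2f}")
```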

To further demonstrate the value of the predictive model, we also fitted the model to the DSC using three smaller sets of training data and then compared the predicted DSC with the observed DSC values obtained with larger training set sizes for lesions. The smaller training sets comprised the first five, seven, and nine of the 16 training data sizes, corresponding to up to 50, 70, and 90 image sets, respectively. We also calculated the corresponding 95% CIs following the published procedure and using the computer scripts provided by Figueroa et al.26

Effect of Lesion Size

The effect of lesion size on T2 lesion segmentation was also assessed. For this purpose, lesions were divided into seven categories, somewhat arbitrarily, based on their volumes: 0–19 μl; 20–34 μl; 35–69 μl; 70–137 μl; 138–276 μl; 277–499 μl; >500 μl. The DSC, true-positive (TPR), and false-positive (FPR) rates were calculated for each lesion category.
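For illustration, this binning might be implemented as below, assuming individual lesions are defined by connected-component labeling and using the stated 0.94 × 0.94 × 3 mm voxel size (≈2.65 μl per voxel).

```python
# Illustrative binning of lesions into the seven volume categories,
# assuming individual lesions are defined by 3D connected-component
# labeling; 0.94 x 0.94 x 3 mm voxels give ~2.65 µl per voxel.
import numpy as np
from scipy import ndimage

BIN_EDGES_UL = [0, 20, 35, 70, 138, 277, 500, np.inf]  # category edges (µl)

def lesion_volume_categories(lesion_mask, voxel_volume_ul=0.94 * 0.94 * 3):
    """Return the volume category (0 = smallest .. 6 = largest) of each lesion."""
    labeled, n_lesions = ndimage.label(lesion_mask)
    voxel_counts = ndimage.sum(lesion_mask, labeled, index=range(1, n_lesions + 1))
    volumes_ul = np.asarray(voxel_counts) * voxel_volume_ul
    return np.digitize(volumes_ul, BIN_EDGES_UL) - 1
```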

Statistical Analysis

Agreement between the network segmentation and ground truth for all training sets was measured by computing DSC using Eq. [2]:

\[ \mathrm{DSC}_k = \frac{2 \times \mathrm{TP}_k}{\mathrm{FP}_k + 2 \times \mathrm{TP}_k + \mathrm{FN}_k} \tag{2} \]

where TPk, FPk, and FNk represent the number of true-positive, false-positive, and false-negative classifications of tissue class k, respectively. This analysis was performed using scikit-learn 0.20.2 (https://scikit-learn.org/stable/) and nibabel 2.3.3 (https://nipy.org/nibabel/) in Python 3.6. We also evaluated the network model by correlating the network-determined volumes with the ground truth using Excel 2013.
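A minimal sketch of Eq. [2], computed per tissue class from a scikit-learn confusion matrix with an assumed integer label encoding, is:

```python
# Sketch of Eq. [2] computed for every tissue class from a confusion
# matrix, using scikit-learn as in the paper's analysis; the integer
# label encoding (0..n_classes-1) is an assumption.
import numpy as np
from sklearn.metrics import confusion_matrix

def dice_per_class(y_true, y_pred, n_classes=5):
    """DSC_k = 2*TP_k / (FP_k + 2*TP_k + FN_k) for each class k."""
    cm = confusion_matrix(y_true.ravel(), y_pred.ravel(),
                          labels=list(range(n_classes)))
    tp = np.diag(cm)                 # true positives per class
    fp = cm.sum(axis=0) - tp         # column sums minus diagonal
    fn = cm.sum(axis=1) - tp         # row sums minus diagonal
    return 2 * tp / (fp + 2 * tp + fn)
```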

Results

The average ± standard deviation (SD) of the T2 lesion volume at baseline in this cohort was 12.2 ± 13.2 ml. The corresponding values for WM and GM were 469.2 ± 55.64 ml and 588.0 ± 63.45 ml, respectively.

As an example, Fig. 2 shows the input FLAIR image, the expert-validated segmentation, and the CNN-segmented images for training set sizes of 800, 400, 100, 50, 25, and 10 for one slice from an MS patient in the test set. Reduced performance can be observed, at least visually, as the training set size decreases. For training sizes of 10 and 25, the network missed the lesions completely. However, good lesion segmentation was seen for training sizes ≥50, and segmentation accuracy generally improved with larger training sizes.

FIGURE 2:

FLAIR, validated, and segmented images for different training sizes. The training sizes are 800 (a), 400 (b), 100 (c), 50 (d), 25 (e), and 10 (f). Note the markedly suboptimal lesion segmentation for training sizes of 25 and 10. Lesions, WM, GM, and CSF are shown in magenta, white, gray, and blue, respectively.

Correlation between tissue volumes segmented by the network model and the ground truth is shown in Fig. 3. The high correlation between the network-generated volumes and the ground truth can be easily appreciated.

FIGURE 3:

Correlation between the U-net determined volumes for different tissues and the ground truth. The excellent correlation between the network-derived volumes and ground truth can be seen in this figure.

The average segmentation accuracy, as assessed by the DSC, for different training set sizes and each tissue class is summarized in Table 2. The DSC values for GM, WM, and cerebrospinal fluid (CSF) were 0.87 ± 0.009, 0.86 ± 0.008, and 0.91 ± 0.009, respectively, for the smallest training size of 10. In contrast, the DSC for T2 lesions was 0.00 for this training size. For GM, WM, and CSF, the DSC increased slowly but monotonically with training size, reaching maximum values of 0.94 ± 0.004, 0.94 ± 0.005, and 0.96 ± 0.003 at the largest training size of 800. In contrast, the DSC for T2 lesions increased rapidly at first and then slowly, reaching a maximum of 0.86 ± 0.016 at a training size of 800.

TABLE 2.

DSC Values (Average ± SD) for Various Segmented Tissues for Different Training Sizes

All values are DSC (mean ± SD).

Training size   GM             WM             CSF            T2 lesions
 10             0.87 ± 0.009   0.86 ± 0.008   0.91 ± 0.009   0.00
 25             0.88 ± 0.003   0.88 ± 0.007   0.93 ± 0.001   0.00
 30             0.89 ± 0.012   0.87 ± 0.031   0.93 ± 0.004   0.60 ± 0.138
 40             0.90 ± 0.006   0.89 ± 0.013   0.94 ± 0.002   0.68 ± 0.076
 50             0.91 ± 0.009   0.90 ± 0.014   0.94 ± 0.006   0.75 ± 0.039
 60             0.91 ± 0.007   0.91 ± 0.010   0.94 ± 0.002   0.77 ± 0.033
 70             0.91 ± 0.006   0.90 ± 0.010   0.94 ± 0.002   0.78 ± 0.036
 80             0.91 ± 0.005   0.90 ± 0.008   0.94 ± 0.003   0.79 ± 0.040
 90             0.91 ± 0.007   0.89 ± 0.013   0.94 ± 0.003   0.79 ± 0.037
 100            0.91 ± 0.003   0.90 ± 0.004   0.95 ± 0.003   0.79 ± 0.029
 120            0.92 ± 0.009   0.91 ± 0.016   0.95 ± 0.003   0.78 ± 0.043
 150            0.93 ± 0.003   0.92 ± 0.005   0.95 ± 0.002   0.80 ± 0.034
 250            0.92 ± 0.004   0.90 ± 0.007   0.96 ± 0.002   0.84 ± 0.011
 400            0.94 ± 0.003   0.93 ± 0.005   0.96 ± 0.002   0.85 ± 0.011
 500            0.94 ± 0.005   0.93 ± 0.007   0.96 ± 0.002   0.85 ± 0.022
 800            0.94 ± 0.004   0.94 ± 0.005   0.96 ± 0.003   0.86 ± 0.016

Figure 4 shows the fitted learning curves for all the tissues. The parameters (a, b1, b2), based on the least-squares fit, were (0, 0.19, −0.18), (0, 0.2, −0.16), (0, 0.13, −0.20), and (0.1, 4.53, −0.83) for GM, WM, CSF, and lesions, respectively. As can be seen from this figure, the curves for GM, WM, and CSF show excellent fits, with root mean square errors (RMSE) of 4.18 × 10⁻³, 7.8 × 10⁻³, and 1.81 × 10⁻³, respectively. For the two smallest training sizes, the learning curve for T2 lesions does not pass through the experimental points, since the inverse power law is known to break down when the training size is small.22 Beyond these two points, however, the learning curve showed an excellent fit, with an RMSE of 1.81 × 10⁻². These results demonstrate that the training size needed for a given target accuracy can be estimated.

FIGURE 4:

Dependence of DSC on the training size for GM, WM, CSF, and T2 lesions. The points represent the Dice coefficient for each of the training sizes. The theoretically predicted dependence, based on the power law, is shown in dashed lines. The error bars represent SD.

The plots in Fig. 5 summarize the results of curve fitting for lesions with the three smaller training sets consisting of the first five, seven, and nine subsets. The relatively small RMSE values of 5.86 × 10⁻³, 5.30 × 10⁻³, and 5.28 × 10⁻³ for these three training sets indicate an excellent fit. These plots also show the 95% CIs for all three training sets used for fitting the data, along with the fitted curve generated using all 16 training sets and its 95% CI. The DSC values predicted from these three smaller training sets are very close to the values observed with the largest training size of 800, and the 95% CI narrows quickly as more training sets are included in the fit.

FIGURE 5:

Prediction of accuracy using smaller training sets for lesions. The plots (a), (b), and (c) are based on 5, 7, and 9 training subsets, respectively. The solid circles represent the data used for fitting the curves. The open circles represent the observed values. The 95% CI for the data, referred to as the observed CI (dotted lines), is also shown. The fitted curves (dashed lines) are based on the smaller training sets (solid circles). The 95% CI for the prediction (dash-dotted lines) was calculated from the fit model. This figure shows that all three smaller training sets have successfully predicted the observed values for large training sizes.

As stated above, the DSC of lesions showed a much stronger dependence on the training size. We therefore investigated the correlation between the CNN output and the ground truth for lesions for different training sizes. Figure 6 shows these plots for a few training sizes. A complete lack of correlation was observed for training sizes ≤25 image volumes. Note that for small training sizes, the network segmented background noise as lesions, resulting in a large number of false classifications. Good correlation was observed for training sizes ≥50.

FIGURE 6:

Correlation between the network-derived total lesion volumes and the ground truth for different training sizes (indicated by the number on top). Progressive improvement in the correlation with increased training size can be seen.

The dependence of the DSC, TPR, and FPR values on the training size for different lesion sizes is summarized in Fig. 7. As can be seen from this figure, accuracy was poorer for small lesions than for large lesions at all training sizes. For example, segmenting lesions smaller than 70 μl to an accuracy between 0.5 and 0.6 required a training size of more than 400. In contrast, for lesion volumes greater than ~300 μl, an accuracy of 0.7 was achieved with a training size of ~200. Similarly, the FPR was high and the TPR was low for small lesions.

FIGURE 7:

The dependence of the three performance measures (DSC, TPR, and FPR) on the training size for different lesion sizes. These plots show progressive improvement in segmentation with increased lesion size. Robust segmentation upon increased training size was also observed. Different shades of gray represent different lesion sizes. The lightest and darkest gray shades represent the smallest (S1) and the largest (S7) lesions, respectively.

Discussion

In this study we systematically assessed network performance as a function of training size in segmenting brain MRI in MS using the learning curve approach. This was possible because of the availability of a large annotated image dataset. We further validated our results by demonstrating the predictive nature of the model for smaller training sizes.

The network model appears to perform well, based on the high correlation between the network output and the ground truth for all tissues, including lesions, for large training sizes. Our results suggest that the segmentation accuracy of GM, WM, and CSF is relatively insensitive to the training size. In fact, excellent segmentation accuracy can be achieved for GM, WM, and CSF with a training size as small as 10. However, this is not true for lesion segmentation: for meaningful lesion segmentation accuracy, training sizes of more than 50 are needed. In fact, for training sizes of 10 and 25 datasets, the network completely failed to segment lesions. This is also confirmed by the lack of correlation between the network-derived lesion volume and the ground truth. One reason lesion segmentation requires a larger training size than GM, WM, and CSF is that the lesion class has far fewer voxels than the other tissue classes.

The predictive value of our model was further demonstrated by predicting the accuracy from smaller training sets (Fig. 5). As expected, the width of the CIs decreased with increasing training size.

As shown in the Results, the training size needed for a given accuracy strongly depends on the lesion volume. For lesion volumes of 500 μl and above, an accuracy of 0.8 can be achieved with a small training set of 30. In contrast, even with a training size of 800, the accuracy was only 0.5 for lesions smaller than 70 μl. This demonstrates that a large training size is necessary for segmenting small lesions. It also suggests that our observed accuracy of ~0.8 for a training size of 100 is driven mainly by the large lesions.

The improvement of lesion segmentation accuracy with increasing training size seems to result primarily from the increased TPR for small lesions. However, the FPR for small-lesion segmentation hardly improved with training sizes over 50. It is possible that the high FPR is related to errors in the ground truth.14

As indicated in the Introduction, there are only two other publications that explicitly addressed the effect of training size on DL-based segmentation. The study by Wong et al18 used a pretrained network for image classification; it did not use the learning curve approach to systematically determine the sample size for a desired accuracy. Cho et al20 used the learning curve approach with GoogLeNet, but their focus was on computed tomography (CT) images, and their largest training set contained 200 CT slices. In contrast, we systematically investigated the accuracy of GM, WM, CSF, and lesion segmentation in MS using 16 training sets with sizes between 10 and 800. In addition, we investigated the effect of training size as a function of lesion size. Using curriculum learning, Bengio et al39 achieved 82% accuracy for brain tumor classification with 91 training samples and 86% accuracy for cardiac semantic classification with 108 training samples. Because of the differences in imaging modalities and network models, it is difficult to compare our results with others.

This study also has limitations. We determined the minimum training size based on multichannel images for tissue segmentation, but the minimum training size depends on a number of factors, such as the acquisition protocol and the type of tissues to be segmented. Indeed, the results are tied not only to the dataset but also to the specific neural network configuration. Thus, our results may not be completely generalizable. Nevertheless, we believe this is the first study to systematically examine the effect of training size on segmenting MRI of MS brains, and it provides estimates of the minimum training size needed for accurate segmentation.

The results also depend on the ground truth. Since the ground truth is a mix of semiautomated segmentation and expert evaluation, it may contain some inaccuracies. However, imperfect ground truth is not an uncommon problem in medical imaging.

In conclusion, excellent segmentation was achieved with a training size as small as 10 image volumes for GM, WM, and CSF. In contrast, a training size of at least 50 image volumes was necessary for adequate lesion segmentation. The power law dependence allows prediction of the minimum training size needed for a given DSC target.

Acknowledgments

Contract grant sponsor: NINDS of the National Institutes of Health; Contract grant number: 1R56NS105857; Chair in Biomedical Engineering Endowment.

References

1. Wallin MT, Culpepper WJ, Campbell JD, et al. The prevalence of MS in the United States: A population-based estimate using health claims data. Neurology 2019;92:e1029–e1040.
2. Moccia M, de Stefano N, Barkhof F. Imaging outcome measures for progressive multiple sclerosis trials. Mult Scler 2017;23:1614–1626.
3. Sastre-Garriga J, Pareto D, Rovira À. Brain atrophy in multiple sclerosis: Clinical relevance and technical aspects. Neuroimaging Clin N Am 2017;27:289–300.
4. Ontaneda D, Fox RJ. Imaging as an outcome measure in multiple sclerosis. Neurotherapeutics 2017;14:24–34.
5. Hemond CC, Bakshi R. Magnetic resonance imaging in multiple sclerosis. Cold Spring Harb Perspect Med 2018;8:a028969.
6. Danelakis A, Theoharis T, Verganelakis DA. Survey of automated multiple sclerosis lesion segmentation techniques on magnetic resonance imaging. Comput Med Imaging Graph 2018;70:83–100.
7. García-Lorenzo D, Francis S, Narayanan S, Arnold DL, Collins DL. Review of automatic segmentation methods of multiple sclerosis white matter lesions on conventional magnetic resonance imaging. Med Image Anal 2013;17:1–18.
8. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–444.
9. Litjens G, Kooi T, Bejnordi BE, et al. A survey on deep learning in medical image analysis. Med Image Anal 2017;42:60–88.
10. Ravi D, Wong C, Deligianni F, et al. Deep learning for health informatics. IEEE J Biomed Health Inform 2017;21:4–21.
11. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: Lect Notes Comput Sci; 2015.
12. Akkus Z, Galimzianova A, Hoogi A, Rubin DL, Erickson BJ. Deep learning for brain MRI segmentation: State of the art and future directions. J Digit Imaging 2017;30:449–459.
13. Shen D, Wu G, Suk H-I. Deep learning in medical image analysis. Annu Rev Biomed Eng 2017;19:221–248.
14. Gabr RE, Coronado I, Robinson M, et al. Brain and lesion segmentation in multiple sclerosis using fully convolutional neural networks: A large-scale study. Mult Scler J 2019;1352458519856843.
15. Torrey L, Shavlik J. Transfer learning. In: Handbook of research on machine learning applications, vol. 3; 2009:17–35.
16. Tajbakhsh N, Shin JY, Gurudu SR, et al. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Trans Med Imaging 2016;35:1299–1312.
17. Chartrand G, Cheng PM, Vorontsov E, et al. Deep learning: A primer for radiologists. Radiographics 2017;37:2113–2131.
18. Wong KCL, Syeda-Mahmood T, Moradi M. Building medical image classifiers with very limited data using segmentation networks. Med Image Anal 2018;49:105–116.
19. Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge, MA: MIT Press; 2016.
20. Cho J, Lee K, Shin E, Choy G, Do S. How much data is needed to train a medical image deep learning system to achieve necessary high accuracy? arXiv preprint arXiv:1511.06348; 2015.
21. Kalayeh HM, Landgrebe DA. Predicting the required number of training samples. IEEE Trans Pattern Anal Mach Intell 1983;5:664–667.
22. Mukherjee S, Tamayo P, Rogers S, et al. Estimating dataset size requirements for classifying DNA microarray data. J Comput Biol 2003;10:119–142.
23. Tam VH, Kabbara S, Yeh RF, Leary RH. Impact of sample size on the performance of multiple-model pharmacokinetic simulations. Antimicrob Agents Chemother 2006;50:3950–3952.
24. Dobbin KK, Zhao Y, Simon RM. How large a training set is needed to develop a classifier for microarray data? Clin Cancer Res 2008;14:108–114.
25. Kim S-Y. Effects of sample size on robustness and prediction accuracy of a prognostic gene signature. BMC Bioinformatics 2009;10:147.
26. Figueroa RL, Zeng-Treitler Q, Kandula S, Ngo LH. Predicting sample size required for classification performance. BMC Med Inform Decis Mak 2012;12:8.
27. Cortes C, Jackel LD, Solla SA, Vapnik V, Denker JS. Learning curves: Asymptotic values and rate of convergence. In: Adv Neural Inf Process Syst; 1994:327–334.
28. Lindsey JW, Scott TF, Lynch SG, et al. The CombiRx trial of combined therapy with interferon and glatiramer acetate in relapsing remitting MS: Design and baseline characteristics. Mult Scler Relat Disord 2012;1:81–86.
29. Lublin FD, Cofield SS, Cutter GR, et al. Randomized study combining interferon and glatiramer acetate in multiple sclerosis. Ann Neurol 2013;73:327–340.
30. Datta S, Narayana PA. A comprehensive approach to the segmentation of multichannel three-dimensional MR brain images in multiple sclerosis. NeuroImage Clin 2013;2:184–196.
31. Datta S, Sajja BR, He R, Wolinsky JS, Gupta RK, Narayana PA. Segmentation and quantification of black holes in multiple sclerosis. Neuroimage 2006;29:467–474.
32. Sajja BR, Datta S, He R, et al. Unified approach for multiple sclerosis lesion segmentation on brain MRI. Ann Biomed Eng 2006;34:142–151.
33. Bottou L. Large-scale machine learning with stochastic gradient descent. In: Proc COMPSTAT'2010. Berlin: Springer; 2010:177–186.
34. Sutskever I, Martens J, Dahl G, Hinton G. On the importance of initialization and momentum in deep learning. In: Int Conf Mach Learn; 2013:1139–1147.
35. Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proc 13th Int Conf Artif Intell Stat; 2010:249–256.
36. Sudre CH, Li W, Vercauteren T, Ourselin S, Cardoso MJ. Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Deep Learn Med Image Anal Multimodal Learn Clin Decis Support. Berlin: Springer; 2017:240–248.
37. Chollet F. Keras: Deep learning library for Theano and TensorFlow. https://keras.io; 2015.
38. Abadi M, Barham P, Chen J, et al. TensorFlow: A system for large-scale machine learning. In: Proc 12th USENIX Conf Oper Syst Des Implement; 2016.
39. Bengio Y, Louradour J, Collobert R, Weston J. Curriculum learning. In: Proc 26th Annu Int Conf Mach Learn; 2009:41–48.
