Author manuscript; available in PMC 2023 Dec 1.
Published in final edited form as: J Am Acad Dermatol. 2020 May 17;87(6):1343–1351. doi: 10.1016/j.jaad.2020.05.056

Deep Learning for Dermatologists: Part I. Fundamental Concepts

Dennis H Murphree 1,10, Pranav Puri 2,10, Huma Shamim 3,10, Spencer A Bezalel 3,10, Lisa A Drage 3,10, Michael Wang 4, Mark R Pittelkow 5,10, Rickey E Carter 6, Mark DP Davis 3,10, Alina G Bridges 3,8,10, Aaron R Mangold 5,10, James A Yiannias 5, Megha M Tollefson 3,10, Julia S Lehman 3,8,10, Alexander Meves 3,10, Clark C Otley 3,10, Olayemi Sokumbi 7,9,10, Matthew R Hall 7,10, Nneka Comfere 3,8,10
PMCID: PMC7669702  NIHMSID: NIHMS1596933  PMID: 32434009

Abstract

Artificial intelligence (AI) is generating substantial interest in the field of medicine. One form of AI, deep learning, has led to rapid advances in automated image analysis. In 2017, an algorithm demonstrated the ability to diagnose certain skin cancers from clinical photographs with the accuracy of an expert dermatologist. Deep learning has since been applied to a range of dermatology applications. Though AI will never replace experts, it will certainly impact the specialty of dermatology. In this first article of a two-part series, the basic concepts of deep learning are reviewed with the goal of laying the groundwork for effective communication between clinicians and technical colleagues. Part two of the series reviews the clinical applications of deep learning in dermatology, with attention to both limitations and opportunities.

Keywords: deep learning, machine learning, dermatology, artificial intelligence

Introduction

Deep learning (DL)1 is a subset of machine learning (ML)2,3 that has proven particularly effective in medical image analysis. Recent studies have reported algorithmic performance on par with or exceeding human accuracy for tasks ranging from detecting tuberculosis on chest radiographs4 to distinguishing melanoma from benign skin lesions in clinical images5. Though there is much speculation6 surrounding the future impact of DL, a recent JAAD review7 found that dermatologists remain underrepresented in publications on DL in dermatology. Through this article we hope to provide dermatologists the conceptual understanding of DL necessary to collaborate on DL applications.

Artificial Intelligence and Machine Learning

Machine learning and its subset deep learning (Fig. 1) are both components of the broader framework known as artificial intelligence (AI)8. These as well as other forms of AI use a combination of algorithms and data to accomplish a specific task or to answer a specific question. In the context of clinical practice this task or question might include understanding the current state of the patient (diagnosis), predicting the future state of the patient (prognosis), or predicting outcomes and complications of therapeutic interventions.

Figure 1:

Deep learning and machine learning are both subfields of artificial intelligence differing primarily by the type of algorithm/model used. Unlike general AI, which is the broad use of machines to imitate intelligent human behavior, both deep learning and machine learning learn from data without being explicitly programmed.

Machine learning encompasses a wide variety of algorithms, including linear models, tree-based models, and shallow neural networks. Deep learning is a subset of machine learning that is currently essentially synonymous with deep artificial neural networks. Although the cutoff is arbitrary, a neural network with three or fewer layers may be considered shallow, and one with more than three, deep.

Machine learning approaches2,3,9 (Table 1) can be categorized both by the type of question they answer and by the type of data they consider. In supervised learning the machine is trained on data (independent variables) associated with a known outcome/response (dependent variable). Supervised learning is common in predictive problems and includes traditional methods such as logistic regression as well as new approaches like neural networks. For example, classifying photographs of skin lesions using pathology diagnoses as the known outcome would be supervised learning. There are no known outcomes in unsupervised learning, and the algorithm seeks to discover potential relationships between data points. For example, in drug discovery, an unsupervised learning approach such as hierarchical clustering can find new molecules similar to those already known to be effective in disease treatment. Within supervised learning, regression algorithms predict a continuous variable such as survival probability, while classification algorithms predict a categorical outcome such as benign versus malignant.
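To make the supervised/unsupervised distinction concrete, the following is a minimal Python sketch using scikit-learn on synthetic data; the library choice and toy variables are illustrative assumptions, not methods prescribed by this article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))             # 200 observations, 2 predictors

# Supervised: a known outcome (0 = benign, 1 = malignant) guides training.
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy ground-truth labels
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))                 # predicted categories

# Unsupervised: no outcome; hierarchical clustering groups similar points.
groups = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(groups[:5])                         # discovered group assignments
```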

Table 1.

Essential Terms and Concepts

Machine Learning Vocabulary
Supervised Learning: Learns to predict a known outcome, such as lesion identity or sentinel lymph node metastasis.
Unsupervised Learning: Learns without a known outcome, for example grouping similar patients in a cohort.
Classification: Predicts the content of the image as a whole – is this a photo of a melanoma or a seborrheic keratosis (SK)?
Segmentation: Predicts the regional contents of an image – which parts of this biopsy are tumor? Which are inflammatory response?
Object Detection: Finds an object of interest – where is the mole in this photo?
Predictors: Input variables, also known as independent variables or covariates. In an image, these are the pixels.
Response: The outcome being predicted. In an image classification problem, this is the category of the input image.
Neural Network Training Definitions
Training Data: The subset of the cohort used to improve network performance.
Test Data: The subset of the cohort used solely for evaluating the network. These data are never considered during training.
Ground Truth: The gold-standard outcome associated with an image or observation. In studies where readers may disagree, the ground truth can be an average or a majority vote of professional opinions.
Backpropagation: The technical algorithm that makes efficient training possible. Although discussed at length in many textbooks, in practice it is completely abstracted away by software and requires no decisions in practical experiments.
Epoch: A training iteration during which the network examines all available data points – one "pass" through the entire training set. Performance is often plotted as a function of epoch.
Batch Size: The number of training examples considered in a given weight update. Batch size is important both because it affects overall training speed and performance, and because it may dictate the size of GPU needed. A GPU is a graphical processing unit, a specialized computer chip helpful for deep learning.
Neural Network Components
Weights or Parameters: The "instructions" a network uses to make predictions. Weights, also known as parameters, are updated during training until they reach values that maximize network performance.
Node: The basic computational unit of a neural network. Nodes combine inputs, weights, and an activation function to produce an output.
Activation Function: A non-linear function that is part of a node or, potentially, a layer of nodes. These functions are one of the main features that allow neural networks to improve on more basic linear models. Common activation functions include Softmax and ReLU. The activation function of the final layer must be chosen to match the type of output the network predicts.
Loss Function / Loss: A loss function quantifies how well a network predicts on a given dataset; the quantity it outputs is called the loss, although the two terms are often used interchangeably. Higher loss means worse performance, so the goal of training is to minimize the loss. Common loss functions include binary cross-entropy and categorical cross-entropy (the latter sometimes loosely called Softmax loss).
Optimization Method: The numerical routine used to update the weights so as to minimize the loss and thus improve performance. Common optimization methods include SGD, RMSProp, and Adam.
Learning Rate: An important parameter of the optimization routine that influences how much the weights are adjusted during training. The best learning rate is usually found through experimentation.

Typically, machine learning approaches accept input variables, known as covariates or predictors, and combine them to predict a specific outcome, often referred to as a response. In a simplified example, a patient's age and sun exposure history are the covariates, and the predicted risk of basal cell carcinoma (BCC) is the response. When used with images, the input variables are the individual pixels of the image.
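As a hedged illustration of the simplified example above, the sketch below fits a logistic regression with two covariates (age and a sun-exposure score) against a binary BCC response; the numbers are fabricated purely for demonstration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Covariates: [age in years, sun-exposure score]; values are made up.
X = np.array([[25, 1.0], [60, 7.5], [45, 3.2], [70, 8.8], [30, 2.0], [65, 9.1]])
# Response: 1 = BCC, 0 = no BCC (also made up).
y = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[55, 6.0]])[0, 1])  # predicted BCC risk for a new patient
```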

Response/outcome variables depend on the question of interest. Common tasks in image analysis include classification, object detection, and segmentation. In classification, the outcome is a categorization of the entire image, such as whether a photograph shows a BCC or an angiofibroma. In object detection, the outcomes are rectangular regions surrounding objects of interest, such as the locations of all BCCs on a patient's face. In segmentation, the outcome is the category of each pixel within the input image; a segmentation network might, for example, identify the borders of a BCC. Segmentation can thus be viewed as classification performed at pixel resolution rather than on the entire image.
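The response therefore takes a different shape for each task. The sketch below shows plausible output formats; the image dimensions and coordinate convention are assumptions for illustration.

```python
import numpy as np

H, W = 224, 224                             # image height and width in pixels
image = np.zeros((H, W, 3))                 # RGB input: the predictors are pixels

# Classification: one categorical outcome for the whole image.
class_probs = np.array([0.1, 0.9])          # e.g. P(angiofibroma), P(BCC)

# Object detection: a rectangular region per object of interest.
boxes = np.array([[40, 50, 120, 140]])      # [x_min, y_min, x_max, y_max] per BCC

# Segmentation: one categorical outcome per pixel.
pixel_labels = np.zeros((H, W), dtype=int)  # 0 = background, 1 = BCC
```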

A common form of unsupervised learning, where there is no outcome/response, is clustering. Clustering assigns data to groups based on similarity, for example to identify chemicals with similar structure. An important consideration in clustering is choosing how to measure similarity. Unsupervised learning is not currently common in dermatology.

The most promising advances in medical image analysis have leveraged a type of algorithm known as a neural network, and more specifically a variety known as a deep neural network (DNN)10–13. DNNs and their applications in dermatology are the focus of this review. Most current research involves supervised learning on known outcomes, such as identifying melanoma from dermatoscopic images or predicting metastasis risk from digitized pathology slides.

Artificial Neural Networks

Although analogies to biological nervous systems are common, an artificial neural network (ANN) is essentially a statistical model2,3,8,9 and is perhaps most usefully thought of as a small computer program that uses instructions called weights. These weights, which are adjusted during the training process, mathematically combine the input variables to calculate the predicted output. Weights are also known as parameters.

Neural networks differ from other linear or nomogram-based models such as the Melanoma Outcome Calculator14 by incorporating a special function known as an activation function. This non-linear function allows a neural network to capture a much broader spectrum of relationships between the input variables than can be captured by linear models such as logistic regression.
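A single node can be sketched in a few lines of Python; the input values and weights below are arbitrary, and ReLU stands in for the activation function.

```python
import numpy as np

def relu(z):
    """ReLU, a common non-linear activation function."""
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])    # input variables (covariates)
w = np.array([0.8, 0.1, 0.4])     # weights, adjusted during training
b = 0.2                           # bias term, also learned

output = relu(np.dot(w, x) + b)   # weighted combination, then non-linearity
print(output)
```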

Deep neural networks are sequentially ordered layers of “shallow” ANNs, with output from one layer becoming input to the next (Fig. 2). Each subsequent layer of a deep neural network learns something sequentially more complex. For example, early layers in a skin lesion classifier might identify lines. Middle layers might learn that some lines are curved to form ovals and that certain versions of ovals are neoplasms. The deepest layers might classify these lesions into an output category such as benign nevus, melanoma or seborrheic keratosis.
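A network of this layered form could be written in Keras as follows; this is a sketch only, and the layer sizes and 64 × 64 pixel input are assumptions rather than values from the article.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Fully-connected classifier in the spirit of Figure 2: three hidden layers
# plus an output layer giving probabilities for three lesion categories.
model = keras.Sequential([
    layers.Input(shape=(64 * 64 * 3,)),      # flattened pixel inputs
    layers.Dense(128, activation="relu"),    # hidden layer 1
    layers.Dense(64, activation="relu"),     # hidden layer 2
    layers.Dense(32, activation="relu"),     # hidden layer 3
    layers.Dense(3, activation="softmax"),   # output: nevus, SK, melanoma
])
model.summary()
```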

Figure 2:

Deep neural networks derive their predictive power from multiple learning layers, each of which identifies sequentially more complex aspects of the input image. In this example an unknown skin lesion is classified as benign nevus, seborrheic keratosis, or melanoma by a four-layer fully-connected network. The network is counted as four layers because the input layer, in contrast to the hidden and output layers, contains no tunable weights and so cannot help the network learn; by convention it is therefore not counted. The network is fully-connected because each node (blue circle) in a given layer is connected to every node in the subsequent layer. In practice a convolutional neural network (CNN) would be a more likely choice for this lesion classification problem. A CNN contains a special type of layer, the convolutional layer, which differs mathematically from the fully-connected layers depicted here and has performance characteristics well suited to image analysis.

Training

Before a neural network, whether deep or shallow, can be used, it must first be trained on data. Training means that internal components of the network known as weights are updated so that the network "learns" to answer the question of interest. The most important ingredient in successful training is the dataset. In supervised learning a dataset consists of both observations and labels, where a label is a "ground truth" (gold-standard) description of the corresponding observation. For example, in a lesion classification project, the dataset could be photographs of a variety of skin lesions along with an expert's diagnosis for each lesion. In a segmentation project, the labels would be images in which every pixel has been assigned a category, typically via a drawing tool used to encircle regions. Label quality is critical to the performance of the network, and in many cases the opinions of multiple experts are combined to construct the best label. This is particularly important when there may be substantial disagreement between reviewers.

During training, a network is evaluated on example data and its predictions are compared to the ground truth (the gold standard outcome discussed above). A function called the loss function measures how well the network’s predictions match the ground truth by calculating a quantity known as the loss. The goal of training is to minimize this loss, meaning predictions match the correct outcome as often as possible. An algorithm known as an optimizer updates the network’s weights to reduce loss and improve performance. This process of evaluating network performance on training data then updating the weights is repeated iteratively until the network performance no longer improves. Some of the concepts associated with network training are detailed in Table 1.
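The loop described above is largely automated by modern frameworks. Below is a self-contained Keras sketch using synthetic stand-in data; a real project would substitute labeled images, and the network size is arbitrary.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

x_train = np.random.rand(200, 32)             # 200 observations, 32 predictors
y_train = np.random.randint(0, 3, size=200)   # stand-in ground-truth labels

model = keras.Sequential([
    layers.Input(shape=(32,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(3, activation="softmax"),
])
model.compile(
    optimizer="adam",                          # the optimizer updates the weights
    loss="sparse_categorical_crossentropy",    # the loss function being minimized
    metrics=["accuracy"],
)
model.fit(x_train, y_train, epochs=5, batch_size=32)  # 5 epochs, 32 examples per update
```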

Prior to training, a dataset is divided into “training” and “test” subsets. The purpose of the test subset is to evaluate the model on data it has not seen during training. A typical training/test split is 80% of the data for training, 20% for test. Other ways of splitting the data that are common but beyond the scope of this article are cross-validation and bootstrap sampling9.
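An 80/20 split might look like the following; scikit-learn is one assumed tool for this step, and the arrays are stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

images = np.random.rand(1000, 64, 64, 3)   # stand-in for a photo dataset
labels = np.random.randint(0, 2, 1000)     # stand-in ground-truth labels

x_train, x_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, random_state=42)   # 80% train, 20% test
# x_test / y_test are held out and never touched during training.
```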

When training a neural network it is important to avoid overfitting, which occurs when the network learns to mimic its training data very closely but fails to perform well on new datasets. A classic sign of overfitting is observing performance on the training set improving while performance on the test set remains constant or worsens. Overfitting can result from using too large a network for the training dataset, from training too long, or from a bias inherent in the dataset. An example of the latter might be when a network trained on a single skin phototype generalizes poorly to other skin phototypes.
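One common guard is to monitor performance on held-out data each epoch and stop once it stops improving. The Keras callback below is one way this might be done, assuming the model and arrays from the earlier training sketch.

```python
from tensorflow.keras.callbacks import EarlyStopping

stopper = EarlyStopping(
    monitor="val_loss",         # watch loss on held-out validation data
    patience=3,                 # stop after 3 epochs without improvement
    restore_best_weights=True,  # keep the best weights seen so far
)
history = model.fit(
    x_train, y_train,
    validation_split=0.2,       # hold out 20% of training data for monitoring
    epochs=100,
    callbacks=[stopper],
)
# Training loss falling while validation loss rises is the classic sign above.
```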

Components

Most varieties of neural networks share a set of common components (Table 1). In general, to specify a network for a given study there are three main components that need to be chosen: architecture, loss function and optimization routine.

Architecture includes the number of layers in a network as well as the number and types of nodes (basic computational units) in a layer. The input layer accepts the basic independent variables as its inputs, while the output layer produces predictions. Hidden layers are those in between the input and output layers. Layers vary in terms of the number and type of nodes they contain. Many architectures for image analysis include a special type of layer called a convolutional layer and are thus termed convolutional neural networks (CNNs)1,8. Designing novel network architectures is a rich subfield of computer science, so practically speaking many applications in dermatology will leverage architectures developed by academic computer vision groups or technology companies such as Google. Examples of publicly available classification architectures include InceptionV315, VGG1616, and Xception17. Popular segmentation and object detection architectures include U-Net18 and YOLO19 respectively.
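These published architectures ship with common frameworks, so instantiating one takes a single call. The snippet below is a sketch using keras.applications; the 3-category output is an assumption.

```python
from tensorflow.keras.applications import InceptionV3

# Build the InceptionV3 architecture with random (untrained) weights and a
# 3-category output layer; VGG16 and Xception load the same way.
net = InceptionV3(weights=None, classes=3)
print(len(net.layers))   # hundreds of layers: very much a "deep" network
```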

The loss function evaluates the performance of the network on its training data. A simple example might be the error rate: the number of incorrect results divided by the total number of results. In practical use in dermatology, however, the loss function will likely be either categorical cross-entropy or binary cross-entropy. These loss functions, described in detail elsewhere2,8,9, account for multiple outcomes and rarer conditions better than a simple error rate. It is important to distinguish the loss function, which is used directly in training a network, from more general performance metrics such as accuracy or sensitivity. Many performance metrics can be calculated, typically on the test data, but the loss function is the quantity being actively improved during training.
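The difference matters because cross-entropy, unlike a raw error rate, penalizes confident mistakes. A hand-computed toy comparison follows, with numbers invented for illustration.

```python
import numpy as np

y_true = np.array([0, 1, 1])             # ground-truth classes
probs = np.array([[0.8, 0.2],            # predicted class probabilities
                  [0.4, 0.6],
                  [0.9, 0.1]])           # third row: a confident mistake

error_rate = np.mean(probs.argmax(axis=1) != y_true)            # 1/3
cross_entropy = -np.mean(np.log(probs[np.arange(3), y_true]))   # ~1.01
print(error_rate, cross_entropy)  # the confident miss dominates the loss
```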

Optimizers are algorithms that minimize the loss function described above by adjusting the network weights. A small set of optimizers is in common use currently, most of them variants of an algorithm known as stochastic gradient descent. Although the details are beyond the scope of this article, in practice the choice of optimizer is determined experimentally. An important parameter of the optimizer is the learning rate, a number that determines the scale of updates to the weights. Like the optimizer itself, the best learning rate is typically problem-dependent and thus best determined through trial and error. Some optimizers allow the learning rate to be adjusted during the course of training, adding another layer of complexity.
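In code, swapping optimizers or learning rates is a one-line change, which is why this experimentation is cheap. A Keras sketch follows; the learning rates are arbitrary starting points, not recommendations.

```python
from tensorflow.keras import optimizers

sgd = optimizers.SGD(learning_rate=0.01)           # stochastic gradient descent
rmsprop = optimizers.RMSprop(learning_rate=0.001)
adam = optimizers.Adam(learning_rate=0.001)
# Any of these can be passed to model.compile(optimizer=..., loss=...).
```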

Improving Performance

There are three main ways to improve the performance of a deep neural network: data augmentation, dropout, and transfer learning.

Data augmentation is a way to increase the size of a dataset by taking advantage of prior knowledge to generate new data8. For example, an image of a melanoma is still labeled a melanoma if it is rotated 90 degrees; in technical terms, the melanoma label is invariant under rotation. Thus, for every melanoma image in a dataset, rotated versions of that same image can be generated and added. Data augmentation is very useful because it increases the size of a dataset with relatively little effort.
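A rotation-based augmentation might be sketched as below; np.rot90 stands in for the richer augmentation pipelines most frameworks provide.

```python
import numpy as np

def augment_with_rotations(image):
    """Return the original image plus its 90/180/270-degree rotations."""
    return [np.rot90(image, k) for k in range(4)]

image = np.random.rand(64, 64, 3)   # stand-in for a melanoma photograph
augmented = augment_with_rotations(image)
print(len(augmented))               # 4 images from 1, all still labeled melanoma
```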

Dropout is a way of randomly turning off certain parts of the network in order to reduce the risk of overfitting8. The amount of dropout in a network can be determined experimentally and is considered part of the architecture.
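In Keras, dropout is added as its own layer; the 0.5 rate below is a common default, not a recommendation from the article.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(128,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),                    # randomly silence half the nodes
    layers.Dense(3, activation="softmax"),  # on each training update
])
```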

Transfer learning is another key tool for improving performance20. Thanks to prior researchers, networks trained on vast image datasets and demonstrated to perform well are publicly available. By starting with these pre-trained networks, one can build upon what has been learned from other image sets. Early layers in the pre-trained network will have already "learned" primitives such as curved lines from the very large general dataset. A common approach is to start with a pre-trained network and add a small number of problem-specific layers on top. During training the bottom layers can be frozen so that their weights are not updated, while the new top layers are fine-tuned to the problem at hand. As a concrete example, Esteva et al21 took the InceptionV3 network pre-trained on ImageNet, replaced the final layer with one appropriate to their experiment, and then re-trained on their data. This allowed their network to take advantage of what was learned from the 1.28 million images on which the original InceptionV3 was trained without needing those images; that information was effectively stored in the pre-trained network's weights. Without this, they likely would not have achieved such impressive performance.
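The recipe can be sketched in Keras as follows. This mirrors the general approach described above, not Esteva et al's actual code; the pooling choice and single new layer are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import InceptionV3

# Load InceptionV3 pre-trained on ImageNet, minus its original output layer.
base = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                      # freeze the pre-trained weights

model = keras.Sequential([
    base,                                   # already "knows" lines and curves
    layers.Dense(3, activation="softmax"),  # new problem-specific top layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# During model.fit(...), only the new top layer's weights are updated.
```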

Practical Considerations

Dataset Size

An important question in any deep learning project is how much data will be required. Unfortunately there is no universal answer, and even reliable guidelines are hard to come by. For example, a 2016 deep learning textbook8 suggested 5,000 examples per output category were needed for "acceptable performance," with more than 10 million total examples needed to exceed human performance. Meanwhile, an estimate based on the performance of AlexNet22, an early neural network, on the large publicly available ImageNet dataset23 suggests 1,000 examples per output category. In the dermatology literature one can find examples ranging from 600 to 14,000 examples per class, with large study-specific variation in performance. A variety of important experimental factors affect the required dataset size, including class balance and similarity, use of transfer learning, and the incorporation of multiple data types. Unfortunately, no simple relationship stands out between sample size and the quality achieved.

Computational Infrastructure

Graphical processing units, or GPUs, are a specialized type of computer hardware critical to training deep neural networks; they are needed to perform the required calculations efficiently. Even with GPUs, some experiments take days to weeks of computer time. Special software frameworks are also helpful in training deep neural networks, with popular current packages including TensorFlow24 and PyTorch25. While the above are critical for training, using a trained network in practice requires only modest resources. Many modern smartphone apps, for example Apple's Siri, contain trained DNNs.

Avoiding Mistakes

Training DNNs requires great care to ensure that results are applicable to the clinical goals. This necessitates close collaboration between medical and technical specialists throughout a DNN project. The most challenging and the most important task is to formulate a suitable clinical question. This clinical question then guides data collection, labeling, algorithm choice, training, and result assessment (Fig. 3). A common concern is that networks trained on one population typically do not generalize to different populations, and artificial intelligence projects are susceptible to all the biases that could adversely affect traditional medical research.

Figure 3:

Steps involved in a stereotypical machine learning study along with a subset of corresponding considerations.

Misleading results can also arise from training itself. An important concern is whether the network is learning its intended task or an unrelated situational cue. For example, Esteva et al5 found that a ruler on the skin increased the predicted probability of melanoma, and Yap et al26 omitted training images with identifiable anatomical features to prevent bias, e.g., a nose in the photograph increasing the probability of BCC. These may or may not be concerns depending on the network's intended use. In a teledermatology application, the skin ruler might legitimately incorporate information about how concerned the remote provider is. Similarly, if a lesion on the nose is genuinely more likely to be a BCC than the same lesion elsewhere, this information would be useful. Trickier situations arise when the potential bias is less obvious. For example, in digital pathology, if all examples of a rare disease are acquired on a specific type of slide scanner (perhaps by a collaborating institution), the network might learn to identify the scanner rather than the disease; such a network would then fail to identify the disease in images acquired on a different scanner. Visualizing what the network is cueing on through techniques like saliency maps27 can help avoid biases like these. Another common pitfall is allowing multiple examples from the same case to appear in both the training and test datasets, which positively biases performance estimates. For example, all photographs of a given skin lesion must appear in either the training subset or the test subset, but not be split across both.
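One way to guard against this leakage is to split by case identifier rather than by image, so every photograph of a lesion lands in the same subset. The sketch below uses scikit-learn's GroupShuffleSplit, an assumed tool choice, with toy identifiers.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_images = 10
lesion_ids = np.array([0, 0, 1, 1, 1, 2, 2, 3, 4, 4])   # photo -> lesion (toy)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(np.arange(n_images), groups=lesion_ids))
# No lesion id appears in both train_idx and test_idx.
print(lesion_ids[train_idx], lesion_ids[test_idx])
```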

Discussion and Conclusion

Deep learning is a rapidly evolving field that offers a powerful vehicle towards enhancing the clinical practice of dermatology. In this first part of a two-part series we have reviewed some of the basic concepts of deep learning most applicable to this specialty.

Footnotes

Conflicts of Interest: None declared.

References

  1. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–44.
  2. Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York, NY: Springer; 2009.
  3. Kuhn M, Johnson K. Applied Predictive Modeling. New York: Springer; 2013.
  4. Lakhani P, Sundaram B. Deep Learning at Chest Radiography: Automated Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks. Radiology 2017;284:574–82.
  5. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542:115–8.
  6. Emanuel EJ, Wachter RM. Artificial Intelligence in Health Care: Will the Value Match the Hype? JAMA 2019;321:2281–2.
  7. Zakhem GA, Fakhoury JW, Motosko CC, Ho RS. Characterizing the Role of Dermatologists in Developing AI for Assessment of Skin Cancer: A Systematic Review. J Am Acad Dermatol 2020.
  8. Goodfellow I, Bengio Y, Courville A. Deep Learning. Cambridge, MA: The MIT Press; 2016.
  9. James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning: with Applications in R. New York: Springer; 2013.
  10. Litjens G, Ciompi F, Wolterink JM, et al. State-of-the-Art Deep Learning in Cardiovascular Image Analysis. JACC Cardiovasc Imaging 2019;12:1549–65.
  11. Ting DSW, Peng L, Varadarajan AV, et al. Deep learning in ophthalmology: The technical and clinical considerations. Prog Retin Eye Res 2019;72:100759.
  12. Valliani AA, Ranti D, Oermann EK. Deep Learning and Neurology: A Systematic Review. Neurol Ther 2019;8:351–65.
  13. Wang S, Yang DM, Rong R, Zhan X, Xiao G. Pathology Image Analysis Using Segmentation Deep Learning Algorithms. Am J Pathol 2019;189:1686–98.
  14. Melanoma Outcome Calculator. 2007. Available at: http://www.lifemath.net/cancer/melanoma/outcome/index.php.
  15. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the Inception Architecture for Computer Vision. 2015. Available at: https://arxiv.org/abs/1512.00567.
  16. Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. 2014. Available at: https://arxiv.org/abs/1409.1556.
  17. Chollet F. Xception: Deep Learning with Depthwise Separable Convolutions. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 21–26 July 2017. p. 1800–7.
  18. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. Cham: Springer International Publishing; 2015. p. 234–41.
  19. Redmon J, Divvala S, Girshick R, Farhadi A. You Only Look Once: Unified, Real-Time Object Detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 27–30 June 2016. p. 779–88.
  20. Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural networks? Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2. Montreal, Canada: MIT Press; 2014. p. 3320–8.
  21. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542:115–8.
  22. Krizhevsky A, Sutskever I, Hinton GE. ImageNet Classification with Deep Convolutional Neural Networks. Commun ACM 2017;60:84–90.
  23. Russakovsky O, Deng J, Su H, et al. ImageNet Large Scale Visual Recognition Challenge. Int J Comput Vision 2015;115:211–52.
  24. Abadi M. TensorFlow: Learning Functions at Scale. ACM SIGPLAN Notices 2016;51:1.
  25. Paszke A, Gross S, Massa F, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019 Proceedings; 2019. p. 8024–35.
  26. Yap J, Yolland W, Tschandl P. Multimodal skin lesion classification using deep learning. Exp Dermatol 2018;27:1261–7.
  27. Simonyan K, Vedaldi A, Zisserman A. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. International Conference on Learning Representations; 2014.
