Author manuscript; available in PMC: 2020 Nov 27.
Published in final edited form as: IEEE Access. 2020 Aug 6;8:181590–181604. doi: 10.1109/access.2020.3014175

Semantic Segmentation of Smartphone Wound Images: Comparative Analysis of AHRF and CNN-Based Approaches

AMEYA WAGH 1, SHUBHAM JAIN 1, APRATIM MUKHERJEE 2, EMMANUEL AGU 1, PEDER PEDERSEN 1, DIANE STRONG 1, BENGISU TULU 1, CLIFFORD LINDSAY 3, ZIYANG LIU 1
PMCID: PMC7695230  NIHMSID: NIHMS1637061  PMID: 33251080

Abstract

Smartphone wound image analysis has recently emerged as a viable way to assess healing progress and provide actionable feedback to patients and caregivers between hospital appointments. Segmentation is a key image analysis step, after which attributes of the wound segment (e.g. wound area and tissue composition) can be analyzed. The Associative Hierarchical Random Field (AHRF) formulates image segmentation as a graph optimization problem: handcrafted features are extracted and then classified using machine learning classifiers. More recently, deep learning approaches have emerged and demonstrated superior performance for a wide range of image analysis tasks. FCN, U-Net and DeepLabV3 are Convolutional Neural Networks used for semantic segmentation. While each of these methods has shown promising results in separate experiments, no prior work has comprehensively and systematically compared the approaches on the same large wound image dataset, or more generally compared deep learning vs non-deep learning wound image segmentation approaches. In this paper, we compare the segmentation performance of AHRF and CNN approaches (FCN, U-Net, DeepLabV3) using various metrics including segmentation accuracy (dice score), inference time, amount of training data required and performance on diverse wound sizes and tissue types. Improvements possible using various image pre- and post-processing techniques are also explored. As access to adequate medical images/data is a common constraint, we explore the sensitivity of the approaches to the size of the wound dataset. We found that for small datasets (< 300 images), AHRF is more accurate than U-Net but not as accurate as FCN and DeepLabV3; AHRF is also over 1000x slower. For larger datasets (> 300 images), AHRF saturates quickly, and all CNN approaches (FCN, U-Net and DeepLabV3) are significantly more accurate than AHRF.

Index Terms—: Wound image analysis, semantic segmentation, chronic wounds, U-Net, FCN, DeepLabV3, Associative Hierarchical Random Fields, Convolutional Neural Network, Contrast Limited Adaptive Histogram Equalization

I. Introduction

Diabetes mellitus is a serious medical condition that affected 30.3 million people in the United States in 2017 [1]. About 15% of diabetes patients in the US develop chronic wounds, with an annual treatment cost of about $25 billion [2]. The majority of diabetic wounds are located in the lower extremities, may take years to heal, can recur, and can adversely affect the physical and mental health of the patient if not treated by experts regularly.

Chronic wound care requires regular checkups by wound nurses who debride the wound, inspect its healing progress and recommend visits to wound experts when necessary. Accurate and timely care decisions are crucial for proper wound healing, and delays in visiting a wound specialist could result in limb amputation. To reduce delays in care decisions, wound nurses often send remote wound images to experts for decisions on the best treatment options. Since 2011, our group has been researching and developing the Smartphone Wound Analysis and Decision-Support (SmartWAnDS) system, which can intelligently recommend wound care decisions by analyzing images of a patient’s wound and information in their Electronic Health Records (EHR), providing a second opinion for nurses working in remote locations. We envision that SmartWAnDS will standardize the quality of wound care even when the care is provided by nurses without wound expertise and reduce the workload of wound experts. For example, SmartWAnDS could recommend when patients need visits to wound experts, provide healing scores or suggest minor changes in treatment. The SmartWAnDS system will be available as a smartphone app that can analyze wound images captured using the phone’s camera, along with the patient’s EHR.

The visual characteristics of a wound that are useful in evaluating its health include its size, infection level, granulation tissue amount, necrotic tissue amount, slough and wound depth [3], [4], [5]. However, prior clinical studies have found wound size to be the most important measure of wound health [6]. For instance, the change in the size of a chronic wound over a 4-week period is an accurate predictor of whether the wound will heal or not [6]. Consequently, segmentation is an important step in most wound image analysis pipelines. The goal of our wound segmentation task is to label each pixel of a wound image into one of three semantic categories - wound, skin and background (also called semantic segmentation). Image segmentation has traditionally been performed using methods such as Conditional Random Fields (CRFs) and variants such as the Associative Hierarchical Random Field (AHRF). However, following the unprecedented success of Convolutional Neural Networks (CNNs) for image classification in 2012 (AlexNet) [7], CNNs have been found to outperform traditional methods for several computer vision tasks such as image classification [7], segmentation [8] and object detection [9].

Fully Convolutional Networks (FCNs) [10], U-Net [11] and DeepLabV3 [8] are deep learning-based segmentation networks that have outperformed traditional image segmentation methods when given enough data. Wound image analysis has also recently started using deep learning for wound image classification and segmentation, as seen in DeepWound [12] and DFUNet [13]. However, to the best of our knowledge, no systematic comparison between deep learning approaches and traditional (non-deep learning, graphical or CRF-based) techniques for wound image segmentation has been performed.

In this paper, we present a systematic and comprehensive comparison between Associative Hierarchical Random Fields (AHRF) and three deep learning based models (Fully Convolutional Networks (FCN), U-Net and DeepLabV3) for the task of wound image segmentation. We compare these approaches using a diverse set of performance metrics including segmentation accuracy (dice coefficient), sensitivity to the amount of training data utilized and model inference time. As real-world images and data of actual patients are often difficult to obtain in many medical applications, it is important to compare the performance of these methods with respect to the size of the training datasets. Deep learning methods are well known to be data intensive. We found that when the number of training images is small (< 300), AHRF (traditional) has a higher accuracy (dice coefficient) than U-Net but is still not as accurate as FCN and DeepLabV3 which were pre-trained on a subset of the COCO [14] dataset. As the number of training images increases, AHRF begins to saturate and the accuracy gap between AHRF and U-Net shrinks with U-Net eventually becoming more accurate than AHRF. FCN and DeepLabV3 consistently outperformed both U-Net and AHRF for all training set sizes. As we envision that our SmartWAnDS wound assessment system will eventually be deployed on a smartphone, we also examined the computational requirements of each method, inference time, and the need to communicate with a remote server.

The rest of this paper is organized as follows. Section II provides a brief background on the techniques used in this paper followed by the related work in image segmentation in Section III. The methodology used in this paper and a description of the wound image dataset utilized for training is located in Section IV. Sections V and VI present our results and a discussion of our major experiments and analyses of our findings. Finally, in Section VII, we conclude and suggest some directions for future work.

II. Background

We compared semantic segmentation of wound images using Associative Hierarchical Random Fields (AHRFs) and Convolutional Neural Networks (CNNs) for assigning a label of skin, wound or background to each pixel of an input image. Some background on both approaches is now presented.

A. Associative Hierarchical Random Fields (AHRFs)

Conditional Random Fields (CRFs) model data probabilistically and have been found to be effective for various machine learning prediction tasks. AHRFs [15], a variant of CRFs, leverage contextual data by considering other pixels in the neighbourhood of the target pixel to be classified, which works better than considering each pixel’s label in isolation. AHRFs model the conditional probability that a given pixel should be assigned a certain label by considering the pixel itself as well as other pixels in its neighbourhood. An energy function consisting of unary, pairwise and higher order potentials is minimized to find the optimal semantic labels for a given image. The unary potential takes features extracted from the target pixel as input and outputs a probability score for each target class. The pairwise potential ensures that nearby pixels that have similar features are assigned the same label. Higher order potentials are constructed such that pixels belonging to the same superpixels or cliques have the same label. Graph solving techniques are then used to minimize the energy and determine the optimal labeling. Details about AHRF, including the energy function minimized, are presented in the Methodology section as Equation 1.

B. Convolutional Neural Networks (CNNs)

CNNs have been found quite effective for many computer vision tasks in recent years. They act as trainable image filters which can be used to convolve over images sequentially to measure responses or activations of the input image, creating feature maps. These feature maps are then stacked together, passed through non-linear functions, and further convolved with more filters. This convolution process has been found to be effective at extracting visual features or patterns in images that can be useful for tasks such as classification, segmentation, and super resolution. In this paper, we compare three CNN-based architectures for semantic segmentation: FCNs, DeeplabV3 and U-Net, which we now review briefly.

1). Fully Convolutional Network (FCN):

As CNNs have generally performed well for per-pixel tasks, Long et al first proposed using FCNs trained end-to-end for semantic segmentation. FCN utilizes a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. FCNs have only locally connected layers, such as convolutions, pooling and upsampling, avoiding any densely connected layer. FCN also uses skip connections from its pooling layers to recover fine-grained spatial information that is lost during downsampling.

2). U-Net:

U-Net [11] is an encoder-decoder architecture that uses CNNs. Encoder-decoder networks, as the name suggests, have two parts - an encoder and a decoder. The encoder is responsible for projecting the input feature vectors into a low dimensional space in which similar features lie close together. The decoder network takes features from this low dimensional space as input and attempts to recreate the original input features. The output of the encoder (equivalently, the input of the decoder) is called the bottleneck region, where the low dimensional representation is present. Encoder-decoder networks have been found to be effective for various tasks such as image denoising, language translation and image segmentation.

3). DeepLabV3:

DeepLabV3 [8] utilizes atrous convolutions along with spatial pyramid pooling which enlarges the field of view of filters to incorporate larger context and controls the resolution of features extracted. Employing atrous convolutions in either cascade or in parallel captures multi-scale context due to the use of multiple atrous rates. DeepLabV3 uses a backbone network such as a ResNet [16] as its main feature extractor except that the last block is modified to use atrous convolutions with different dilation rates.

III. Related Work

A. Probabilistic Techniques for Wound Image Analysis

Prior to the rise in the popularity of deep learning, wound analysis mostly utilized probabilistic techniques such as color space manipulation [17] [18], machine learning classifiers using hand-crafted features [19], clustering techniques [20] and edge detection [21]. These probabilistic approaches generally have the advantage of not being very data intensive, as they use hand-crafted features and shallow machine learning models. However, they fail to generalize well to new images captured under varied lighting conditions and across different skin and wound types. For the purpose of comparison with deep learning, in this paper we use Associative Hierarchical Random Fields (AHRF) [15] as a probabilistic solution for image segmentation. AHRF uses region growing for connecting pixels that have similar visual features and also uses a combination of handcrafted and learned features for semantic segmentation of an image.

B. CNN-based Image Segmentation Techniques

Researchers have applied CNNs to biomedical applications such as wound segmentation using transfer learning [22], using lightweight mobile deep learning architectures (MobileNet) for wound segmentation [23], region proposal-based Faster R-CNN model for wound localization [24], and the inception module based CNN for classification of skin into healthy and abnormal [13]. These methods all try to segment wound pixels but do not distinguish the skin region from the background in the image. Li et al [25] proposed a method to segment out skin pixels using heuristics for thresholding and region growing as a first step, and then passed forward the cropped image with detected skin to the MobileNet CNN architecture for wound segmentation.

The downside to using neural networks is that they require large datasets to train from scratch, which are not always available in applications that use medical or clinical data. This problem can be alleviated by using techniques such as data augmentation, which increases variation in the existing data, and transfer learning, which uses models that have previously been trained for similar vision tasks. The deep learning segmentation methods utilized in this paper were organized in two different ways: U-Net had separate classifiers for wound and skin, while FCN and DeepLabV3 had a single classifier for both skin and wound. This enabled us to compare whether the arrangement of classifiers affected the models’ performance.

IV. Methodology

A. Datasets of Wound Images

We gathered 3 different datasets as described below, which include diabetic foot ulcers, arterial, venous, pressure ulcers and surgical wounds. Many of the images exhibit typical wound attributes such as granulation, necrosis and slough. A wound annotation app (shown in Fig-2) was specifically created to expedite pixel-level annotations of wound and skin segments within the given images. The wound annotation app implemented the deep extreme cut algorithm [26], providing consistent wound annotation. Specifically, we did not rely on human labelers, which obviated the need for evaluating interrater reliability.

Fig. 2: Annotation app with wound image view (left), preview of the mask after annotating the wound image (right)

  • Dataset 1 consists of 114 wound images captured with controlled lighting conditions. A wound imaging box was created [27] that simulated a consistent, homogeneous lighting environment. The segmentation masks consist of pixel-level labels where the red color corresponds to the wound segment, yellow corresponds to the skin segment and the background is indicated by a green-colored mask.

  • Dataset 2 was gathered by scraping publicly available wound images from the internet. It consists of 202 images collected by scraping and 114 images from dataset 1, which yields a total of 316 images. This dataset has images with varying lighting conditions but the wounds were mostly captured from a relatively perpendicular angle.

  • Dataset 3 is the largest dataset with 1442 images in total, which was acquired from the vascular surgery department of the University of Massachusetts Medical Center. This dataset has images with large variations in lighting, viewing angles, wound types and skin texture.

Table-I shows the mean and standard deviation of the normalized values in the R, G, B channels. It can be observed that the standard deviation of the RGB values is lowest for dataset 1, as the images were captured using a wound box with controlled lighting and imaging distance, and is highest for dataset 3. Table-II shows the image statistics of only wound and only skin pixels, obtained by cropping the image with the ground truth mask. The standard deviations are quite high for both wound and skin, showing significant variations in our datasets. Table-II also shows the average percentage of wound and skin pixels within a wound image and their corresponding standard deviation. It can be seen that the average wound percentage is less than 10% whereas skin covers almost 50%, creating a class imbalance.

TABLE I:

Statistics of the Datasets (whole images)

Dataset R Avg (Std) G Avg (Std) B Avg (Std)
Dataset 1 .535 (.144) .533 (.142) .529 (.141)
Dataset 2 .459 (.153) .462 (.154) .463 (.155)
Dataset 3 .472 (.172) .472 (.172) .473 (.173)

Mean and standard deviation of normalized images in the R, G, B channels

TABLE II:

Statistics of the Datasets (wound and skin regions)

Dataset R Avg (Std) G Avg (Std) B Avg (Std) % Avg (Std)
D1-Wound .475 (.080) .273 (.099) .232 (.089) 7.69 ( 7.64)
D2-Wound .518 (.104) .315 (.120) .286 (.112) 11.03 (10.15)
D3-Wound .515 (.106) .310 (.118) .260 (.109) 6.63 (7.95)

D1-Skin .489 (.137) .367 (.125) .308 (.121) 56.43 (13.13)
D2-Skin .565 (.143) .414 (.135) .392 (.133) 52.55 (15.22)
D3-Skin .577 (.133) .429 (.128) .363 (.126) 47.74 (17.69)

Mean and Standard Deviation of normalized images in R,G,B channels cropped with wound and skin masks

B. Wound Image Pre-processing

In order to make our algorithms more robust to lighting variations and noisy imaging conditions, several pre-processing techniques were explored. Most of these techniques involve manipulating the image’s histogram in some form; the histogram is the probability distribution of pixel intensity values within an image, ranging from 0 to 255. After experimenting with the impact on semantic segmentation accuracy of many techniques, such as image sharpening, histogram normalization, contrast enhancement, vignetting, gamma correction, reflectance, histogram matching and Contrast Limited Adaptive Histogram Equalization (CLAHE), we found that CLAHE was consistently the most effective pre-processing technique.

Contrast Limited Adaptive Histogram Equalization (CLAHE):

CLAHE [28] is an image pre-processing technique based on adaptive histogram equalization [29] which contextually equalizes the histogram of local image regions. The intensity of each pixel is transformed in proportion to its rank among the intensities of its neighbours within a window defined by a kernel size. Adaptive histogram equalization was found to significantly enhance both the signal and noise components of an image, which was not desired. CLAHE ensures that noise enhancement is reduced by using a contrast limiting factor called the clip limit. This user-defined limit is used as the maximum allowable local contrast enhancement factor. A grid search over the kernel size and clip limit was performed, yielding a kernel size of (24, 24) and a clip limit of 3.0 as the optimal hyperparameters for our dataset. An example of CLAHE pre-processing with our hyperparameters is shown in Fig-3.
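A minimal pre-processing sketch using OpenCV's CLAHE implementation with the hyperparameters reported above (clip limit 3.0, kernel size (24, 24)) is shown below. Applying the equalization to the lightness channel of the LAB color space is our assumption for illustration; the paper does not state which channel was equalized.

```python
import cv2

def apply_clahe(bgr_image, clip_limit=3.0, tile_grid_size=(24, 24)):
    """Contrast Limited Adaptive Histogram Equalization on the L (lightness) channel.

    Clip limit and tile (kernel) size follow the values reported in the text;
    equalizing the LAB lightness channel is an illustrative assumption.
    """
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
    l_eq = clahe.apply(l)
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)

# Example usage on a wound photo loaded with OpenCV:
# image = cv2.imread("wound.jpg")
# enhanced = apply_clahe(image)
```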

Fig. 3: An example image (left) with the Contrast Limited Adaptive Histogram Equalization image (right)

C. Associative Hierarchical Random Field (AHRF)

Image segmentation using AHRF, a variant of CRF, consists of two parts: 1) calculating the energy value for an image given its pixel-wise labels, which considers both local features and similar neighboring pixels, and 2) a graph solving approach, which tries to determine the assignment of labels to an image that minimizes this energy function. The mathematical formulation of AHRF is explained below. A high-level workflow of AHRF is also shown in Fig-4.

Fig. 4: AHRF implementation and workflow

Formulation:

Let us first define the following variables -

X = {X_1, X_2, ..., X_n} are the variables to be labelled

L = the set of labels from which each X_i is labelled

y_i = the label assigned to X_i, such that y_i ∈ L

M = the number of paired training instances of the form {x^(i), y^(i)}, i = 1, ..., M

V = {1, 2, ..., n}, the set of valid vertices or indices of X

N = the neighbourhood system defined by the sets N_i, i ∈ V, where N_i denotes the set of all neighbours of X_i

C = the set of all cliques c, where a clique X_c is a set of variables X that are similar and codependent, such as super-pixels

y_c = the labelling given to each clique c

Using the variables defined above, an AHRF formulation consists of an energy function E which is written as the sum of unary, pairwise and clique-wise potential as shown in equation 1 below.

E(\mathbf{y}) = \underbrace{\sum_{i \in V} \phi_i(y_i, \theta_u)}_{\text{unary potential}} + \underbrace{\sum_{i \in V,\, j \in N_i} \phi^{p}_{ij}(y_i, y_j, \theta_p)}_{\text{pairwise potential}} + \underbrace{\sum_{c \in C} \phi^{h}_{c}(y_c)}_{\text{higher order potential}}   (1)

In the above formulation, θ_u and θ_p are sets of parameters that are learned from the paired training samples {x^(i), y^(i)}, i = 1, ..., M, with the objective of maximizing the conditional distribution P(y|X). The higher order potential is described in Equation 2 below.

\phi^{h}_{c}(y_c) = \min_{l \in L}\left(\gamma^{\max}_{c},\; \gamma^{l}_{c} + \sum_{i \in c} w_i\, k^{l}_{c}\, \Delta(y_i \neq l)\right)   (2)

where w_i is the weight of the variable x_i and each variable of a clique is penalized with a cost w_i k_c^l if it has not taken the value of the dominant label of that clique. The value of the penalty is truncated at γ_c^max. This formulation also supports higher order super-pixel based potentials across multiple scales of the image, since it allows cliques to take a free label in the case of multiple dominant labels and also considers relationships between cliques to increase contextual awareness. We have used mean shift segmentation to generate superpixels. Several different features have been used to calculate the AHRF potentials, including TextonBoost features in the RGB and LAB colorspaces, local binary patterns, Histogram of Oriented Gradients (HOG), SIFT features and color distribution features. Given the potential terms and parameters, the optimal labeling can be found by minimizing the overall energy using graph-cut based move making algorithms such as alpha expansion or alpha-beta swap.
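To make the energy formulation concrete, the sketch below evaluates Equations 1-2 for a given labeling using NumPy, assuming precomputed unary potentials, a simple Potts pairwise term, and uniform clique weights and costs. These simplifications (a constant theta_p, w_i = 1, and a single k value shared by all labels) are illustrative only; the actual AHRF implementation learns its parameters and uses the richer feature-based potentials described above.

```python
import numpy as np

def ahrf_energy(labels, unary, superpixels, theta_p=1.0,
                gamma_l=0.5, gamma_max=2.0, k_l=1.0):
    """Illustrative evaluation of the AHRF energy (Eqs. 1-2) for one labeling.

    labels:      (H, W) integer array of per-pixel labels y_i
    unary:       (H, W, L) array where unary[r, c, l] = phi_i(y_i = l)
    superpixels: (H, W) integer array of clique (superpixel) ids
    The Potts pairwise term and uniform clique weights/costs are simplifying
    assumptions, not the learned potentials used in the paper.
    """
    H, W, L = unary.shape
    # Unary term: sum of phi_i(y_i) over all pixels
    e_unary = unary[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()

    # Pairwise term: Potts penalty theta_p for every differing 4-neighbour pair
    e_pair = theta_p * ((labels[:, 1:] != labels[:, :-1]).sum()
                        + (labels[1:, :] != labels[:-1, :]).sum())

    # Higher order term: robust clique potential per superpixel (Eq. 2)
    e_clique = 0.0
    for c in np.unique(superpixels):
        y_c = labels[superpixels == c]
        # cost of the best dominant label, truncated at gamma_max
        costs = [gamma_l + k_l * np.count_nonzero(y_c != l) for l in range(L)]
        e_clique += min(gamma_max, min(costs))
    return e_unary + e_pair + e_clique
```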

D. Semantic Segmentation Architectures using CNNs

1). Fully Convolutional Networks (FCNs):

FCNs differ from the classic CNNs used for image classification tasks. The CNN pipeline for image classification usually has a structure with several convolution layers followed by fully connected layers, and outputs one predicted label per image. In contrast, Long et al describe a Fully Convolutional Network (FCN) as one that uses only convolutions, pooling and activation functions and computes a nonlinear filter [10]. It achieved state-of-the-art segmentation on PASCAL VOC 2012 [30], NYUDv2 and SIFT Flow in 2015.

Classification networks can be converted into FCNs by eliminating the final classifier layers and appending a 1x1 convolution layer with a channel dimension equal to the number of classes to be predicted. This also allows the network to accept arbitrary sized images as input. This modification performs well on segmentation tasks but the output is coarse, which is remedied by adding skips that combine outputs from the lower layers with finer strides to generate the final prediction. This refines the output as local information from the lower layers makes the model pay attention to the global structure. Upsampling is required to fuse these outputs, which is done by deconvolution layers.

Network Structure:

We utilized ResNet101 [16] as the backbone of this network. The model consists of four layers followed by a classifier that segments the pixels into their respective classes. The four layers contain 3, 4, 23 and 3 bottleneck units respectively where each bottleneck consists of four convolution layers that are followed by a batch normalization step. The ReLU activation function is used after each bottleneck.

The third convolution layer in the bottleneck is a 3x3 convolutional operation while the rest are 1x1 convolutions. After the second layer, the bottleneck layers have an added dilation factor in the 3x3 convolutions for improving performance. The classifier consists of a 3x3 convolution followed by batch normalization and ReLU with dropout steps, ending with a 1x1 convolution with a channel dimension equal to the number of output classes.
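As an illustration of how such a model can be instantiated, the sketch below loads torchvision's FCN with a ResNet101 backbone, pre-trained on a subset of COCO train2017, and replaces the final 1x1 convolutions so that the head predicts our three classes (wound, skin, background). The exact torchvision version and training flags used in the paper are not stated, so this mirrors the description only approximately.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 3  # wound, skin, background

# FCN-ResNet101 pre-trained on a subset of COCO train2017 (older torchvision API)
fcn = models.segmentation.fcn_resnet101(pretrained=True)

# Replace the final 1x1 convolutions of the main and auxiliary heads so the
# network outputs NUM_CLASSES channels instead of the pre-training classes.
fcn.classifier[4] = nn.Conv2d(512, NUM_CLASSES, kernel_size=1)
if fcn.aux_classifier is not None:
    fcn.aux_classifier[4] = nn.Conv2d(256, NUM_CLASSES, kernel_size=1)

# Forward pass on a 512x384 input; the 'out' entry holds per-pixel class scores
# upsampled back to the input resolution.
fcn.eval()
x = torch.randn(1, 3, 384, 512)          # (N, C, H, W)
with torch.no_grad():
    logits = fcn(x)["out"]               # shape: (1, NUM_CLASSES, 384, 512)
```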

2). U-Net:

U-Net is a Convolutional Neural Network (CNN) encoder-decoder segmentation architecture proposed by Ronneberger et al [11]. It won the ISBI cell tracking challenge in 2015 and has since been found to perform well on diverse applications of segmentation to medical images. U-Net moves and analyzes a sliding window over a large image, which enables the network to learn contextual information about the image. In our wound segmentation task, this is useful as the network needs to learn the context of skin and discover wound segments inside it. Based on fully convolutional neural networks, U-Net takes advantage of high resolution features from the convolution layers to learn the optimal up-sampling of the image.

Network Structure:

The contracting path consists of 5 down-convolution blocks. Each block consists of a 3x3 convolution operation with ReLU activation and a 2x2 max-pooling operation. The U-Net architecture was slightly modified by adding a batch-normalization layer after each convolution layer in order to normalize the activations. A dropout layer was also added at the end of each block to prevent over-fitting.

In the expanding path, the transpose convolution operation is utilized for upsampling. A standard convolution computes, at each location, the dot product between the kernel and the corresponding patch of the image. Transpose convolution does the opposite: it takes a single value from the feature map and multiplies it by all values of the learned kernel. This helps in fine-grained up-sampling of the feature map. To facilitate the up-sampling operation, features from the corresponding encoder convolution layers are concatenated to the feature map obtained from the previous layer. As the contracting and expanding paths are symmetric, a U-shape is formed (as seen in Fig-6), from which the architecture gets its name.
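A minimal PyTorch sketch of the modified blocks described above is given below: a contracting block with batch normalization and dropout, and an expanding block that upsamples with a transpose convolution and concatenates the encoder's skip features. Channel counts, the dropout rate, and the number of convolutions per block are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Contracting-path block: 3x3 conv + batch norm + ReLU + dropout, then 2x2 max-pool."""
    def __init__(self, in_ch, out_ch, p_drop=0.2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Dropout2d(p_drop),
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.conv(x)              # kept for the skip connection
        return self.pool(skip), skip

class UpBlock(nn.Module):
    """Expanding-path block: transpose convolution, concatenation with encoder
    features, then a 3x3 convolution with batch norm and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch * 2, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)  # fuse high-resolution encoder features
        return self.conv(x)
```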

Fig. 6: Architecture of U-Net.

3). DeepLabV3:

DeepLabV3 is a convolutional neural network which uses atrous convolutions in either a cascaded or parallel fashion along with atrous spatial pyramid pooling, enabling the network to capture multi-scale context by using different atrous rates. The performance of DeepLabV3 matched that of other state-of-the-art models on the PASCAL VOC 2012 segmentation benchmark in 2017. In an ordinary convolutional neural network, pooling and striding cause a reduction in the resolution of the feature maps, and deconvolutional layers are usually used to upsample and recover spatial resolution. Instead, DeepLabV3 uses atrous convolutions [31], essentially convolutions with holes, to effectively enlarge the field of view of filters and improve context assimilation without increasing the number of operations and filter parameters.

Atrous Spatial Pyramid Pooling (ASPP) is the main reason for DeepLabV3’s impressive performance. It consists of four parallel atrous convolutions with different rates that are then applied to the feature map. The atrous convolutions in the pyramid are all followed by batch normalization. Global context is also incorporated into the model by applying global average pooling on the final feature map of the network followed by 1x1 convolution and batch normalization steps. This output is then upsampled bi-linearly to the desired spatial dimension.

Network Structure:

This network also uses ResNet101 as its backbone. The first few layers of this model have a structure similar to the FCN with four layers that have 3, 4, 23 and 3 bottleneck units respectively. The classifier that follows starts off with a 1x1 convolution with batch normalization and a ReLU activation function and this output is fed into the ASPP. The convolution operations in the pyramid are 3x3 with different dilation rates. This is followed by adaptive average pooling for global context and four convolution operations with batch normalization and ReLU activation steps. All convolutions are 1x1 except for the penultimate convolution which is a 3x3 operation.
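The sketch below illustrates the ASPP idea in PyTorch: a 1x1 branch, several parallel atrous (dilated) 3x3 convolutions with different rates, and an image-level pooling branch, whose outputs are concatenated and projected. The dilation rates and channel counts are illustrative assumptions; in practice a pre-trained model such as torchvision's deeplabv3_resnet101 can be adapted to three output classes in the same way as shown for FCN above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniASPP(nn.Module):
    """Simplified Atrous Spatial Pyramid Pooling: a 1x1 branch, parallel 3x3
    atrous convolutions at different dilation rates, and image-level pooling."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                           nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))] +
            [nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                           nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
             for r in rates])
        # Image-level feature: global average pooling followed by a 1x1 convolution
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                        nn.ReLU(inplace=True))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=x.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```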

Loss function:

All the networks described above were trained using a loss function based on Binary Cross Entropy (BCE), defined as

\text{BCE} = -\sum_{i}^{N} g_i \log p_i   (3)

where N is the number of pixels in the image, g_i ∈ {0, 1} is the ground truth label of pixel i, and p_i is the predicted probability for that pixel, obtained by applying the softmax operation to the output generated by the output layer of the network.

Dice Coefficient Score:

is a common metric for determining the performance of image segmentation methods [32]. It quantifies the overlap of a segmented image with the ground truth segmentation labels. In this paper, we use the Dice Coefficient as our evaluation metric to compare segmentation results, as it incorporates both precision and recall. The Dice Coefficient is defined as follows -

\text{dice}_{\text{coeff}} = \frac{2\,\lvert p_{\text{bin}} \cap g \rvert}{\lvert p_{\text{bin}} \rvert + \lvert g \rvert}   (4)

where p_bin ∈ {0, 1} is the binarized predicted mask, obtained by thresholding p_i at 0.5.

The final loss function is a weighted sum of the BCE and dice terms, where k is a manually tuned parameter. The BCE loss helps in increasing the confidence of the network in detecting true positives, whereas the dice loss penalizes the network for wrong positions of the predicted wound. As both are log losses, they are additive.

\text{Loss} = \text{BCE} - k \cdot \log(\text{dice}_{\text{coeff}})   (5)
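A minimal PyTorch sketch of this combined loss is shown below, written for a single foreground class with a sigmoid rather than the multi-class softmax described above, and using a soft (non-thresholded) dice term so the loss stays differentiable; the value of k is a placeholder since the tuned value is not reported.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits, target, k=1.0, eps=1e-6):
    """Combined loss of Eqs. 3-5: binary cross-entropy minus k * log(dice).

    logits: raw network outputs of shape (N, 1, H, W) for one class
    target: float ground-truth mask of the same shape with values in {0, 1}
    k=1.0 and the single-class sigmoid formulation are illustrative assumptions.
    """
    prob = torch.sigmoid(logits)
    bce = F.binary_cross_entropy(prob, target)

    # Soft dice: Eq. 4 thresholds at 0.5 for evaluation, but training uses
    # probabilities directly so that the term remains differentiable.
    intersection = (prob * target).sum()
    dice = (2.0 * intersection + eps) / (prob.sum() + target.sum() + eps)
    return bce - k * torch.log(dice)
```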
Post-processing:

The segmentation maps predicted by the networks are sometimes discontinuous and often require post-processing. Hence, the outputs are post-processed using a Conditional Random Field (CRF) with Gaussian edge potentials to improve segmentation accuracy [33]. A CRF is characterized by a Gibbs distribution, and the Gibbs energy of the graph G = (V, E) is defined as in Equation 1 without the higher order term.

For our implementation, the unary potential is defined as the negative log of the softmax output of the network. Thus when the output of the network for a given pixel is close to 1, the unary potential for the corresponding graph node is 0, whereas if the output is close to 0, the unary potential goes to infinity. As the unary and pairwise potentials are calculated independently, the labels predicted by the unary potential alone are significantly affected by noise. A pair-wise potential is devised to incorporate the association between neighboring pixels. The pairwise kernel is defined as in equation 6.

\phi^{p}_{i,j}(x_i, x_j) = \mu^{p}(x_i, x_j) \sum_{m=1}^{K} w_m\, k_m(f_i, f_j)   (6)

where µ^p is the Potts model compatibility function and the k_m(f_i, f_j) are Gaussian kernels defined as

k_m(f_i, f_j) = \underbrace{w_1 \exp\!\left(-\frac{\lvert p_i - p_j \rvert^2}{2\theta_\alpha^2} - \frac{\lvert I_i - I_j \rvert^2}{2\theta_\beta^2}\right)}_{\text{appearance kernel}} + \underbrace{w_2 \exp\!\left(-\frac{\lvert p_i - p_j \rvert^2}{2\theta_\gamma^2}\right)}_{\text{smoothness kernel}}   (7)

The appearance kernel associates pixels with similar color and penalizes pixels with large differences in color. It considers both pixel intensities in individual image channels I and their positions p. In our case, the image vector I has [R, G, B] pixel values from the input image, and is parameterized by θα and θβ. The smoothness kernel penalizes only based on the nearness of the pixels and is parameterized by θγ.
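A sketch of this post-processing step using the pydensecrf library is shown below: the unary term is the negative log of the network's softmax output, and Gaussian and bilateral pairwise terms play the roles of the smoothness and appearance kernels. The kernel widths stand in for theta_gamma, theta_alpha and theta_beta, which the paper tuned by grid search without reporting the values, so the numbers here are placeholders.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image_rgb, softmax_probs, n_iters=5):
    """Refine CNN softmax outputs with a dense CRF using Gaussian edge potentials.

    image_rgb:     (H, W, 3) uint8 image
    softmax_probs: (n_classes, H, W) per-pixel class probabilities
    The sxy/srgb/compat values are placeholders for the grid-searched thetas.
    """
    n_classes, H, W = softmax_probs.shape
    d = dcrf.DenseCRF2D(W, H, n_classes)
    d.setUnaryEnergy(unary_from_softmax(softmax_probs))   # -log of softmax output
    # Smoothness kernel: depends on pixel positions only (width ~ theta_gamma)
    d.addPairwiseGaussian(sxy=3, compat=3)
    # Appearance kernel: positions and RGB values (widths ~ theta_alpha, theta_beta)
    d.addPairwiseBilateral(sxy=60, srgb=10,
                           rgbim=np.ascontiguousarray(image_rgb), compat=5)
    Q = d.inference(n_iters)
    return np.argmax(Q, axis=0).reshape(H, W)
```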

E. Training the AHRF Model

AHRF uses gradient boosting techniques to optimize the unary potential and graph-cut algorithms to optimize the CRF graph. The Contrast Limited Adaptive Histogram Equalization (CLAHE) [28] pre-processing technique was found to increase the dice score of wound segmentation. Optimal parameters for CLAHE were found using a grid search on Dataset 1. The parameters used for the OpenCV implementation of CLAHE in our results are a kernel size of (24, 24) and a clip limit of 3.0. AHRF was trained on a multi-threaded high performance cluster with 20 CPUs and 100 GB of memory. The framework parallelizes feature extraction and utilized up to 40 threads.

F. Training the Semantic Segmentation Networks

All the networks utilize high resolution features from the convolution layers in learning the optimal up-sampling of the image. In our experiments, all images were resized to a standard dimension of 512 x 384 before being input to the network. As the images in the dataset were of varying dimensions and aspect ratios, we averaged the dimensions of all images and approximated them to the closest even values that maintain an aspect ratio of 4:3.

As the number of image samples in our datasets was inadequate for neural networks, a probabilistic data augmentation pipeline was implemented to generate synthetic augmentations using the albumentations library [34]. The geometric augmentations used include vertical flip, horizontal flip, random rotation, scaling and translation. To compensate for various lighting conditions, photometric augmentations such as CLAHE, random contrast and blurring were also added to the pipeline. At run time, every augmentation was chosen with a probability p. Only one augmentation from the set {CLAHE, random contrast, median blur, random brightness} was chosen, with a probability p = 0.5, and the remaining augmentations were each chosen with p = 0.5. This ensured that CLAHE and blurring, or contrast and blurring, were not performed on the same image (see Fig-8 and the sketch below).
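A sketch of such a pipeline with albumentations is given below. Exactly one of the photometric transforms is applied per image via OneOf, so CLAHE and blurring are never combined; the geometric parameter ranges are illustrative, and note that newer albumentations releases merge RandomContrast and RandomBrightness into RandomBrightnessContrast.

```python
import albumentations as A

# Probabilistic augmentation pipeline approximating the one described above.
train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    # Random rotation, scaling and translation in one geometric transform
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.1, rotate_limit=30, p=0.5),
    # Exactly one photometric augmentation is picked when this group fires
    A.OneOf([
        A.CLAHE(clip_limit=3.0),
        A.RandomContrast(),
        A.MedianBlur(blur_limit=5),
        A.RandomBrightness(),
    ], p=0.5),
])

# Applied jointly to the image and its segmentation mask during training:
# augmented = train_transform(image=image, mask=mask)
# image_aug, mask_aug = augmented["image"], augmented["mask"]
```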

Fig. 8: Sample image augmentations done online during training

The FCN and DeepLabV3 models we utilized were pre-trained on a subset of the COCO train2017 dataset, while U-Net was initialized with weights from the Carvana Image Masking Challenge. The networks were then fine-tuned on images from our wound datasets using Stochastic Gradient Descent (SGD). FCN and DeepLabV3 were trained for only 50 epochs as their superior initial weights made them converge quickly. U-Net was trained for more epochs (500–600) with early stopping. Six-fold cross-validation was used to evaluate the generalization of the networks. The models were implemented in PyTorch [35] and its built-in optimizers were used for the training process.

FCN and DeepLabV3 were trained on a High Performance Cluster (HPC) with a Tesla K40 and 2 Intel Xeons and took one day to train all folds. On the other hand, U-Net was trained on an i7 CPU with 32GB memory and a GTX1080Ti GPU and took 5 days to train. Two separate networks were trained for U-Net - one for classifying between wound vs non-wound pixels, and the other for classifying skin vs non-skin pixels. The masks of these two networks are combined at the end to generate a final segmentation mask. All inferences were run on the GTX1080Ti. As the Gaussian edge-based CRF model used for post-processing could not be optimized during back propagation of the network, the θα, θβ, θγ parameters were optimized separately using grid search.
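A minimal fine-tuning loop along these lines is sketched below. The batch size, learning rate and momentum are assumptions (the paper states only that SGD was used); criterion stands for the combined BCE and dice loss sketched earlier, and the dict-style model output follows the torchvision segmentation models.

```python
import torch
from torch.utils.data import DataLoader

def fine_tune(model, train_dataset, criterion, epochs=50, lr=1e-3, device="cuda"):
    """Sketch of SGD fine-tuning for a torchvision-style segmentation model.

    Batch size, learning rate and momentum are illustrative assumptions;
    `criterion` is any per-pixel loss, e.g. the combined BCE + dice loss above.
    """
    loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.to(device).train()
    for _ in range(epochs):
        for images, masks in loader:
            images, masks = images.to(device), masks.to(device)
            logits = model(images)["out"]     # torchvision models return a dict
            loss = criterion(logits, masks)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```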

G. Evaluation

All semantic segmentation methods were evaluated using k-fold cross validation over the entire dataset with k = 6. Performance on the test set of each fold is measured using the Dice coefficient score.
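The evaluation protocol can be sketched as follows; train_and_predict is a hypothetical placeholder for training any of the models on the training indices and returning predicted masks for the validation indices, and is not a function from the paper.

```python
import numpy as np
from sklearn.model_selection import KFold

def dice_score(pred_mask, gt_mask, eps=1e-6):
    """Dice coefficient between two binary masks (Eq. 4)."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    return (2.0 * inter + eps) / (pred_mask.sum() + gt_mask.sum() + eps)

def cross_validate(images, masks, train_and_predict, k=6):
    """Six-fold cross-validation sketch; `train_and_predict` is hypothetical."""
    fold_scores = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True).split(images):
        preds = train_and_predict(train_idx, val_idx)
        scores = [dice_score(p, masks[i]) for p, i in zip(preds, val_idx)]
        fold_scores.append(np.mean(scores))
    return np.mean(fold_scores), fold_scores
```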

V. Results and Discussions

1). Comparing Segmentation Inference Time:

AHRF is a graph optimization method and takes about 3–5 minutes to infer segmentation masks for a single 512x384 image on all three datasets (see the inference time column of Tables III to V). Although the graph optimization step is faster, the feature extraction and evaluation steps make inference in AHRF significantly slower. Consequently, it would be challenging to implement AHRF on mobile devices. CNNs, on the other hand, utilize a series of matrix multiplications and additions amenable to implementation on GPUs, which most smartphones are equipped with. FCN, U-Net and DeepLabV3 had average inference times of approximately 41, 50 and 56 milliseconds respectively across all three datasets.

TABLE III:

Results for Dataset 1 (95 Train, 19 Validation)

Model MAE (mm) Hausdorff (mm) Dice Wound Dice Skin Inference time
CLAHE + AHRF 0.0904 11.553 0.750 0.9060 3–5 min
U-Net 0.1880 13.983 0.490 0.7950 40 msec
U-Net + CRF 0.1552 13.165 0.520 0.8900 2 sec
CLAHE + U-Net 0.1523 13.085 0.532 0.8901 50 msec
CLAHE + U-Net + CRF 0.1231 12.365 0.591 0.8903 2 sec
FCN 0.0681 10.791 0.7822 0.9410 41 msec
CLAHE + FCN 0.0777 11.136 0.7667 0.9378 51 msec
DeepLabV3 0.0783 11.164 0.7625 0.9352 56 msec
CLAHE + DeepLabV3 0.0811 11.220 0.7595 0.9320 66 msec

TABLE V:

Results for Dataset 3 (1201 Train, 241 Validation)

Model MAE (mm) Hausdorff (mm) Dice Wound Dice Skin Inference time
CLAHE + AHRF 0.1235 12.377 0.6287 0.9016 3–5 min
U-Net 0.0832 11.372 0.733 0.9506 40 msec
U-Net + CRF 0.0830 11.369 0.734 0.9506 2 sec
CLAHE + U-Net 0.0875 11.479 0.7200 0.9474 50 msec
CLAHE + U-Net + CRF 0.0878 11.486 0.719 0.9473 2 sec
FCN 0.0464 10.092 0.8518 0.9532 41 msec
CLAHE + FCN 0.0559 10.878 0.8357 0.9518 51 msec
DeepLabV3 0.0379 9.764 0.8554 0.9617 56 msec
CLAHE + DeepLabV3 0.0534 10.631 0.8390 0.9578 66 msec

2). Comparing Segmentation Accuracy:

Dataset 1:

As observed in Table-III, AHRF is significantly more accurate than U-Net, by a difference of 0.159 in dice score on the wound segments of dataset 1. Both pre-processing (CLAHE) and post-processing (CRF) improve the segmentation performance of U-Net. However, even with these pre- and post-processing techniques, U-Net is not as accurate as AHRF. On the other hand, FCN and DeepLabV3 both outperform AHRF even with less data, which can be attributed to the models being pre-trained on a subset of COCO train2017 and then fine-tuned on our dataset. FCN outperforms DeepLabV3 by 0.0197 in dice score, because FCN is a lighter model and hence fits the data distribution slightly better than DeepLabV3. FCN and DeepLabV3 outperform AHRF by dice scores of 0.0322 and 0.0125 respectively.

Dataset 2:

As the dataset size increases, the networks generalize to the distinct features and textures that define a wound. As seen in Table-IV, U-Net has a slightly higher dice score than AHRF (i.e. it is more accurate). Pre-processing with CLAHE improved U-Net’s accuracy, although the improvement is smaller than that obtained for Dataset 1, and U-Net still underperforms FCN and DeepLabV3, with differences of 0.124 and 0.122 in dice score respectively. Ultimately, as the size of the training data increases, U-Net’s dependence on pre- and post-processing decreases as it learns better features. The performance of FCN and DeepLabV3 is not affected by pre- and post-processing due to their pre-trained weights and, in the case of DeepLabV3, its model architecture. FCN and DeepLabV3 outperformed AHRF by dice scores of 0.135 and 0.133 respectively.

TABLE IV:

Results for Dataset 2 (263 Train, 53 Validation)

Model MAE (mm) Hausdorff (mm) Dice Wound Dice Skin Inference time
CLAHE + AHRF 0.1072 11.969 0.706 0.8865 3–5 min
U-Net 0.1152 12.169 0.665 0.897 40 msec
U-Net + CRF 0.1147 12.156 0.667 0.897 2 sec
CLAHE + U-Net 0.1028 11.861 0.717 0.892 50 msec
CLAHE + U-Net + CRF 0.1023 11.848 0.716 0.895 2 sec
FCN 0.0645 10.907 0.8418 0.9342 41 msec
CLAHE + FCN 0.0744 11.356 0.8220 0.9262 51 msec
DeepLabV3 0.0662 10.949 0.8392 0.9330 56 msec
CLAHE + DeepLabV3 0.0720 11.318 0.8268 0.9278 66 msec

Dataset 3:

The third dataset, containing 1442 images, is roughly four times the size of dataset 2. Even though it has more variance (see Table-I, Table-II), the CNNs generalize to all types of wounds and generate segmentation masks close to the ground truth, whereas the performance of AHRF decreases slightly. As observed in Fig-13 - sample 1, AHRF tends to get confused on the same image as the variations in the dataset increase, making it less robust. The CNNs also generate better segmentation masks for smaller wounds, as seen in Fig-9 - sample 4. U-Net has a significantly higher dice score than AHRF, with a margin of 0.106 in dice coefficient, and does not require any pre/post-processing. AHRF is observed to over-segment and often performs poorly on edges and wounds with difficult textures. FCN and DeepLabV3 still outperform U-Net by dice scores of 0.117 and 0.121 respectively, which highlights the impact of using pre-trained models. DeepLabV3, a deeper model, outperforms FCN as dataset 3 has significantly more data for it to work with.

Fig. 13: Comparison of AHRF, U-Net, FCN, DeepLabV3 and their accuracy trends on all 3 datasets. Sample 1 shows how the accuracy of AHRF improves from Dataset 1 to Dataset 2 but then decreases when more, noisier data is added in Dataset 3. The deep learning networks, on the other hand, show consistent improvement as more data is added. The samples demonstrate how skin pixels are segmented more accurately than the wound segment because of the huge class imbalance in the data.

Fig. 9: Performance of AHRF, U-Net, FCN and DeepLabV3 trained on Dataset 3 for segmenting a variety of images with different colors, textures and lighting conditions of skin, wound and background

a). Common validation dataset:

In order to draw a final conclusion on the accuracy of all the segmentation methods, we compared their segmentation accuracy on a common validation set after training on datasets 1, 2 and 3 respectively (see Table-VI). It can be concluded from this table that the accuracy of the deep learning models increases as more data samples are added, while the performance of AHRF, a graph based segmentation method, remains the same or sometimes worsens.

TABLE VI:

Common Validation Set - WOUND

Model Dataset 1 Dataset 2 Dataset 3
AHRF 0.673 0.687 0.675
U-Net 0.416 0.717 0.784
FCN 0.7822 0.8453 0.859
DeepLabV3 0.7625 0.8537 0.876

Average dice coefficients for common images in all 3 datasets.

3). Model Robustness to wound colors in background:

In many wound imaging situations, colors found in many wounds, such as red and yellow, may appear in the background by accident. Thus, it is important to compare how robust the segmentation methods are when such colors appear in the background (i.e. whether they avoid detecting those background colors as part of the wound). Since the networks are pre-trained and are being fine-tuned on the wound segmentation task, the network tries to learn the most prominent features of the wound first. It can be clearly observed in Fig-13 - sample 3, dataset 1, that U-Net initially (on smaller datasets) tends to classify any red color in the wound image as belonging to the wound segment. This can be justified from Table-II, which shows that the mean value of the Red channel of the wound segment of dataset 1 is higher than the Blue and Green channels. However, as U-Net is trained on more data, it starts to learn and rely on texture information as well. This can be seen in Fig-13 sample 3, where U-Net does not confuse the red cloth in the top left corner with the wound when trained on dataset 3. FCN and DeepLabV3 do not face this issue as they utilize pre-trained weights, alleviating their dependence on color alone. AHRF, on the other hand, uses hand-crafted features and is more robust to wound colors in the background. It requires fewer images to achieve its performance limits and thus does not confuse the red cloth with the wound irrespective of which dataset it has been trained on. This shows that handcrafted features help AHRF understand textures better than U-Net when trained on smaller datasets, but due to the information contained in their initial weights, FCN and DeepLabV3 already take textures into consideration.

4). Effect of class imbalance:

We compared the accuracy of the CNNs and AHRF for wound images with varying sizes of wound and skin segments. It can be observed in Sample 4 of Fig-9 how detection of skin pixels (larger segments) is better than that of the wound segment (smaller) for the networks because of the huge class imbalance in data. This trend is not observed for AHRF because AHRF is trained jointly for all three classes. Hence, the wound classifier can utilize the information learned for skin. For example, areas not classified as skin but surrounded by skin automatically get a higher probability of belonging to the wound class.

5). Sensitivity to the Relative Proportion of the Wound Segment:

The sensitivity of segmentation to changes in the proportion of the image covered by the wound is studied for all three datasets. Figures 10, 11, and 12 show the accuracy of AHRF and the CNNs as the wound size varies, in the form of box plots. The box plots include information on both the mean Dice score as well as its variation across the folds; the width of each box shows how stable the reported mean Dice score is across folds. Dice score variance is shown as a function of the percentage of wound pixels in the images, where the wound percentage is defined as the ratio of the number of wound pixels to the total number of pixels. Due to the connective property of AHRF, which results from its clique potential, it fails to work well on images that have small wounds because neighboring skin pixels cause a small wound to also be classified as skin. The deep learning networks do not face this problem and work well on wounds of a small size.

Fig. 10: Box plots of percentage of wound pixels vs dice coefficient for dataset 1

Fig. 11: Box plots of percentage of wound pixels vs dice coefficient for dataset 2

Fig. 12: Box plots of percentage of wound pixels vs dice coefficient for dataset 3

The box plot in Fig-10 shows that AHRF performs better than U-Net even with large variations in the wound size for dataset 1, while FCN and DeepLabV3 match its performance. The CNNs fail to detect wounds smaller than 10% of the wound image, whereas AHRF generates some partial segmentations. The box plots in Fig-11 and Fig-12 show increasing accuracy as the wound size increases for datasets 2 and 3. The height of the boxes shows the variance in the performance of all architectures. It can be observed that images with more than 5% wound pixels have better results for all the architectures. This result can be used to create a guideline for capturing usable wound images, or for cropping the images in a pre-processing phase, by keeping the wound percentage above 5%. For instance, the photographer can be asked to retake (or zoom in on) images in which the wound percentage is less than 5%, as sketched below.
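A minimal sketch of such a capture guideline check, assuming a binary wound mask is available (from a ground-truth annotation or a preliminary segmentation):

```python
import numpy as np

def wound_percentage(mask):
    """Percentage of wound pixels in a binary wound mask."""
    return 100.0 * np.count_nonzero(mask) / mask.size

def needs_retake(mask, min_percent=5.0):
    """Flag images whose wound occupies too little of the frame, per the 5% guideline."""
    return wound_percentage(mask) < min_percent
```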

6). Segmentation Accuracy for wounds with different wound attributes and skin types:

As seen in Fig-9, both AHRF and the CNNs have shown good generalizability to various wound tissue types, skin colors and lighting conditions. Granulation, slough and necrotic tissue are different types of wound tissue that differ in color and texture. Both AHRF and the networks have shown good segmentation results on wounds containing a combination of these tissues. The networks also generalize well to darker skin tones and bad lighting conditions.

VI. Discussions and Conclusion

In this work, a comprehensive, systematic analysis of semantic segmentation of smartphone camera captured wound images using AHRF, FCN, U-Net and DeepLabV3 has been performed. All segmentation methods achieve good results that generalize well to wound images with various skin and wound tissue types and background clutter. However, due to differences between the two approaches (AHRF vs deep learning), some trade-offs have to be considered before deciding on a model for practical purposes.

AHRF had increased segmentation accuracy when input images were pre-processed using CLAHE.

CLAHE pre-processing with U-Net showed improvements only for smaller datasets. CRF post-processing also improved the accuracy of U-Net on smaller datasets. Pre- and post-processing did not change the performance of FCN and DeepLabV3.

AHRF is more accurate and generalizes better than U-Net for small datasets (< 300 images) but is outperformed by fine-tuned FCN and DeepLabV3 models pre-trained on a subset of COCO:

AHRF has more reliable predictions because it depends on texture features and not just color. Its hand-crafted visual features also enable it to learn wound features with fewer images. U-Net, on the other hand, performed moderately well for segmenting skin pixels but not wound pixels on Dataset 1 (the smallest dataset). FCN and DeepLabV3 performed well in segmenting both skin and wound pixels across all 3 datasets.

CNNs are more accurate for larger datasets (> 300 images).

As the size of the dataset increases, the segmentation accuracy of the deep learning networks increases while that of AHRF saturates after a point and sometimes even worsens with the addition of more training data. As FCN, U-Net and DeepLabV3 have many more trainable parameters than AHRF, they are able to absorb and utilize more data and generalize better. They also show better performance on smaller wound sizes compared to AHRF. This is because AHRF has a region growing property, due to its pairwise and clique potentials, which causes smaller wounds to sometimes become part of the surrounding skin clique and be wrongly segmented as skin.

CNNs have a considerably faster inference time than AHRF:

mainly because AHRF uses many hand-crafted features and clustering techniques, which take time to compute. In our experiments, AHRF took about 4–5 minutes to segment one image while FCN, U-Net and DeepLabV3 could segment the same image in about 40, 50 and 60 milliseconds respectively. This makes the networks a more viable option for implementation on mobile devices, where resources are constrained, especially if real-time segmentation is required. The long inference time of AHRF makes it difficult to use even in a client-server scenario, as a network connection would probably time out before segmentation is complete.

Initial weights of deep learning approaches make a considerable difference:

U-Net generally outperforms FCNs, but FCN outperforms U-Net in our experiments by a margin of 0.075 for dataset 3, as seen in Table-VI. FCN and DeepLabV3 were initialized with pre-trained weights from COCO train2017, while U-Net was initialized with weights from the Carvana Image Masking Challenge. Using these weights for U-Net was better than using random initialization, but they are still no match for the COCO train2017 weights. DeepLabV3 outperformed FCN by a margin of 0.017 for dataset 3 in Table-VI.

VII. Future Work

One possible future direction for this research is experimenting further with lighting variations and performing an error analysis of the various factors that affect the segmentation performance of AHRF and the deep learning models. The models can be made more robust by using Generative Adversarial Networks (GANs) [36] to synthesize more training data. More effective image pre-processing approaches such as auto-augmentation [37], which trains a neural network to decide on the best possible pre-processing step for a given input image, can also be used. Finally, parallelizing AHRF to make it faster, especially the feature extraction, might be a fruitful direction for further research.

Fig. 1: Wound image (left), pixel-wise segmentation mask for wound, skin and background (right)

Fig. 5: Architecture of FCN.

Fig. 7: Architecture of DeepLabV3.

Acknowledgment

This work was funded by the NIH/NIBIB grant number 1R01EB025801-01. The authors would like to acknowledge computational resources supported by the Academic & Research Computing group at Worcester Polytechnic Institute, for access to the Turing high performance cluster acquired through NSF MRI grant DMS-1337943 to WPI.

This work was supported in part by NIH/NIBIB under Grant 1R01EB025801-01. Authors Ameya Wagh and Shubham Jain contributed equally to this work.

Biographies


Shubham Jain received the B.Tech Degree from the Indian Institute of Technology Kanpur, India in 2017 and M.S. degree in robotics engineering from the Worcester Polytechnic Institute, Worcester, MA, USA, in 2019. He currently works as a Computer Vision Engineer with NVIDIA. His current research interests include autonomous driving and related computer vision problems


Ameya Wagh received the M.S. degree in robotics engineering from the Worcester Polytechnic Institute, Worcester, MA, USA, in 2019. He currently works as a Software Engineer with TORC Robotics. His current research interests include computer vision and deep learning.


Apratim Mukherjee is currently pursuing his Bachelor’s degree in Computer Science and Engineering from Manipal Institute of Technology, India and will graduate in the summer of 2020. His research interests lie in the fields of robotics and computer vision.


Ziyang Liu is currently pursuing the Ph.D. degree in computer science with the Worcester Polytechnic Institute, MA, USA. His current research interests include computer vision and deep learning.


Emmanuel Agu received the Ph.D. degree in electrical and computer engineering from the University of Massachusetts Amherst, Amherst, MA, USA, in 2001. He is currently a Professor with the Computer Science Department, Worcester Polytechnic Institute, Worcester, MA, USA. He has been involved in research in mobile and ubiquitous computing for over 16 years. He is currently working on mobile health projects to assist patients with diabetes, obesity, and depression.


Peder C. Pedersen (S’74–M’76–SM’87) received the B.S. degree in electrical engineering from Aalborg Engineering College, Aalborg, Denmark, in 1971, and the M.E. and Ph.D. degrees in bioengineering from the University of Utah, Salt Lake City, UT, USA, in 1974 and 1976, respectively. In October 1987, he joined the faculty at Worcester Polytechnic Institute, Worcester, MA, USA, where he is now an emeritus Professor in the Electrical and Computer Engineering Department. Previously, he was an Associate Professor in the Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA. His research areas include elastography methods for quantitative imaging of the Young’s modulus in soft tissues and the development of a low-cost, portable personal ultrasound training simulator with structured curriculum and integrated assessment methods, to satisfy the training needs of the widely used point-of-care scanners. Another research effort has been the design of a smartphone-based diabetic wound analysis system, specifically for foot ulcers.


Diane Strong received the B.S. degree in mathematics and computer science from the University of South Dakota, Vermillion, SD, USA, in 1974, the M.S. degree in computer and information science from the New Jersey Institute of Technology, Newark, NJ, USA, in 1978, and the Ph.D. degree in information systems from the Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA, USA, in 1989. Since 1995, she has been a Professor with the Worcester Polytechnic Institute, Worcester, MA, USA, where she is currently a Full Professor with the Foisie School of Business and also the Director of Information Technology Programs. She is also a member of the Faculty Steering Committee of WPI’s Healthcare Delivery Institute. Her research has been concerned with effective use of IT in organizations and by individuals. Since 2006, she has focused on effectively using IT to promote health and support healthcare delivery.


Bengisu Tulu received the Ph.D. degree in management of information systems and technology from Claremont Graduate University, CA, USA. She is currently an Associate Professor with the Foisie Business School, Worcester Polytechnic Institute, Worcester, MA, USA. She is one of the founding members of the Healthcare Delivery Institute, WPI. Her research interest includes development and implementation of health information technologies and the impact of these implementations on healthcare organizations and consumers.


Clifford Lindsay received the B.S. degree in computer science from the University of California, San Diego, San Diego, CA, USA, in 2001, and the Ph.D. degree in computer science from the Worcester Polytechnic Institute, Worcester, MA, USA, in 2011. He is currently an Assistant Professor with the UMass Medical School. He is also working on applying computer vision and image processing methods to improve the quality of medical images.

REFERENCES

[1] Beck J, Greenwood DA, Blanton L, Bollinger ST, Butcher MK, Condon JE, Cypress M, Faulkner P, Fischl AH, Francis T, Kolb LE, Lavin-Tompkins JM, MacLeod J, Maryniuk M, Mensing C, Orzeck EA, Pope DD, Pulizzi JL, Reed AA, Rhinehart AS, Siminerio L, and Wang J, “2017 National Standards for Diabetes Self-Management Education and Support,” Diabetes Educator, vol. 44, no. 1, pp. 35–50, 2018.
[2] Sen CK, Gordillo GM, Roy S, Kirsner R, Lambert L, Hunt TK, Gottrup F, Gurtner GC, and Longaker MT, “Human skin wounds: A major and snowballing threat to public health and the economy,” Wound Repair and Regeneration, vol. 17, pp. 763–771, 2009.
[3] Houghton PE, Kincaid CB, Campbell KE, Woodbury MG, and Keast DH, “Photographic assessment of the appearance of chronic pressure and leg ulcers,” Ostomy/Wound Management, vol. 46, no. 4, pp. 20–26, 28–30, 2000.
[4] Murphy CA, Houghton P, Brandys T, Rose G, and Bryant D, “The effect of 22.5 kHz low-frequency contact ultrasound debridement (LFCUD) on lower extremity wound healing for a vascular surgery population: A randomised controlled trial,” International Wound Journal, vol. 15, no. 3, pp. 460–472, 2018.
[5] Haugland H, Hisdal J, Sundby ØH, Lundgaard E, Irgens I, Sandbæk G, Høiseth LØ, Weedon-Fekjær H, Mathiesen I, and Sundhagen JO, “Intermittent mild negative pressure applied to the lower limb in patients with spinal cord injury and chronic lower limb ulcers: a crossover pilot study,” Spinal Cord, vol. 56, no. 4, pp. 372–381, 2018.
[6] Sheehan P, Jones P, Caselli A, Giurini JM, and Veves A, “Percent change in wound area of diabetic foot ulcers over a 4-week period is a robust predictor of complete healing in a 12-week prospective trial,” Diabetes Care, vol. 26, no. 6, pp. 1879–1882, 2003.
[7] Krizhevsky A, Sutskever I, and Hinton GE, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems (NIPS), 2012.
[8] Chen L-C, Papandreou G, Schroff F, and Adam H, “Rethinking Atrous Convolution for Semantic Image Segmentation,” arXiv e-prints, June 2017.
[9] Girshick R, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
[10] Long J, Shelhamer E, and Darrell T, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[11] Ronneberger O, Fischer P, and Brox T, “U-Net: Convolutional networks for biomedical image segmentation,” in Lecture Notes in Computer Science, vol. 9351, pp. 234–241, 2015.
[12] Shenoy V, Foster E, Aalami L, Majeed B, and Aalami O, “Deepwound: Automated Postoperative Wound Assessment and Surgical Site Surveillance through Convolutional Neural Networks,” arXiv e-prints, July 2018.
[13] Goyal M, Reeves ND, Davison AK, Rajbhandari S, Spragg J, and Yap MH, “DFUNet: Convolutional Neural Networks for Diabetic Foot Ulcer Classification,” IEEE Transactions on Emerging Topics in Computational Intelligence, pp. 1–12, 2018.
[14] Lin T-Y, Maire M, Belongie SJ, Bourdev LD, Girshick RB, Hays J, Perona P, Ramanan D, Dollár P, and Zitnick CL, “Microsoft COCO: Common objects in context,” in Proceedings of the European Conference on Computer Vision (ECCV), 2014.
[15] Ladický L, Russell C, Kohli P, and Torr PH, “Associative hierarchical random fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 6, pp. 1056–1077, 2014.
[16] He K, Zhang X, Ren S, and Sun J, “Deep Residual Learning for Image Recognition,” arXiv e-prints, December 2015.
[17] Perez AA, Gonzaga A, and Alves JM, “Segmentation and analysis of leg ulcers color images,” in Proceedings of the International Workshop on Medical Imaging and Augmented Reality (MIAR), 2001.
[18] Mukherjee R, Manohar DD, Das DK, Achar A, Mitra A, and Chakraborty C, “Automated tissue classification framework for reproducible chronic wound assessment,” BioMed Research International, vol. 2014, 2014.
[19] Wang L, Pedersen PC, Agu E, Strong DM, and Tulu B, “Area Determination of Diabetic Foot Ulcer Images Using a Cascaded Two-Stage SVM-Based Classification,” IEEE Transactions on Biomedical Engineering, vol. 64, no. 9, pp. 2098–2109, 2017.
[20] Yadav MK, Manohar DD, Mukherjee G, and Chakraborty C, “Segmentation of Chronic Wound Areas by Clustering Techniques Using Selected Color Space,” Journal of Medical Imaging and Health Informatics, 2013.
[21] Shih H-F, Ho T-W, Hsu J-T, Chang C-C, Lai F, and Wu J-M, “Surgical wound segmentation based on adaptive threshold edge detection and genetic algorithm,” in Proceedings of SPIE, February 2017.
[22] Goyal M, Reeves ND, Rajbhandari S, Spragg J, and Yap MH, “Fully convolutional networks for diabetic foot ulcer segmentation,” in 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2017, pp. 618–623.
[23] Liu X, Wang C, Li F, Zhao X, Zhu E, and Peng Y, “A framework of wound segmentation based on deep convolutional networks,” in 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), 2017, pp. 1–7.
[24] Goyal M, Reeves N, Rajbhandari S, and Yap MH, “Robust Methods for Real-time Diabetic Foot Ulcer Detection and Localization on Mobile Devices,” IEEE Journal of Biomedical and Health Informatics, pp. 1–12, 2018.
[25] Li F, Wang C, Liu X, Peng Y, and Jin S, “A Composite Model of Wound Segmentation Based on Traditional Methods and Deep Neural Networks,” Computational Intelligence and Neuroscience, vol. 2018, no. 1, 2018.
[26] Maninis K-K, Caelles S, Pont-Tuset J, and Van Gool L, “Deep extreme cut: From extreme points to object segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 616–625.
[27] Wang L, Pedersen PC, Strong DM, Tulu B, Agu E, and Ignotz R, “Smartphone-based wound assessment system for patients with diabetes,” IEEE Transactions on Biomedical Engineering, 2015.
[28] Zuiderveld K, “Contrast Limited Adaptive Histogram Equalization,” in Graphics Gems, 1994.
[29] Pizer SM, Amburn EP, Austin JD, Cromartie R, Geselowitz A, Greer T, ter Haar Romeny B, Zimmerman JB, and Zuiderveld K, “Adaptive histogram equalization and its variations,” Computer Vision, Graphics, and Image Processing, 1987.
[30] Everingham M, Van Gool L, Williams CKI, Winn J, and Zisserman A, “The PASCAL Visual Object Classes (VOC) Challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, June 2010.
[31] Chen L-C, Papandreou G, Kokkinos I, Murphy K, and Yuille AL, “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs,” arXiv e-prints, June 2016.
[32] Dice LR, “Measures of the Amount of Ecologic Association Between Species,” Ecology, vol. 26, no. 3, pp. 297–302, 1945.
[33] Krähenbühl P and Koltun V, “Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials,” in Advances in Neural Information Processing Systems (NIPS), 2011.
[34] Buslaev A, Parinov A, Khvedchenya E, Iglovikov VI, and Kalinin AA, “Albumentations: fast and flexible image augmentations,” arXiv e-prints, 2018.
[35] Paszke A, Chanan G, Lin Z, Gross S, Yang E, Antiga L, and DeVito Z, “Automatic differentiation in PyTorch,” in 31st Conference on Neural Information Processing Systems (NIPS), 2017, pp. 1–4.
[36] Radford A, Metz L, and Chintala S, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks,” arXiv e-prints, 2015.
[37] Cubuk ED, Zoph B, Mane D, Vasudevan V, and Le QV, “AutoAugment: Learning Augmentation Policies from Data,” arXiv e-prints, 2018.
