Abstract.
Purpose
The development of accurate methods for retinal layer and fluid segmentation in optical coherence tomography images can help ophthalmologists in the diagnosis and follow-up of retinal diseases. Recent works based on joint segmentation have presented good results for the segmentation of most retinal layers, but the fluid segmentation results are still not satisfactory. We report a hierarchical framework that starts by distinguishing the retinal zone from the background, then separates the fluid-filled regions from the rest, and finally discriminates the individual retinal layers.
Approach
Three fully convolutional networks were trained sequentially. The weighting scheme used for computing the loss function during training is derived from the outputs of the networks trained previously. To reinforce the relative position between retinal layers, the mutex Dice loss (included for optimizing the last network) was further modified so that errors between more “distant” layers are more penalized. The method’s performance was evaluated using a public dataset.
Results
The proposed hierarchical approach outperforms previous works in the segmentation of the inner segment ellipsoid layer and fluid (Dice coefficient = 0.95 and 0.82, respectively). The results achieved for the remaining layers are at a state-of-the-art level.
Conclusions
The proposed framework led to significant improvements in fluid segmentation, without compromising the results in the retinal layers. Thus, its output can be used by ophthalmologists as a second opinion or as input for automatic extraction of relevant quantitative biomarkers.
Keywords: retina, optical coherence tomography, fluid segmentation, retinal layer segmentation, hierarchical framework, computer-aided diagnosis
1. Introduction
Optical coherence tomography (OCT) systems have revolutionized the clinical practice of ophthalmology, particularly in the diagnosis and follow-up of retinal diseases, because they provide close to an in vivo “optical biopsy” of the retina1 in a fast, accurate, and non-invasive way. Tissue reflectivity properties are inferred from the interference signal formed between the light reflected from a reference mirror and the light backscattered from the biological tissue.2 Contrary to other imaging techniques, OCT provides direct information about the integrity and morphology of the retinal layers (Fig. 1) and allows the extraction of relevant biomarkers, such as the retinal thickness and the total area occupied by fluid. Such biomarkers are commonly used by ophthalmologists for disease detection and grading, as well as treatment planning and follow-up.
Fig. 1.
Example of an OCT B-scan showing signs of DME.3 The boundaries of the retinal layers and fluid-filled regions are identified with different colored lines. Retinal layers/membranes: ILM, inner limiting membrane; NFL, nerve fiber layer; GCL, ganglion cell layer; IPL, inner plexiform layer; INL, inner nuclear layer; OPL, outer plexiform layer; ONL, outer nuclear layer; ISM, inner segment myeloid; ISE, inner segment ellipsoid; OS, outer segment; RPE, retinal pigment epithelium; and BM, Bruch’s membrane.
Because the extraction of these biomarkers relies on the segmentation of retinal layers and fluid-filled regions in OCT images, and manual segmentation is often a time-consuming and subjective process,4 several automated methods have been developed to support ophthalmologists in this task. The first methods developed for this purpose were mainly based on machine learning algorithms. For instance, Chiu et al.3 applied a kernel regression-based classification method to estimate the fluid and retinal layer positions and then used a graph theory and dynamic programming framework to obtain a more accurate segmentation of the retinal layer boundaries. A similar approach was proposed by Karri et al.5 In this case, structured random forests were applied to simultaneously identify individual layers and their corresponding edges. The spatial consistency between consecutive OCT B-scans was also improved by adding specific constraints to the dynamic programming-based segmentation method. A novel neutrosophic transformation and a graph-based shortest path method were also applied by Rashno et al.6 to segment fluid-filled regions. One of the main limitations of these methods is that they are not robust to large deformations in the retina [e.g., cases of advanced diabetic macular edema (DME)].
With the development of deep learning, convolutional neural networks (CNNs) also started to be applied to segmentation tasks. Inspired by Long et al.,7 who proposed a fully convolutional network, Ronneberger et al.8 developed the U-Net specifically for biomedical image segmentation. Due to its training advantages, the U-Net and its variants, such as the ReLayNet,9 are the most commonly used networks for OCT image segmentation. The latter was specifically designed for the joint segmentation of multiple retinal layers and fluid-filled regions.
Recently, methods that combine CNNs with conventional algorithms have also been proposed. For instance, Liu et al.4 integrated deep features, extracted by a residual network, and handcrafted features to train a structured random forests classifier. Alternatively, a regression-based segmentation approach using CNNs was proposed by Kepp et al.10 In this method, shapes were represented by signed distance maps, which assign to each pixel the distance to the next object contour. In turn, He et al.11 proposed a unified deep learning framework that models the distribution of the surface positions while deep features are being extracted. A novel anatomical-aware dual-branch cascaded deep neural network is also presented in the work of Ma et al.12 Some three-dimensional hybrid approaches have also been proposed for segmenting the retinal layers and fluid-filled regions simultaneously.13
Although most of the methods developed for retinal layer and fluid segmentation are trained and evaluated on normal OCT exams and on OCT exams with signs of DME, there are some methods designed to handle abrupt changes in attenuation coefficients within a layer and to analyze topology-disrupting anomalies that are typical of other pathologies (like central serous retinopathy and age-related macular degeneration).14,15
In recent years, several novel deep learning architectures have also been proposed exclusively for fluid segmentation.16–18 This may be a consequence of the RETOUCH challenge,19 which was organized in 2017 to promote research in fluid detection and segmentation in OCT volumes. In this challenge, all participating teams proposed deep learning-based methods, and most of them were variants of fully convolutional networks. The main differences between the proposed approaches were related to the pre/postprocessing applied, whether the retinal zone was segmented beforehand, and the training details.
During the design of a deep learning model, besides the network architecture, the loss function used for training also has a strong impact on performance. Although the weighted cross-entropy (WCE) loss8 and the Dice loss (DL)20 are among the most used loss functions for the segmentation of retinal layers and fluid-filled regions in OCT images, they do not fully consider the relative position between the multiple retinal layers. Based on that, Wei and Peng21 proposed a modified DL, called mutex DL (MDL), to take into account the relationship among different layers. Although the authors adapted this loss to penalize the errors associated with the “fluid” class, they did not consider the “distance” between retinal layer classes.
In this work, a new hierarchical approach based on deep learning is proposed for retinal layer and fluid segmentation. It relies on the idea that a complex problem can be solved by a model whose training has been “guided” by the output of other models previously trained to accomplish simpler tasks. This approach therefore starts by discriminating the retinal zone from the background, then separates the fluid-filled regions from the rest, and, in the end, differentiates the seven retinal layers considering the spatial relationship among them. Because the different retinal layers visualized in the OCT images appear in a specific order (which is related to the eye anatomy), a modification of the MDL function is also implemented, so that errors between more “distant” layers are more penalized during the network training. The aforementioned approach is also compared in terms of performance with the following two methods: (1) an alternative hierarchical method that, after discriminating the total retina from the background, separates the fluid-filled regions from the retinal layers and background, instead of simply splitting the fluid-filled regions from the remaining parts of the image and (2) a joint segmentation method that uses a single model for addressing this segmentation problem and, therefore, requires only one training step.
The main contributions of this work are as follows:
1. a modified version of the MDL (MMDL) function that takes into account prior knowledge about the anatomical position of the several retinal layers;
2. a novel hierarchical framework based on deep learning for fluid and retinal layer segmentation in OCT images.
2. Methodology
The three segmentation methods herein presented for retinal layer and fluid segmentation in OCT images are fully described in this section. While Sec. 2.1 presents a description of the joint segmentation approach (JSA), which segments the retinal layers and fluid-filled regions using a single model whose training does not rely on the output of any other model, Sec. 2.2 describes the two proposed hierarchical segmentation approaches (HSAs). Both HSAs require the sequential training of three different models, such that the output of the models trained first is used for guiding the training of the following one(s). The two variants differ in the task for which the second model is designed (fluid and retinal layer segmentation or just fluid segmentation), as well as in the way the loss function used for training the third model is computed.
The models’ architecture, the modifications made to the MDL, and the training details are also presented throughout this section.
2.1. Joint Segmentation of Retinal Layers and Fluid-Filled Regions
In this approach, the retinal layers and the fluid-filled regions are jointly segmented using a single neural network. Each image pixel is then associated with one of nine image regions: background (0), fluid (1), or one of the seven retinal layers (2–8) (Fig. 2). The architecture chosen for this network corresponds to the one proposed by Wei and Peng21 (Fig. 3).
Fig. 2.
General scheme of the JSA. The colors of the boxes used for representing the output classes of the model are in agreement with the colors used for showing the final result of the model (image on the right).
Fig. 3.
Network architecture proposed by Wei and Peng.21
As a classical encoder-decoder architecture, it includes a contracting path, an expansive path, and skip connections between these two paths. Each encoder block of the contracting path consists of a convolutional layer followed by a pooling layer. In turn, the decoder blocks include an unpooling layer, a convolutional layer and a depth max pooling (DMP) layer, which extends the max pooling operation into the channel direction to get the maximum values between two feature maps (one from an encoder block and another from the corresponding decoder block) along that direction. The replacement of the “standard” concatenation by this DMP operation allows achieving a good trade-off between classification and localization accuracy with a smaller number of training parameters.
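As a rough illustration, the DMP operation reduces to an element-wise maximum between an encoder feature map and the corresponding decoder feature map. A minimal PyTorch sketch follows (not the authors' implementation; matching tensor shapes are assumed):

```python
import torch
import torch.nn as nn

class DepthMaxPooling(nn.Module):
    """Depth max pooling (DMP): replaces the usual skip-connection
    concatenation by an element-wise maximum taken along the channel
    ("depth") direction between paired feature maps."""

    def forward(self, enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
        # Both tensors are assumed to have shape (batch, channels, H, W).
        return torch.maximum(enc_feat, dec_feat)
```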
The loss function used for training the model included in the JSA [Eq. (1)] results from the combination of preexisting loss functions or their modified versions. More concretely, it corresponds to a weighted sum of the WCE loss [Eq. (2)], the DL [Eq. (4)], a MMDL [Eq. (5)], and an additional term that regularizes the weights of the convolutional layers (it relies on the Frobenius norm of the weights’ matrix, $\Vert W\Vert_F$):

$$\mathcal{L}_{M_0} = \lambda_1\,\mathcal{L}_{\mathrm{WCE}} + \lambda_2\,\mathcal{L}_{\mathrm{DL}} + \lambda_3\,\mathcal{L}_{\mathrm{MMDL}} + \lambda_4\,\Vert W\Vert_F^2. \tag{1}$$
The WCE loss8 is computed as follows:

$$\mathcal{L}_{\mathrm{WCE}} = -\frac{1}{N}\sum_{i=1}^{N}\omega_i\sum_{c=0}^{C-1} g_{i,c}\,\log(p_{i,c}), \tag{2}$$

where $p_{i,c}$ represents the estimated probability of pixel $i$ belonging to class $c$, $g_{i,c}$ is 1 or 0 depending on whether the true label of pixel $i$ is $c$ or not, $\omega_i$ is the weight assigned to that pixel, and $N$ is the number of pixels in the training set. Assuming that $c_i$ is the true label of pixel $i$, the value of $\omega_i$ is calculated using the expression

$$\omega_i = \frac{N}{C\,N_{c_i}}, \tag{3}$$

where $C$ represents the total number of class labels to be considered by the model, and $N_c$ is the number of pixels in the training set that belong to class $c$. In this case, $C$ is equal to nine, as there are nine classes to discriminate.
In turn, the DL20 is given as

$$\mathcal{L}_{\mathrm{DL}} = 1 - \frac{1}{C}\sum_{c=0}^{C-1}\frac{2\sum_{i=1}^{N} p_{i,c}\,g_{i,c} + \epsilon}{\sum_{i=1}^{N} p_{i,c} + \sum_{i=1}^{N} g_{i,c} + \epsilon}, \tag{4}$$

and it corresponds to one minus the Dice coefficient (DC), which measures the overlap between the predictions and the ground truth (GT). $\epsilon$ is a term used for avoiding division by zero.
Because the WCE loss8 and the DL20 do not explicitly consider the spatial relationship that exists among retinal layers, a MMDL is also included in Eq. (1). Unlike the authors of the original MDL, who left open the choice of the penalty factors that are applied to the different classification errors, we propose here a different way of choosing those values so that errors between more “distant” layers are more penalized during the network training.
The MMDL can then be defined as

$$\mathcal{L}_{\mathrm{MMDL}} = \sum_{c_1=0}^{C-1}\sum_{c_2=0}^{C-1} v_{c_1 c_2}\,\frac{2\sum_{i=1}^{N} g_{i,c_1}\,p_{i,c_2}}{\sum_{i=1}^{N} g_{i,c_1} + \sum_{i=1}^{N} p_{i,c_2}}, \tag{5}$$

where $v_{c_1 c_2}$ represents the factor that is used for penalizing an error between classes $c_1$ and $c_2$, $g_{i,c_1}$ is 1 or 0 depending on whether the true label of pixel $i$ is $c_1$ or not, and $p_{i,c_2}$ is the probability of the pixel $i$ belonging to class $c_2$. The mean normalized distance $D_{c_1 c_2}$ between classes $c_1$ and $c_2$ is computed as follows:

$$D_{c_1 c_2} = \frac{1}{N_{c_1}}\sum_{i:\,g_{i,c_1}=1}\frac{d_{i,c_2}}{H}, \tag{6}$$

where $d_{i,c_2}$ corresponds to the vertical distance between the pixel $i$ and a reference point belonging to class $c_2$ located in the same column as $i$, and $H$ is the number of rows of the OCT B-scans. As the retinal layers are numbered between two and eight according to their spatial position (from top to bottom; see Fig. 2), when $c_2\in\{2,\dots,8\}$, the reference point is the middle point of the corresponding layer in that specific column. For $c_2=0$ (background) or $c_2=1$ (fluid), the reference point is the point belonging to class $c_2$ that is closest to pixel $i$. In the image columns where there are no background and/or fluid pixels, or the thickness of a retinal layer is equal to zero (i.e., no pixel in the column belongs to that layer), a reference point cannot be defined. In those cases, the distance $d_{i,c_2}$ is defined as the total number of rows of the OCT B-scans.
The penalty factors are calculated as follows:

$$v_{c_1 c_2} = \begin{cases} 0, & \text{if } c_1 = c_2,\\ \beta + \alpha\,D_{c_1 c_2}, & \text{if } c_1, c_2 \in \{2,\dots,8\} \text{ and } c_1 \neq c_2,\\ \gamma, & \text{otherwise.} \end{cases} \tag{7}$$

$\alpha$, $\beta$, and $\gamma$ are constants that must be defined by the user and allow control of the penalization imposed on different classification errors.

As shown in Eq. (7), when a pixel belonging to a retinal layer ($c_1\in\{2,\dots,8\}$) is incorrectly assigned by the model to another layer ($c_2\in\{2,\dots,8\}$, $c_2\neq c_1$), the greater the distance between $c_1$ and $c_2$, the greater the penalty. When the pixels in the background ($c_1=0$) or in the fluid-filled regions ($c_1=1$) are incorrectly assigned to a retinal layer ($c_2\in\{2,\dots,8\}$), or the pixels belonging to a retinal layer are incorrectly assigned to the image background ($c_2=0$) or fluid ($c_2=1$), the penalty is constant and equal to $\gamma$. For correct assignments ($c_1=c_2$), the values of $v_{c_1 c_2}$ are set to 0.
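Under the reconstruction of Eq. (7) given above, the penalty matrix can be assembled as in the following sketch (the exact functional form of the distance-scaled term is an assumption; only the case structure is taken from the text):

```python
import numpy as np

def penalty_matrix(D, alpha=10.0, beta=3.0, gamma=9.0):
    """Assemble the penalty matrix v of Eq. (7) for C = 9 classes
    (0 = background, 1 = fluid, 2-8 = retinal layers), given the
    matrix D of mean normalized inter-class distances [Eq. (6)]."""
    C = D.shape[0]
    v = np.full((C, C), gamma)               # errors involving background/fluid
    v[2:, 2:] = beta + alpha * D[2:, 2:]     # layer-layer errors grow with distance
    np.fill_diagonal(v, 0.0)                 # correct assignments are not penalized
    return v
```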
Figure 4 shows a schematic representation of the weights included in the matrix $v$ by the researchers who introduced the MDL,21 side by side with a representation of the weights used in this research [computed using Eq. (7)]. Wei and Peng21 penalize all errors in the same way (weights equal to one outside the main diagonal), except the ones associated with the “misclassification” of pixels as “fluid” (last column) or incorrect assignments made to fluid pixels (last row). In these two cases, they set the weight to 20. The matrix used by these authors [Fig. 4(a)] also has one more row and one more column than ours [Fig. 4(b)], because they divided the background class into two classes: the “region above the retina” and the “region below the RPE.”
Fig. 4.
Visual representation of the weights included in the matrix $v$ [Eq. (7)]: (a) proposed by Wei and Peng21 and (b) herein proposed. In panel (a), the matrix has dimension $10\times 10$ (two classes associated with the image background - above and below the retinal zone -, seven classes associated with retinal layers, and one class corresponding to fluid). In panel (b), the matrix has dimension $9\times 9$ (one class related to background, one class associated with fluid, and seven classes corresponding to retinal layers).
Because this type of framework (dependent on a single model) has been widely used by the research community for accomplishing this segmentation task, it is herein applied to serve as a baseline for further comparisons.
2.2. Hierarchical Segmentation of Retinal Layers and Fluid-Filled Regions
The HSAs rely on the idea that a complex task can be split into several simpler (but not independent) tasks. Based on that, two different hierarchical frameworks are proposed (Fig. 5).
Fig. 5.
General scheme of the HSA: (a) V1 and (b) V2. The colors of the boxes used for representing the output classes of a model are in agreement with the colors used for showing the result of that model.
The HSA-V1 uses a first model for discriminating the background from the retinal zone, a second model for distinguishing the fluid-filled regions from the background and the retinal layers, and a third model for further identification of the multiple cell layers that coexist in the retina [Fig. 5(a)]. Because the background and fluid classes are not subdivided into “new” ones from one step to the next one(s), it is important to ensure that the segmentation results associated with these classes do not get worse as the complexity of the problem increases. This means that it is crucial to avoid that a pixel correctly identified by a model as “background” or “fluid” is later associated with another image region by a different model. For that purpose, the output of the first model ($M_1$) is used for conditioning the training of the second network ($M_2$), so that errors in pixels that have been correctly identified as “background” by the model $M_1$ are strongly penalized during the training of the model $M_2$. In turn, the output of the model $M_2$ conditions the training of the third model ($M_3$), to prevent it from assigning a pixel correctly identified as “background” or “fluid” by the model $M_2$ to one of the retinal layers. This conditioning effect [represented by the red dashed lines in Fig. 5(a)] is applied by means of the weights assigned to each image pixel during the computation of the loss function used for training the target model.
Let $N$ be the number of pixels in the training set, $C_k$ the total number of class labels considered by model $M_k$ ($k\in\{1,2,3\}$), and $c\in\{0,\dots,C_k-1\}$ the corresponding class label values. Then, the weight $\omega_i^{(k)}$ assigned to the pixel $i$ (belonging to class $c_i$), during the training of network $M_k$ ($k>1$), is

$$\omega_i^{(k)} = \frac{N}{C_k\,N_{c_i}^{(k)}}\times\begin{cases} 1, & \text{if } c_i \text{ is a class newly considered by } M_k,\\ 2, & \text{if } c_i \text{ is carried over and } \hat{y}_i^{(k-1)} \text{ agrees with the GT},\\ 4, & \text{if } c_i \text{ is carried over and } \hat{y}_i^{(k-1)} \text{ disagrees with the GT,} \end{cases} \tag{8}$$

where $N_c^{(k)}$ represents the number of pixels in class $c$, considering the GT used for training the network $M_k$, and $\hat{y}_i^{(k-1)}$ corresponds to the class label assigned to pixel $i$ by the model $M_{k-1}$. For $k=1$, only the first branch applies.
Please note that, during the training of the first network, pixels belonging to the same class have the same contribution to the computation of the loss function, and the weight assigned to them is inversely proportional to the frequency of that class in the corresponding training set. The same happens during the training of the last two networks for the pixels that belong to newly considered classes (e.g., classes 1 and 2 for model $M_2$, and classes 2–8 for model $M_3$). For the remaining pixels, the weight is increased by a factor of two or four, depending on whether the class labels returned by the previous network for those pixels are in agreement with the GT or not [Eq. (8)]. The application of this multiplicative factor in the weights’ computation avoids misclassification errors in pixels that were correctly segmented before as “background” or “fluid,” and it also gives an additional focus to pixels where the previous model tends to fail.
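A sketch of Eq. (8) follows, computed per image for brevity (the paper computes the class frequencies over the whole training set, so the counts below are an approximation):

```python
import numpy as np

def conditioned_weights(gt, prev_pred, new_classes, n_classes):
    """Pixel weights of Eq. (8).

    gt:          (H, W) int array with the GT labels for model M_k.
    prev_pred:   (H, W) int array with the labels predicted by M_{k-1},
                 expressed in the label space of M_k for carried-over classes.
    new_classes: labels that are newly considered by M_k.
    """
    counts = np.bincount(gt.ravel(), minlength=n_classes)
    base = gt.size / (n_classes * counts[gt])    # inverse class-frequency term
    factor = np.ones_like(base)
    carried = ~np.isin(gt, list(new_classes))    # classes kept from the previous model
    factor[carried & (prev_pred == gt)] = 2.0    # protect correct previous decisions
    factor[carried & (prev_pred != gt)] = 4.0    # extra focus on previous failures
    return base * factor
```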
Looking now at the HSA-V2, it presents several properties similar to the first one [Fig. 5(b)]. It is composed of three models, the first one (i.e., $M_4$) being equal to the $M_1$, and the outputs of the first and second networks are also used for conditioning the training of the subsequent networks. However, in this case, the second model ($M_5$) only performs fluid segmentation. Thus, the “no fluid” class includes both the pixels corresponding to the image background and those of the retinal layers. Another difference relies on the fact that the output of the model $M_4$ conditions both the training of the $M_5$ and the $M_6$ models. In the HSA-V2, the weights assigned to the image pixels for the computation of the loss function used for training the model $M_k$ ($k\in\{5,6\}$) are calculated as follows:

$$\omega_i^{(k)} = \frac{N}{C_k\,N_{c_i}^{(k)}}\times\begin{cases} 1, & \text{if } c_i \text{ is a class newly considered by } M_k,\\ 2, & \text{if the labels returned by the conditioning model(s) for pixel } i \text{ agree with the GT},\\ 4, & \text{otherwise,} \end{cases} \tag{9}$$

where the conditioning model of $M_5$ is $M_4$, and the conditioning models of $M_6$ are $M_4$ and $M_5$.
Please note that the meaning of the variables used in Eq. (9), as well as the motivations behind the formulas, are the same as in Eq. (8).
In the case of HSAs, the global loss functions used for models’ optimization varied according to the task for which the models were designed (Table 1), and they also result from the combination of the loss functions mentioned in Sec. 2.1.
Table 1.
Loss functions used for optimizing the models included in the HSAs. $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ represent the weights of the different terms in the global loss functions.

| Approach | Model | Loss function used for the optimization |
|---|---|---|
| HSA-V1 | 1 | $\mathcal{L}_{M_1} = \lambda_1\,\mathcal{L}_{\mathrm{WCE}} + \lambda_2\,\mathcal{L}_{\mathrm{DL}} + \lambda_4\,\Vert W\Vert_F^2$ (10) |
| | 2 | $\mathcal{L}_{M_2} = \lambda_1\,\mathcal{L}_{\mathrm{cWCE}} + \lambda_2\,\mathcal{L}_{\mathrm{DL}} + \lambda_4\,\Vert W\Vert_F^2$ (11) |
| | 3 | $\mathcal{L}_{M_3} = \lambda_1\,\mathcal{L}_{\mathrm{cWCE}} + \lambda_2\,\mathcal{L}_{\mathrm{DL}} + \lambda_3\,\mathcal{L}_{\mathrm{MMDL}} + \lambda_4\,\Vert W\Vert_F^2$ (12) |
| HSA-V2 | 4 | $\mathcal{L}_{M_4} = \lambda_1\,\mathcal{L}_{\mathrm{WCE}} + \lambda_2\,\mathcal{L}_{\mathrm{DL}} + \lambda_4\,\Vert W\Vert_F^2$ (13) |
| | 5 | $\mathcal{L}_{M_5} = \lambda_1\,\mathcal{L}_{\mathrm{cWCE}} + \lambda_2\,\mathcal{L}_{\mathrm{DL}} + \lambda_4\,\Vert W\Vert_F^2$ (14) |
| | 6 | $\mathcal{L}_{M_6} = \lambda_1\,\mathcal{L}_{\mathrm{cWCE}} + \lambda_2\,\mathcal{L}_{\mathrm{DL}} + \lambda_3\,\mathcal{L}_{\mathrm{MMDL}} + \lambda_4\,\Vert W\Vert_F^2$ (15) |
Through the analysis of Table 1, it is possible to observe that the functions used for training the models $M_3$ and $M_6$ [Eqs. (12) and (15)] are very similar to the function used for training the model of the JSA [Eq. (1)]. This is because all these models aim at discriminating the background pixels from fluid-filled regions and segmenting the multiple retinal layers. The only difference among Eqs. (1), (12), and (15) relies on the term associated with the WCE loss. Although in the first one the weights associated with each image pixel in the WCE loss are only dependent on the frequency of its class in the training set [Eq. (3)], in Eqs. (12) and (15) the weights are conditioned by the outputs of the previous model(s) and calculated using Eqs. (8) and (9), respectively. This is the reason why the term is referred to as conditioned WCE (cWCE).
Regarding models $M_1$, $M_2$, $M_4$, and $M_5$, because they were designed for performing simpler tasks (that do not include the segmentation of the multiple retinal layers), the functions used for training those models do not include the term associated with the MMDL. For optimizing the models $M_1$ and $M_4$, the original WCE loss is included in the loss function, because the weight assigned to each image pixel is not dependent on the output of any precedent model. In the case of $M_2$ and $M_5$, the cWCE loss is used instead, because the weights assigned to the image pixels depend on the output of models $M_1$ and $M_4$, respectively. Those weights are calculated using Eq. (8) or Eq. (9), depending on whether it is Eq. (11) or Eq. (14) that is being computed.
The architecture of the models included in the HSAs is the same as the one described in Sec. 2.1 for the model of the JSA. They only differ in the number of neurons in the output layer, which is related to the number of output classes.
3. Experiments
3.1. Dataset
The performance of the several segmentation approaches (JSA, HSA-V1, and HSA-V2) is evaluated on the DUKE dataset.3 This public dataset is composed of 110 annotated OCT B-scans acquired from 10 patients suffering from DME. For each patient, the 11 annotated images always include the B-scan centered at the fovea and 5 scans on either side of it. All images have $496\times768$ pixels and were annotated by two experts. As in most published works, the annotations of expert 1 are herein used as GT, and the annotations of expert 2 are used for comparing the performance of a human expert with the performance of automated methods in the segmentation of retinal layers and fluid-filled regions in OCT B-scans. Although the two experts annotated the same image columns in each B-scan, there are columns in the lateral regions of the images that are not annotated. Thus, the original images were cropped to discard those columns. Because the number of non-annotated columns varies from image to image, the size of the resulting cropped images (hereafter referred to as preprocessed images) is not always the same. As in the work of Wei and Peng,21 during training, we feed the network with fixed-width image patches (vertically cropped from the preprocessed training images, without any overlap), jointly with the corresponding masks; and, during inference, the whole preprocessed test images are given as input to the model. In this case, the variations in the size of the test images are not a problem because the networks are fully convolutional, so the output dimensions adjust to the input size.
As the original GT associated with a B-scan is provided as a binary mask of the fluid-filled regions and a matrix whose eight rows contain the row coordinates of the annotated retinal boundaries (Fig. 1), the GT had to be processed in order to generate the required segmentation masks (similar to the one presented in Fig. 2). For that, a matrix of zeros with the same size as the OCT B-scans is created. Then, the binary mask of the fluid-filled regions is used for setting the fluid pixels to 1. Afterward, the values saved in the first seven rows of the boundary matrix are used for identifying, in each column, the upper limit of the seven retinal layers, and the values saved in the last row are used for marking the lower limit of the last layer (i.e., the BM). All non-fluid pixels located between the upper limit of a layer and the upper limit of the next layer are assigned to the first one.
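The mask-generation step described above can be sketched as follows (a minimal sketch; the handling of columns with missing annotations is omitted):

```python
import numpy as np

def build_label_mask(boundaries, fluid_mask, shape):
    """Build the nine-class GT mask from the original annotations.

    boundaries: (8, W) array; rows 0-6 hold the upper limit (row index)
                of each of the seven retinal layers, row 7 holds the
                lower limit of the last layer (the BM).
    fluid_mask: (H, W) binary mask of the fluid-filled regions.
    shape:      (H, W) of the B-scan.
    """
    mask = np.zeros(shape, dtype=np.uint8)
    rows = np.arange(shape[0])[:, None]          # (H, 1) row indices
    for layer in range(7):                       # layer labels 2-8
        top = boundaries[layer]                  # upper limit of this layer
        bottom = boundaries[layer + 1]           # upper limit of the next one
        mask[(rows >= top) & (rows < bottom)] = layer + 2
    mask[fluid_mask.astype(bool)] = 1            # fluid overrides layer labels
    return mask
```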
Taking into account that some models included in the evaluated segmentation approaches were designed for performing simpler tasks (i.e., models $M_1$, $M_2$, $M_4$, and $M_5$), different GTs also had to be generated from the original segmentation masks for training these models. Table 2 shows how the original labels were combined to generate the “new” ones.
Table 2.
GT used for training the models included in the evaluated frameworks.

| Approach | Model | GT labels | Original labels from which they were derived |
|---|---|---|---|
| HSA-V1 | 1 | 0 (background) | 0 (background) |
| | | 1 (retina) | 1–8 (fluid + seven retinal layers) |
| | 2 | 0 (background) | 0 (background) |
| | | 1 (fluid) | 1 (fluid) |
| | | 2 (retinal layers) | 2–8 (seven retinal layers) |
| HSA-V2 | 4 | 0 (background) | 0 (background) |
| | | 1 (retina) | 1–8 (fluid + seven retinal layers) |
| | 5 | 0 (no fluid) | 0, 2–8 (background + seven retinal layers) |
| | | 1 (fluid) | 1 (fluid) |
| HSA-V1 / HSA-V2 / JSA | 3 / 6 / 0 | 0 (background) | 0 (background) |
| | | 1 (fluid) | 1 (fluid) |
| | | 2 (NFL) | 2 (NFL) |
| | | 3 (GCL/IPL) | 3 (GCL/IPL) |
| | | 4 (INL) | 4 (INL) |
| | | 5 (OPL) | 5 (OPL) |
| | | 6 (ONL/ISM) | 6 (ONL/ISM) |
| | | 7 (ISE) | 7 (ISE) |
| | | 8 (OS/RPE) | 8 (OS/RPE) |
3.2. Experimental Settings
Because the authors of the top-performing state-of-the-art method21 randomly selected eight subjects for training (total of 88 images) and used the remaining two subjects (a total of 22 images) for testing, we keep the same proportion here to make the comparisons as fair as possible. As those authors did not mention what subjects constitute the training and test sets, and we also want to evaluate the generalization ability of the proposed frameworks, a five-fold cross-validation is performed at the subject level. This means that each fold contains OCT B-scans acquired from 2 different subjects (total of 22 images), and there are no images of the same patient in 2 different folds. In each step, one fold is used as test set and the remaining ones constitute the training set.
Because it is necessary, during the training of the HSAs, to obtain the output of the first two models (models $M_1$ and $M_2$, in the case of HSA-V1; and models $M_4$ and $M_5$, in the case of HSA-V2) for all training images before the training of the last one, an independent subject-wise eight-fold cross-validation (one patient per fold) is performed, within each cycle of the five-fold cross-validation procedure, for getting the result associated with each training image.
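One way to realize this nested subject-wise scheme is with scikit-learn's GroupKFold (a sketch; the identifiers below are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical identifiers: 110 B-scans from 10 subjects (11 scans each).
scan_ids = np.arange(110)
subject_ids = np.repeat(np.arange(10), 11)

# Outer five-fold split at the subject level (two subjects per test fold).
for train_idx, test_idx in GroupKFold(n_splits=5).split(scan_ids, groups=subject_ids):
    # Inner subject-wise eight-fold split (one training subject per fold), used
    # to produce M1/M2 (or M4/M5) predictions for every training image.
    inner = GroupKFold(n_splits=8)
    for fit_rel, hold_rel in inner.split(train_idx, groups=subject_ids[train_idx]):
        fit_idx, hold_idx = train_idx[fit_rel], train_idx[hold_rel]
        # Train the auxiliary model(s) on fit_idx and predict on hold_idx.
```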
Regarding the learning process, the weights $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ of the global loss functions [Eq. (1) and Eqs. (10)–(15)] were set to fixed values chosen through the validation procedure described at the end of this paragraph. To compute the values of $v_{c_1 c_2}$ in the MMDL function [Eqs. (5) and (7)], $\alpha$, $\beta$, and $\gamma$ were defined as 10, 3, and 9, respectively. The smoothing term $\epsilon$ used for calculating the DL [Eq. (4)] was set to 1. In turn, the momentum value applied to compute the moving average of the mean and standard deviation of the training batches, in the batch normalization, was set to 0.95. Stochastic gradient descent was the optimizer used for finding the optimal sets of weights ($W$). The learning rate was initially defined as 0.1, and it was reduced by half every 10 epochs. Due to the size of the OCT B-scans, batches of 16 images are used for updating the models’ weights at each iteration. The maximum number of epochs was set to 100, as it was enough for loss function convergence. The values of all aforementioned hyperparameters were selected based on an exhaustive analysis of their impact on the performance of the model included in the JSA for five independent validation sets (in each step of the five-fold cross-validation procedure, one fold out of the four dedicated to training was selected for tuning these hyperparameters). These values were then kept unchanged for training all models included in the HSAs.
To decrease the effect of data scarcity, the number of training examples was increased through data augmentation. The types of transformations allowed were horizontal flips, slight spatial translations (limited to a small fraction of the image’s width), small rotations, and clippings of the lateral regions (removal of a limited fraction of the image columns).
Regarding the networks’ architecture, each network consists of four encoder blocks, one bottleneck block, and four decoder blocks. The choice of the number of blocks was made based on the results presented by Wei and Peng.21
Although the three models included in the two hierarchical approaches must be trained sequentially [due to the conditioning effect that the output of the first models has on the training of the following one(s)], during the test phase they can be run in any order. Furthermore, the use of model $M_3$ or model $M_6$ for segmenting the retinal layers and the fluid-filled regions in a test image does not require the previous application of models $M_1$ and $M_2$, or models $M_4$ and $M_5$.
3.3. Comparative Studies
In this work, different types of comparisons are made. First, we analyze the effect of replacing the original MDL21 with the proposed MMDL in the loss function. For that, the model embedded in the JSA is trained using two different loss functions: the original one proposed by Wei and Peng21 and the modified version herein proposed [Eq. (5)].
The use of different training frameworks is also evaluated. While in the JSA the training phase involves a single model, responsible for segmenting the retinal layers and the fluid (Fig. 2), in the HSAs, two models designed for performing simpler tasks are trained before the last one (which is similar to the model included in the JSA), and their outputs condition the weights to be assigned to each image pixel for the computation of the loss function during the training of the last model (Fig. 5). Regarding the HSA, two variants are tested. In one of them (HSA-V1), the second model is responsible for separating the fluid-filled regions from the retinal layers and background (multiclass segmentation); in the other one (HSA-V2), the second model segments the background and the retinal layers as a whole (“no fluid”) and the fluid regions as another class (binary segmentation).
The performance of the designed JSA and HSAs is also compared with those of the top-performing state-of-the-art methods.4,9,11,21 Please note that only the method proposed by He et al.11 was implemented and evaluated using exactly the same scheme as the proposed segmentation frameworks. The performances of the other state-of-the-art methods were kept from the original papers. This was done because the results presented by He et al.11 were computed using a train/test split different from the one used in the other works.
4. Results and Discussion
To compare the performance of the different segmentation approaches, the DC is computed for each one of the following classes: fluid, nerve fiber layer (NFL), ganglion cell/inner plexiform layer (GCL/IPL), inner nuclear layer (INL), outer plexiform layer (OPL), outer nuclear layer/inner segment myeloid (ONL/ISM), inner segment ellipsoid (ISE) and outer segment/retinal pigment epithelium (OS/RPE) (see Fig. 1). The DC associated with a class is defined as
$$\mathrm{DC}_c = \frac{2\,\mathrm{TP}_c}{2\,\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}, \tag{16}$$

where $\mathrm{TP}_c$ is the total number of pixels that are correctly assigned to class $c$, $\mathrm{FP}_c$ is the total number of pixels that are incorrectly assigned to class $c$, and $\mathrm{FN}_c$ is the total number of pixels belonging to class $c$ that are incorrectly assigned to another class.
Because we use five-fold cross-validation for evaluating the performance of the several segmentation approaches in different test sets, the quantitative results are herein presented in the form of the arithmetic mean and standard deviation of the DC values across the five folds. To verify whether the differences in the mean DC obtained by two different approaches are statistically significant, a Wilcoxon signed-rank test is also performed with a significance level of 0.05.
To quantify the changes on the boundaries of the segmented retinal layers when different loss functions (MDL and MMDL-based loss functions) are used for training the JSA, the mean absolute distance (MAD) is also computed for the eight retinal boundaries: inner limiting membrane (ILM), NFL/GCL, IPL/INL, INL/OPL, OPL/ONL, ISM/ISE, OS/RPE and Bruch’s membrane (BM) (see Fig. 1). For a given image, the MAD between the detected retinal boundary and the GT is defined as
$$\mathrm{MAD} = \frac{1}{n_b}\sum_{j=1}^{n_b}\left|y_j^{\mathrm{seg}} - y_j^{\mathrm{GT}}\right|, \tag{17}$$

where $n_b$ is the number of image columns touched by the boundary both in the GT and in the segmentation result, $y_j^{\mathrm{seg}}$ is the row where the boundary appears in column $j$ of the segmentation result, and $y_j^{\mathrm{GT}}$ is the row where the boundary appears in column $j$ of the GT. After computing this metric for all boundaries and images, the mean and standard deviation are obtained. To check the statistical significance of the difference among the means for the whole image set, when different loss functions are used for training the JSA, a Wilcoxon signed-rank test is also performed with a significance level of 0.05.
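Both metrics are straightforward to compute from the label masks and boundary positions; a sketch follows (boundary positions are assumed to be given per column, with NaN where a boundary is absent):

```python
import numpy as np

def dice_coefficient(pred, gt, cls):
    # Eq. (16): DC_c = 2*TP / (2*TP + FP + FN) for class cls.
    tp = np.sum((pred == cls) & (gt == cls))
    fp = np.sum((pred == cls) & (gt != cls))
    fn = np.sum((pred != cls) & (gt == cls))
    return 2 * tp / (2 * tp + fp + fn)

def mean_absolute_distance(rows_seg, rows_gt):
    # Eq. (17): mean absolute row difference over the columns where the
    # boundary exists both in the segmentation and in the GT.
    valid = ~np.isnan(rows_seg) & ~np.isnan(rows_gt)
    return np.mean(np.abs(rows_seg[valid] - rows_gt[valid]))
```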
4.1. Effect of Using the MDL Versus MMDL-Based Function for Training the JSA
Table 3 shows how the modification of the loss function used in the training phase affects the performance of the JSA, in terms of mean DC. Although the modification introduced in the loss function (more specifically, in the term associated with the mutex relationship among different layers) does not lead to significant improvements in the segmentation of the INL and OPL layers, it considerably improves the segmentation of the remaining five layers and fluid-filled regions. This demonstrates that the application of different penalties to segmentation errors involving different layers (according to their spatial position within the retina), combined with a strong penalization of segmentation errors associated with the classes “background” and “fluid,” gives rise to better results than the application of a strong penalty for errors involving the class “fluid” and lower fixed penalties on the remaining errors.
Table 3.
Results achieved by the JSA when different loss functions (MDL-based or MMDL-based) are used for training, in terms of mean DC.

| Approach | NFL | GCL/IPL | INL | OPL | ONL/ISM | ISE | OS/RPE | Fluid |
|---|---|---|---|---|---|---|---|---|
| JSA + MDL | 0.89 (0.03) | 0.91 (0.01) | 0.85 (0.03) | 0.80 (0.02) | 0.93 (0.03) | 0.92 (0.02) | 0.89 (0.03) | 0.72 (0.01) |
| JSA + MMDL | 0.90† (0.03) | 0.92† (0.02) | 0.85 (0.02) | 0.82 (0.04) | 0.95† (0.02) | 0.93† (0.02) | 0.90† (0.02) | 0.75† (0.02) |
| Expert 2 | 0.86 | 0.90 | 0.79 | 0.74 | 0.94 | 0.86 | 0.82 | 0.58 |

The values presented in the second and third rows correspond to the mean DC obtained for each class across the five folds. The values within brackets are the corresponding standard deviations. The † symbol highlights the cases where the difference between the mean DC obtained by the two approaches (JSA + MDL and JSA + MMDL) is statistically significant ($p<0.05$).
In turn, Table 4 also demonstrates that the boundaries of the segmented retinal layers are, on average, closer to their GT position, when the MMDL-based loss function is used.
Table 4.
Results achieved by the JSA when different loss functions (MDL-based or MMDL-based) are used for training, in terms of the MAD between the boundaries of the segmented layers and the GT.

| Approach | ILM | NFL/GCL | IPL/INL | INL/OPL | OPL/ONL | ISM/ISE | OS/RPE | BM |
|---|---|---|---|---|---|---|---|---|
| JSA + MDL | 1.076 (0.939) | 1.735 (1.814) | 1.824 (3.881) | 1.890 (3.400) | 2.446 (3.786) | 1.023 (1.353) | 1.084 (1.055) | 1.056 (1.029) |
| JSA + MMDL | 1.024† (0.792) | 1.658† (1.684) | 1.819 (3.862) | 1.871 (2.740) | 2.276† (3.705) | 0.980† (1.150) | 1.058† (0.998) | 1.033† (0.964) |
| Expert 2 | 1.266 (1.097) | 1.802 (1.874) | 2.041 (3.905) | 2.124 (3.729) | 2.323 (3.894) | 1.235 (1.164) | 1.281 (1.171) | 1.258 (1.084) |

The values presented in the second and third rows correspond to the MAD between the boundaries of the segmented retinal layers and the manual annotations of expert 1 (GT) along each column. The values within brackets are the corresponding standard deviations. The † symbol highlights the cases where the difference between the MAD obtained for the two approaches (JSA + MDL and JSA + MMDL) is statistically significant ($p<0.05$).
As observed in Fig. 6, the modification applied to the loss function is translated into rougher (and more realistic) boundaries between layers and fewer regions incorrectly segmented as fluid.
Fig. 6.

Segmentation results returned by two variants of the JSA for an OCT B-scan showing signs of DME (i.e., fluid-filled regions). The two variants only differ in the loss functions used for training the model. One of them uses the original MDL-based loss,21 and the other one uses the proposed MMDL-based function [Eq. (5)]. The original image and the expert annotations are also shown to facilitate the qualitative evaluation of the results. The meaning of the colors is the same as in Fig. 2.
4.2. Joint Versus Hierarchical Training Framework
Because the previous experiments showed that better results are achieved when the MMDL-based loss function is used for training, all approaches compared below were trained using this loss function.
Table 5 presents the performances of the three segmentation approaches herein evaluated: the JSA (baseline) and the two proposed HSAs. Through the observation of the mean DC values computed for the seven retinal layers and fluid, it is possible to conclude that both the JSA and the two HSAs outperform expert 2 in this segmentation task. The results of the statistical test also indicate that there are statistically significant improvements in the segmentation of the fluid, as well as of the NFL, GCL/IPL, and ISE layers, when the HSA-V1 is applied instead of the JSA. In contrast, the differences observed in the mean DC values computed for the central layers of the retina, as well as for the OS/RPE layer, are not statistically significant. This may be associated with the fact that the fluid segmentation is very difficult in some images, and the fluid usually appears in the middle of these retinal layers. Regarding the HSA-V2, it presents considerable improvements, when compared with the JSA, for all classes except the ONL/ISM and OS/RPE layers. When compared with the HSA-V1, the HSA-V2 achieves better results mainly in the segmentation of the central layers of the retina (INL, OPL, and ONL/ISM) and fluid-filled regions. This may be associated with the fact that, in the HSA-V1, the output of the first model only conditions the training of the second model, not affecting the training of the last network. Furthermore, the second model is not only dedicated to segmenting the fluid-filled regions, as it is also focused on discriminating the background from the retinal zone. The increased complexity of the problem to be solved by the second model may therefore be the cause of the differences obtained between the two HSAs.
Table 5.
Comparison of the performances of the JSA and the two proposed HSAs (V1 and V2), in terms of mean DC. For all approaches, the MMDL-based loss function is used for training the models.

| Approach | NFL | GCL/IPL | INL | OPL | ONL/ISM | ISE | OS/RPE | Fluid |
|---|---|---|---|---|---|---|---|---|
| JSA | 0.90 (0.03) | 0.92 (0.02) | 0.85 (0.02) | 0.82 (0.04) | 0.95 (0.02) | 0.93 (0.02) | 0.90 (0.02) | 0.75 (0.02) |
| HSA-V1 | 0.92† (0.03) | 0.93† (0.03) | 0.86 (0.01) | 0.83 (0.04) | 0.95 (0.01) | 0.94† (0.02) | 0.90 (0.01) | 0.79† (0.01) |
| HSA-V2 | 0.92† (0.02) | 0.95† (0.02) | 0.88†‡ (0.01) | 0.86†‡ (0.02) | 0.96‡ (0.01) | 0.95† (0.03) | 0.91 (0.03) | 0.82†‡ (0.01) |
| Expert 2 | 0.86 | 0.90 | 0.79 | 0.74 | 0.94 | 0.86 | 0.82 | 0.58 |

The values presented for the automated methods correspond to the mean DC obtained for each class across the five folds. The values within brackets are the standard deviations. The † symbol highlights the cases where the difference between the mean DC achieved by the HSA and the JSA is statistically significant, while the ‡ symbol highlights the cases where the difference between the mean DC obtained by the HSA-V1 and the HSA-V2 is statistically significant ($p<0.05$).
Figures 7 and 8 show the segmentation results obtained by the three tested approaches for two OCT B-scans. One of them corresponds to a normal B-scan (i.e., no fluid is present) and the other one corresponds to a pathological case (i.e., fluid is present). As observed in Fig. 7, there is a small dark region in the normal OCT B-scan (inside the white box) that is incorrectly segmented as fluid by the JSA. Although the HSA-V1 also segments part of this region as fluid, the area of the incorrectly segmented region is much smaller. Furthermore, the NFL (red) and the GCL/IPL (green) layers are better segmented by the HSA-V1 than the JSA (see the transition between these two layers). In the result of the HSA-V2, the several layers are well segmented and no fluid is detected. Thus, it corresponds to the best result when compared to GT.
Fig. 7.
Segmentation results returned by the three evaluated segmentation approaches for a normal OCT B-scan (i.e., no fluid is present). The original image and the expert annotations are also shown to facilitate the qualitative evaluation of the results. The meaning of the colors is the same as in Fig. 2.
Fig. 8.
Segmentation results returned by the three evaluated segmentation approaches for an OCT B-scan where fluid is present. The original image and the expert annotations are also shown to facilitate the qualitative evaluation of the results. The meaning of the colors is the same as in Fig. 2.
Figure 8 shows how different the strategies followed by the two human experts for segmenting the fluid-filled regions are. While expert 1 tends to undersegment these regions, expert 2 tends to oversegment them. Furthermore, there are some fluid-filled regions only segmented by one of them (indicated by the upwards white arrow). As the proposed approaches have been trained using the annotations of expert 1, adjacent fluid areas are commonly segmented as separate regions by the automated methods, rather than being unified into a single one. For the OCT B-scan shown in Fig. 8, all image regions identified as fluid by the JSA are segmented as such by at least one of the experts. However, there is a small dark region on the left side of the B-scan (inside the white box) that was segmented as fluid by the two experts and the JSA does not consider it as such. When compared with the JSA, the segmentation result obtained by the HSA-V1 presents the same limitation, but, in this case, the upper retinal layers are noticeably better segmented. Unlike previous approaches, the HSA-V2 correctly segments that small region on the left side of the image as fluid. The remaining fluid-filled regions identified by at least one expert are also well segmented. It is still possible to observe that the INL (dark green) and OPL (orange) layers are also better segmented by the HSA-V2.
4.3. Comparison with the Top-Performing State-of-the-Art Methods
Table 6 shows the quantitative results achieved by the HSA-V2 (best-performing approach among the proposed ones) side by side with the performances of the four top-performing state-of-the-art methods.
Table 6.
Comparison of the performance of the HSA-V2 and the top-performing state-of-the-art methods.

| Approach | NFL | GCL/IPL | INL | OPL | ONL/ISM | ISE | OS/RPE | Fluid |
|---|---|---|---|---|---|---|---|---|
| Roy et al.9★ | 0.90 | 0.94 | 0.87 | 0.84 | 0.93 | 0.92 | 0.90 | 0.77 |
| Liu et al.4★ | 0.92 | — | — | — | 0.96 | — | 0.93 | — |
| Wei and Peng21★ | 0.91 | 0.95 | 0.88 | 0.86 | 0.95 | 0.92 | 0.88 | 0.81 |
| He et al.11 | 0.91 (0.01) | 0.95 (0.02) | 0.87 (0.01) | 0.85 (0.01) | 0.95 (0.02) | 0.93 (0.01) | 0.90 (0.03) | 0.79 (0.02) |
| HSA-V2 | 0.92 (0.02) | 0.95 (0.02) | 0.88 (0.01) | 0.86 (0.02) | 0.96 (0.01) | 0.95 (0.03) | 0.91 (0.03) | 0.82 (0.01) |
| Expert 2 | 0.86 | 0.90 | 0.79 | 0.74 | 0.94 | 0.86 | 0.82 | 0.58 |

The values presented for the HSA-V2 approach and the method proposed by He et al.11 correspond to the mean DC obtained for each class across the five folds. The values within brackets are the standard deviations. The ★ symbol highlights the cases where the results are presented as reported in the original papers.
The HSA-V2 outperforms the other methods in the segmentation of the ISE layer and fluid-filled regions. For the remaining retinal layers, its performance is at the level of the state-of-the-art methods. Only for the OS/RPE layer is the obtained mean DC lower than the best one. Nevertheless, it is the second-best result.
5. Conclusion
Although the segmentation of the retinal layers and fluid-filled regions in the OCT B-scans is a crucial task for the diagnosis and follow-up of several chorioretinal diseases, this is a very challenging, time-consuming and subjective task, as demonstrated by the results obtained when the manual annotations of two human experts are compared.
In this work, three different approaches are proposed for segmenting the retinal layers and fluid-filled regions in OCT images, and their results are evaluated on the publicly available DUKE dataset. A modification to the MDL,21 MMDL, was also designed to account for the anatomical position of the multiple retinal layers. The obtained results demonstrate that the use of the MMDL-based loss function not only leads to a better segmentation of the fluid but also approximates the boundaries of the segmented retinal layers to the GT positions.
One of the segmentation approaches (JSA) is based on the work of Wei and Peng21 and performs the joint segmentation of the fluid and retinal layers using a single model, the main difference being the use of the MMDL. The other two (HSA-V1 and HSA-V2) are hierarchical approaches using three models trained sequentially, where the first two models perform simpler tasks and their outputs impose conditions on the training of the subsequent model(s). They mainly differ in the task the second model is designed for. The best results are obtained using the HSA-V2 approach, which includes a first model to discriminate the background from the total retina, a second model to exclusively segment the fluid areas, and a third model for segmenting the seven retinal layers and fluid. Because this approach outperforms the human expert in the segmentation of all target regions, and presents better results than the top-performing state-of-the-art methods for the classes “ISE” and “fluid,” without deteriorating the segmentation of the remaining layers, it is suitable for clinical use as a second opinion or as input for the automatic extraction of quantitative biomarkers (e.g., retinal thickness and area occupied by fluid).
As future work, it would be important to evaluate the performance of this method in other datasets, preferably with a higher number of annotated OCT B-scans and greater pathology diversity.
Acknowledgments
This work is financed by National Funds through the FCT - Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) and by the ESF - European Social Fund through the 2014-2020 North Portugal Regional Operational Programme (NORTE 2020) within grant SFRH/BD/145329/2019.
Disclosures
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Contributor Information
Tânia Melo, Email: tania.f.melo@inesctec.pt.
Ângela Carneiro, Email: amvgcarneiro@gmail.com.
Aurélio Campilho, Email: campilho@fe.up.pt.
Ana Maria Mendonça, Email: amendon@fe.up.pt.
References
- 1. Adhi M., Duker J., “Optical coherence tomography – current and future applications,” Curr. Opin. Ophthalmol. 24, 213–221 (2013). 10.1097/ICU.0b013e32835f8bf8
- 2. Marschall S., et al., “Optical coherence tomography - current technology and applications in clinical and biomedical research,” Anal. Bioanal. Chem. 400, 2699–2720 (2011). 10.1007/s00216-011-5008-1
- 3. Chiu S., et al., “Kernel regression based segmentation of optical coherence tomography images with diabetic macular edema,” Biomed. Opt. Express 6(4), 1172–1194 (2015). 10.1364/BOE.6.001172
- 4. Liu X., et al., “Automated layer segmentation of retinal optical coherence tomography images using a deep feature enhanced structured random forests classifier,” IEEE J. Biomed. Health Inf. 23, 1404–1416 (2019). 10.1109/JBHI.2018.2856276
- 5. Karri S., Chakraborthi D., Chatterjee J., “Learning layer-specific edges for segmenting retinal layers with large deformations,” Biomed. Opt. Express 7, 2888–2901 (2016). 10.1364/BOE.7.002888
- 6. Rashno A., et al., “Fully automated segmentation of fluid/cyst regions in optical coherence tomography images with diabetic macular edema using neutrosophic sets and graph algorithms,” IEEE Trans. Biomed. Eng. 65, 989–1001 (2018). 10.1109/TBME.2017.2734058
- 7. Long J., Shelhamer E., Darrell T., “Fully convolutional networks for semantic segmentation,” in IEEE Conf. Comput. Vis. and Pattern Recognit. (CVPR), pp. 3431–3440 (2015). 10.1109/CVPR.2015.7298965
- 8. Ronneberger O., Fischer P., Brox T., “U-net: convolutional networks for biomedical image segmentation,” Lect. Notes Comput. Sci. 9351, 234–241 (2015). 10.1007/978-3-319-24574-4_28
- 9. Roy A., et al., “ReLayNet: retinal layer and fluid segmentation of macular optical coherence tomography using fully convolutional network,” Biomed. Opt. Express 8(8), 3627–3642 (2017). 10.1364/BOE.8.003627
- 10. Kepp T., et al., “Topology-preserving shape-based regression of retinal layers in OCT image data using convolutional neural networks,” in IEEE 16th Int. Symp. Biomed. Imaging (ISBI 2019), pp. 1437–1440 (2019). 10.1109/ISBI.2019.8759261
- 11. He Y., et al., “Structured layer surface segmentation for retina OCT using fully convolutional regression networks,” Med. Image Anal. 68, 101856 (2021). 10.1016/j.media.2020.101856
- 12. Ma D., et al., “LF-UNet – a novel anatomical-aware dual-branch cascaded deep neural network for segmentation of retinal layers and fluid from optical coherence tomography images,” Comput. Med. Imaging Graph. 94, 101988 (2021). 10.1016/j.compmedimag.2021.101988
- 13. Montuoro A., et al., “Joint retinal layer and fluid segmentation in OCT scans of eyes with severe macular edema using unsupervised representation and auto-context,” Biomed. Opt. Express 8, 1874 (2017). 10.1364/BOE.8.001874
- 14. Novosel J., et al., “Locally-adaptive loosely-coupled level sets for retinal layer and fluid segmentation in subjects with central serous retinopathy,” in IEEE 13th Int. Symp. Biomed. Imaging (ISBI), pp. 702–705 (2016). 10.1109/ISBI.2016.7493363
- 15. Melinščak M., et al., “Annotated retinal optical coherence tomography images (AROI) database for joint retinal layer and fluid segmentation,” Automatika 62, 375–385 (2021). 10.1080/00051144.2021.1973298
- 16. Sappa L. B., et al., “RetFluidNet: retinal fluid segmentation for SD-OCT images using convolutional neural network,” J. Digit. Imaging 34(3), 691–704 (2021). 10.1007/s10278-021-00459-w
- 17. Liu X., et al., “Automatic fluid segmentation in retinal optical coherence tomography images using attention based deep learning,” Neurocomputing 452, 576–591 (2021). 10.1016/j.neucom.2020.07.143
- 18. Lin M., et al., “Recent advanced deep learning architectures for retinal fluid segmentation on optical coherence tomography images,” Sensors 22(8), 3055 (2022). 10.3390/s22083055
- 19. Bogunović H., et al., “RETOUCH: the retinal OCT fluid detection and segmentation benchmark and challenge,” IEEE Trans. Med. Imaging 38, 1858–1874 (2019). 10.1109/TMI.2019.2901398
- 20. Milletari F., Navab N., Ahmadi S.-A., “V-Net: fully convolutional neural networks for volumetric medical image segmentation,” in Fourth Int. Conf. 3D Vis. (3DV), pp. 565–571 (2016). 10.1109/3DV.2016.79
- 21. Wei H., Peng P., “The segmentation of retinal layer and fluid in SD-OCT images using mutex Dice loss based fully convolutional networks,” IEEE Access 8, 60929–60939 (2020). 10.1109/ACCESS.2020.2983818