Author manuscript; available in PMC: 2022 Sep 21.
Published in final edited form as: IEEE Trans Med Robot Bionics. 2022 May;4(2):335–338. doi: 10.1109/TMRB.2022.3170215

Simultaneous Depth Estimation and Surgical Tool Segmentation in Laparoscopic Images

Baoru Huang 1,2,, Anh Nguyen 1,5, Siyao Wang 1, Ziyang Wang 3, Erik Mayer 2, David Tuch 4, Kunal Vyas 4, Stamatia Giannarou 1,2, Daniel S Elson 1,2
PMCID: PMC7613616  EMSID: EMS153972  PMID: 36148137

Abstract

Surgical instrument segmentation and depth estimation are crucial steps to improve autonomy in robotic surgery. Most recent works treat these problems separately, making deployment challenging. In this paper, we propose a unified framework for depth estimation and surgical tool segmentation in laparoscopic images. The network has an encoder-decoder architecture and comprises two branches that perform depth estimation and segmentation simultaneously. To train the network end to end, we propose a new multi-task loss function that effectively learns to estimate depth in an unsupervised manner while requiring only semi-ground truth for surgical tool segmentation. We conducted extensive experiments on different datasets to validate the approach. The results show that the end-to-end network improves the state of the art for both tasks while reducing complexity during deployment.

Index Terms: Deep learning, Self-supervised depth estimation, Surgical instrument segmentation, Multi-task learning

I. INTRODUCTION

Minimally invasive surgery (MIS), including robot-assisted procedures, provides significant advantages such as reducing operative trauma and the risk of infection. Advanced robotic surgery systems such as the da Vinci surgical platform [1] allow multiple types of information to be integrated together with effective feedback to the surgeon. However, interpreting visual surgical data is complex and involves many tasks such as tissue deformation modeling [2], tool tracking [3], and scene depth estimation [4].

In recent years there has been much work on depth estimation and surgical tool segmentation. Notably, learning-based algorithms have shown an excellent capability for predicting depth from color images and for segmenting images into meaningful regions. These depth-predicting algorithms may use monocular or stereo input data, with supervised, self-supervised, or unsupervised [5] training approaches depending on the availability of ground truth labels. Instrument segmentation may also use supervised or unsupervised methods [6]. Knowing the tissue depth and the instrument masks could facilitate tissue scanning [7] or dynamic image overlays [8], which are useful for laparoscopic surgery.

To date, depth estimation and surgical tool segmentation have been mainly treated as separate challenges, requiring time-consuming sequential task completion. In this work, we propose a novel unified framework, SDSNet, that performs simultaneous depth estimation and surgical tool segmentation. Our method does not require manually labeled ground truth and achieves state-of-the-art performance on both tasks while reducing deployment complexity.

II. RELATED WORK

Depth Estimation

Most existing methods treat depth estimation as a supervised regression problem [9]; however, collecting per-pixel ground truth for laparoscopic imaging is challenging. To overcome this limitation, Liu et al. [10] introduced a self-supervised algorithm for dense depth estimation in stereo endoscopy. The authors in [11] proposed a geometry-aware network for motion estimation. By enforcing consistency between left and right RGB images, Godard et al. [12] produced results that outperformed contemporary supervised methods.

Surgical Tool Segmentation

Semantic segmentation of robotic instruments has also attracted a lot of attention in robot-assisted surgery research. Some discriminative models such as Naive Bayesian classifiers [13] and maximum likelihood Gaussian Mixture Models [14] can be trained on color features. More recently, the state of the art has increasingly focused on fully convolutional neural networks; for example, the authors in [15] used a CNN to segment robotic tools.

Simultaneous Depth Estimation and Segmentation

Depth estimation and segmentation are usually tackled separately, with few works unifying both tasks, especially for laparoscopy. Most recent methods use RGB images as training data; for instance, in EdgeStereo [16] the authors incorporated edge detection to accurately estimate depth changes across object boundaries. In medical imaging, self-supervised depth estimation was used to regularize semantic segmentation during knee arthroscopy [17].

III. METHODOLOGY

An overview of our proposed SDSNet can be found in Fig. 1. We first combined the depth estimation and tool segmentation tasks by sharing an encoder network, which extracted essential geometric features from the input images. After the encoder, the features flowed separately into two branches (segmentation and depth estimation). By forcing the disparity map to generate a reconstructed input image that is consistent with the original, we could derive an accurate disparity map for depth inference.

Fig. 1. The detailed architecture of SDSNet. The depth branch and the segmentation branch share the same encoder network. The features of the third convolutional layer in the segmentation branch decoder are fused with the features from the fourth block in the depth branch decoder. Only one ConvBlock is drawn to represent all repeated blocks for better visualization.
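To make the shared-encoder, two-branch layout concrete, the following is a minimal PyTorch sketch of the overall structure. The class name, the toy one-layer decoder heads, and the use of torchvision's ResNet50 constructor are our illustrative assumptions; the actual decoders are described in the subsections below.

```python
import torch
import torch.nn as nn
from torchvision import models

class SDSNetSketch(nn.Module):
    """Toy sketch: one shared encoder feeding a depth head and a segmentation head."""
    def __init__(self, num_classes=2):
        super().__init__()
        resnet = models.resnet50(weights=None)
        # Shared encoder: ResNet50 without its pooling and fully connected head.
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])
        # Simplified stand-ins for the two decoders (1 disparity channel, k classes).
        self.depth_head = nn.Sequential(nn.Conv2d(2048, 1, 3, padding=1), nn.Sigmoid())
        self.seg_head = nn.Conv2d(2048, num_classes, 3, padding=1)

    def forward(self, x):
        feats = self.encoder(x)                        # b x 2048 x H/32 x W/32
        disp = nn.functional.interpolate(
            self.depth_head(feats), size=x.shape[-2:],
            mode="bilinear", align_corners=False)      # full-resolution disparity
        seg = nn.functional.interpolate(
            self.seg_head(feats), size=x.shape[-2:],
            mode="bilinear", align_corners=False)      # full-resolution class logits
        return disp, seg

left = torch.randn(2, 3, 192, 384)                     # b x 3 x 192 x 384 input batch
disp, seg = SDSNetSketch()(left)
```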

Depth Estimation Branch

The depth estimation branch was based on the general U-Net architecture [18], i.e. an encoder-decoder network with skip connections, which represents local information as well as deep abstract features. The size of the input batch was b × 3 × 192 × 384, where b was the batch size, 3 was the number of channels, and 192 × 384 was the size of the input image. A ResNet50 was adopted as the encoder to extract features from the input color image.

The decoder consisted of five cascaded blocks at multiple scales. Previously, multi-scale depth prediction and image reconstruction relied on the gradient locality of a bilinear sampler [19], which was prone to creating ‘holes’ and texture-copy artifacts in large low-texture regions. In our work, similar to [12], this problem is tackled by decoupling the resolutions of the disparity maps and the corresponding color images used to compute the reprojection error: the lower-resolution depth maps were first upsampled to the input image resolution and then reprojected and resampled. From the second block onward, the output of each block was passed through a convolutional layer followed by a sigmoid activation function, generating the disparity map at that scale. In total, four scales were used, with output sizes b × 1 × 24 × 48, b × 1 × 48 × 96, b × 1 × 96 × 192, and b × 1 × 192 × 384. The largest of these was the final disparity map, which was the same size as the input image.
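A small sketch of the multi-scale disparity heads follows: each scale's conv + sigmoid output is upsampled to the full input resolution before being used for reprojection, as described above. The channel counts of the example feature maps and the block internals are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DispHead(nn.Module):
    """Conv + sigmoid head that turns decoder features into a disparity map."""
    def __init__(self, in_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)

    def forward(self, feats, full_size=(192, 384)):
        disp = torch.sigmoid(self.conv(feats))                  # b x 1 x h x w
        # Upsample the low-resolution disparity to the input resolution so the
        # photometric loss is always computed at the full image size.
        return F.interpolate(disp, size=full_size, mode="bilinear",
                             align_corners=False)

# Example: four decoder feature maps at 24x48, 48x96, 96x192 and 192x384.
feature_maps = [torch.randn(2, c, h, w) for c, h, w in
                [(128, 24, 48), (64, 48, 96), (32, 96, 192), (16, 192, 384)]]
disps = [DispHead(f.shape[1])(f) for f in feature_maps]         # all b x 1 x 192 x 384
```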

The final disparity map (the sigmoid output) $\hat{D}$ was converted to a depth map by $D = 1/(a\hat{D} + b)$, where $a$ and $b$ were chosen to constrain $D$ to between 0.1 and 80 units.
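A minimal sketch of this conversion, assuming $a$ and $b$ are set so that a sigmoid output of 1 maps to the minimum depth (0.1) and 0 maps to the maximum depth (80):

```python
import torch

def disp_to_depth(disp_sigmoid, min_depth=0.1, max_depth=80.0):
    # Choose a and b so that depth = 1 / (a * D_hat + b) stays in [min_depth, max_depth].
    min_disp = 1.0 / max_depth
    max_disp = 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp_sigmoid   # a * D_hat + b
    return 1.0 / scaled_disp                                        # depth in [0.1, 80]

depth = disp_to_depth(torch.rand(2, 1, 192, 384))
```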

Segmentation Branch

The shared encoder features were also fed into the segmentation decoder, which consisted of five convolutional layers and three upsampling layers to interpolate the features to the full image resolution. The first layer took the b × 2048 × 6 × 12 encoder output and produced b × 256 × 6 × 12 features, followed by an ELU activation function. Each upsampling layer interpolated the features to four times their input size. After the third convolutional layer, the features were concatenated with the features from the fourth block of the depth estimation decoder to perform feature fusion between the two branches. The size of the segmentation subnetwork output was b × k × 192 × 384, where k = 2 was the number of classes. We generated the surgical instrument segmentation semi-ground truth by applying the network from [20] pretrained on the EndoVis dataset [21].
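A simplified sketch of this decoder is given below: conv + ELU layers, upsampling back towards full resolution, and a concatenation ("fusion") of depth-decoder features after the third convolution. The intermediate channel counts and the 64-channel depth feature map are our assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegDecoderSketch(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.conv1 = nn.Conv2d(2048, 256, 3, padding=1)   # b x 256 x 6 x 12
        self.conv2 = nn.Conv2d(256, 128, 3, padding=1)
        self.conv3 = nn.Conv2d(128, 64, 3, padding=1)
        # After the third conv, depth-decoder features (64 channels assumed)
        # are concatenated with the segmentation features.
        self.conv4 = nn.Conv2d(64 + 64, 32, 3, padding=1)
        self.conv5 = nn.Conv2d(32, num_classes, 3, padding=1)

    def forward(self, enc_feats, depth_feats, out_size=(192, 384)):
        x = F.elu(self.conv1(enc_feats))
        x = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
        x = F.elu(self.conv2(x))
        x = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
        x = F.elu(self.conv3(x))
        depth_feats = F.interpolate(depth_feats, size=x.shape[-2:],
                                    mode="bilinear", align_corners=False)
        x = torch.cat([x, depth_feats], dim=1)            # feature fusion
        x = F.elu(self.conv4(x))
        x = F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
        return self.conv5(x)                              # b x k x 192 x 384

logits = SegDecoderSketch()(torch.randn(2, 2048, 6, 12), torch.randn(2, 64, 96, 192))
```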

Multi-Task Loss

The network was trained end to end using a multi-task loss function $C_t^l$, which was defined as

$C_t^l = \alpha_{dp} C_{dp}^l + \alpha_{sg} C_{sg}^l$  (1)

where $C_{dp}^l$ is the loss from the depth estimation branch and $C_{sg}^l$ is the loss from semantic segmentation, as described below.
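A minimal sketch of Eq. (1), using the weight values reported later in Section IV ($\alpha_{dp}$ = 10, $\alpha_{sg}$ = 1); the helper name is ours and the two loss terms are sketched in the following subsections.

```python
def multi_task_loss(depth_loss, seg_loss, alpha_dp=10.0, alpha_sg=1.0):
    # Eq. (1): weighted sum of the depth and segmentation branch losses.
    return alpha_dp * depth_loss + alpha_sg * seg_loss
```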

1. Depth Loss

In the depth estimation branch, the depth loss $C_{dp}^l$ consisted of the appearance matching loss $C_{ap}^l$ and the disparity smoothness loss $C_{ds}^l$:

$C_{dp}^l = \sum_{s=1}^{4} C_s^l = \sum_{s=1}^{4} \left( C_{ap}^l + \alpha_{ds} C_{ds}^l \right)$  (2)

where $\alpha_{ds}$ was set to 0.001.
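The per-scale combination of Eq. (2) can be sketched as follows, with the two terms computed by the functions sketched in the next two subsections; the helper name is ours.

```python
def depth_loss(appearance_losses, smoothness_losses, alpha_ds=0.001):
    # appearance_losses / smoothness_losses: one scalar tensor per scale (4 scales).
    return sum(c_ap + alpha_ds * c_ds
               for c_ap, c_ds in zip(appearance_losses, smoothness_losses))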

Appearance Matching Loss

The appearance matching loss $C_{ap}^l$ forced the reconstructed image to be similar to the corresponding training input and was computed at the full input resolution. During training, the autoencoder in the depth estimation branch generated a disparity map $\hat{D}_t$ from the input left color image $I_t^l$. This map was then used by an image sampler from the Spatial Transformer Network (STN) [19], together with the right input image $I_t^r$ (the counterpart of $I_t^l$), to reconstruct the left image $I_t^{l*}$. The sampler used bilinear interpolation, so each output pixel was the weighted sum of four input pixels. This bilinear sampler was locally fully differentiable and could be seamlessly integrated into the fully convolutional architecture, in contrast to [22]; hence, there was no need to simplify or approximate the cost function. As in [23], we applied a combination of an L1 loss and the structural similarity (SSIM) index as the photometric image reconstruction cost $C_{ap}^l$. Training the depth estimation network then required minimizing the reconstruction loss between the reconstructed image $I^{l*}$ and the corresponding training input $I^l$, where $N$ denotes the number of pixels:

$C_{ap}^l = \frac{1}{N} \sum_{i,j} \frac{\gamma}{2} \left( 1 - \mathrm{SSIM}\left( I_{ij}^{l}, I_{ij}^{l*} \right) \right) + (1 - \gamma) \left\lVert I_{ij}^{l} - I_{ij}^{l*} \right\rVert_1$  (3)

Similar to [12], the SSIM was simplified to a 3 × 3 block filter rather than a Gaussian, and γ was set to 0.85.
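A sketch of this loss is given below: the right image is warped into the left view with a bilinear sampler, and the reconstruction is compared to the left input with the weighted SSIM + L1 cost of Eq. (3), using a 3 × 3 block filter for SSIM and γ = 0.85. The convention that the sigmoid disparity is expressed as a fraction of image width, and the function names, are our assumptions.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # SSIM with a 3x3 block (average-pool) filter instead of a Gaussian.
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp(num / den, 0, 1)

def warp_right_to_left(right, disp):
    """Bilinear-sample the right image at x - d to reconstruct the left image."""
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=right.device),
                            torch.linspace(-1, 1, w, device=right.device),
                            indexing="ij")
    grid = torch.stack((xs, ys), -1).unsqueeze(0).expand(b, -1, -1, -1).clone()
    grid[..., 0] = grid[..., 0] - 2 * disp.squeeze(1)   # shift by normalised disparity
    return F.grid_sample(right, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

def appearance_matching_loss(left, right, disp, gamma=0.85):
    recon = warp_right_to_left(right, disp)
    ssim_term = (1 - ssim(left, recon)) * (gamma / 2)
    l1_term = (1 - gamma) * torch.abs(left - recon)
    return (ssim_term + l1_term).mean()
```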

Disparity Smoothness Loss

Smooth disparities were favored by this loss and, since depth discontinuities usually occur at image gradients [12], this cost was weighted by an edge-aware term based on the image gradients $\partial I$:

$C_{ds}^l = \frac{1}{N} \sum_{i,j} \left| \partial_x \hat{D}_{ij}^{l} \right| e^{-\left| \partial_x I_{ij}^{l} \right|} + \left| \partial_y \hat{D}_{ij}^{l} \right| e^{-\left| \partial_y I_{ij}^{l} \right|}$  (4)
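A minimal sketch of Eq. (4), approximating the gradients with finite differences (the function name is ours):

```python
import torch

def disparity_smoothness_loss(disp, image):
    # disp: b x 1 x H x W, image: b x 3 x H x W.
    disp_dx = torch.abs(disp[:, :, :, :-1] - disp[:, :, :, 1:])
    disp_dy = torch.abs(disp[:, :, :-1, :] - disp[:, :, 1:, :])
    img_dx = torch.mean(torch.abs(image[:, :, :, :-1] - image[:, :, :, 1:]), 1, keepdim=True)
    img_dy = torch.mean(torch.abs(image[:, :, :-1, :] - image[:, :, 1:, :]), 1, keepdim=True)
    # Edge-aware weighting: penalise disparity gradients less where image gradients are large.
    return (disp_dx * torch.exp(-img_dx)).mean() + (disp_dy * torch.exp(-img_dy)).mean()
```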

2. Segmentation Loss

The segmentation branch only considered the full-resolution image in order to reduce computational complexity. For the segmentation subnetwork, given a sequence of input images and the corresponding sequence of semi-ground-truth segmentation annotations, we performed end-to-end training by minimizing the normalized pixel-wise cross-entropy loss [24], denoted $C_{sg}^l$:

$C_{sg}^l = -\sum_{i=1}^{N} \hat{y}_i \log\left( y_i \right)$  (5)

where $y_i$ and $\hat{y}_i$ denote the predicted value and the semi-ground-truth label, respectively.
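In PyTorch this term reduces to a standard pixel-wise cross-entropy against the semi-ground-truth masks; a minimal sketch (function name ours):

```python
import torch
import torch.nn.functional as F

def segmentation_loss(seg_logits, semi_gt):
    # seg_logits: b x k x H x W class scores, semi_gt: b x H x W integer labels.
    # cross_entropy with mean reduction gives the normalized pixel-wise loss.
    return F.cross_entropy(seg_logits, semi_gt)

loss = segmentation_loss(torch.randn(2, 2, 192, 384),
                         torch.randint(0, 2, (2, 192, 384)))
```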

Training

As there was no per-pixel depth ground truth label available, the depth estimation relied on image reconstruction similarity and was trained in a self-supervised manner. For depth estimation, data augmentation was performed by horizontally flipping 50% of the input images. For segmentation, the semi-ground truth was provided for supervised training. The whole SDSNet was trained end to end with the combination of losses from each branch, producing both a depth map and a segmentation map.
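A sketch of one such end-to-end training step, combining the self-supervised depth loss and the supervised segmentation loss, is shown below. It reuses the hypothetical model and loss helpers defined in the earlier sketches, shows a single output scale for brevity, and assumes the 50% horizontal-flip augmentation is applied in the data loader.

```python
import torch

def training_step(model, optimizer, left, right, semi_gt):
    disp, seg_logits = model(left)                          # shared encoder, two branches
    c_dp = appearance_matching_loss(left, right, disp) \
        + 0.001 * disparity_smoothness_loss(disp, left)     # Eq. (2), single scale shown
    c_sg = segmentation_loss(seg_logits, semi_gt)           # Eq. (5)
    loss = multi_task_loss(c_dp, c_sg)                      # Eq. (1) with alpha_dp=10, alpha_sg=1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```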

IV. EXPERIMENTS

Experimental Setup

We evaluated our SDSNet on two datasets: Dsia [4] and Dpor [25]. For the depth estimation branch, similar to [26], we used the SSIM index to evaluate the unsupervised depth estimation. To evaluate the result of the segmentation branch, we manually labeled 400 images with the surgical tool ground truth and the segmentation performance was assessed by the Jaccard index and the Dice Score [18].
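For reference, the segmentation metrics can be computed as in the sketch below (the SSIM used for the depth branch is the same block-filter SSIM sketched for the appearance loss); the function name is ours.

```python
import torch

def iou_and_dice(pred_mask, gt_mask, eps=1e-6):
    # pred_mask, gt_mask: boolean tensors of shape H x W.
    inter = (pred_mask & gt_mask).sum().float()
    union = (pred_mask | gt_mask).sum().float()
    iou = (inter + eps) / (union + eps)                      # Jaccard index
    dice = (2 * inter + eps) / (pred_mask.sum() + gt_mask.sum() + eps)
    return iou.item(), dice.item()

iou, dice = iou_and_dice(torch.rand(192, 384) > 0.5, torch.rand(192, 384) > 0.5)
```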

Baseline

For depth estimation, we compared the results to those from the Basic and Siamese architectures [4], Monodepth2 [26], and two non-learning methods, SPS [27] and ELAS [28]. For surgical instrument segmentation, results from the SDSNet were compared with the popular U-Net [18] architecture.

Implementation

The SDSNet model was implemented in PyTorch [29], with a batch size of 16 and an input/output resolution of 192 × 384. The learning rate was set to $10^{-4}$ for the first 15 epochs and dropped to $10^{-5}$ for the remainder. The hyperparameters $\alpha_{dp}$ and $\alpha_{sg}$ in Equation (1) were empirically set to 10 and 1, respectively. The network was trained for 20 epochs using the Adam optimizer [30], and training took about 8 hours on a single NVIDIA 2080 Ti GPU.
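A minimal sketch of this optimization schedule, reusing the hypothetical model and training step from the earlier sketches:

```python
import torch

model = SDSNetSketch()                          # from the architecture sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Drop the learning rate from 1e-4 to 1e-5 after epoch 15.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15], gamma=0.1)

for epoch in range(20):
    # for left, right, semi_gt in train_loader:   # hypothetical data loader
    #     training_step(model, optimizer, left, right, semi_gt)
    scheduler.step()
```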

V. RESULTS

Table I summarizes the SDSNet results together with those of other depth estimation methods, using the mean and standard deviation (std.) of the SSIM index. SDSNet outperforms the other methods; more specifically, it is 1.6% higher than Monodepth2 [26] and 12.4% higher than the Siamese architecture [4]. This is a significant improvement and, interestingly, the best result was achieved when the depth estimation branch and the segmentation branch were fused together, with the added benefit of surgical instrument segmentation included.

TABLE I. SSIM Scores on the Dsia Test Set.

| Method | Mean SSIM | Std. SSIM |
| --- | --- | --- |
| ELAS [28] | 47.3 | 0.079 |
| SPS [27] | 54.7 | 0.092 |
| V-Basic [4] | 55.5 | 0.106 |
| V-Siamese [4] | 60.4 | 0.066 |
| Monodepth [12] | 58.4 | 0.114 |
| Monodepth2 [26] | 71.2 | 0.075 |
| SDSNet without fusion (ours) | 71.9 | 0.079 |
| SDSNet with fusion (ours) | 72.8 | 0.073 |

Table II summarizes the segmentation results using the IoU and Dice index. It can be seen that SDSNet is not only computationally efficient but also produces superior segmentation results, 5.28% higher than U-Net [18] for IoU and 5.85% higher for the Dice index. Table II also confirms that using a fusion operation when performing depth estimation and segmentation simultaneously improves the segmentation result. Example qualitative results are presented in Fig. 2, showing that SDSNet provides consistent depth estimation and accurate segmentation simultaneously.

TABLE II. Segmentation Results on the Dsia Test Set.

| Method | IoU | Dice | Time |
| --- | --- | --- | --- |
| U-Net [18] | 71.16 | 80.90 | 0.22 |
| SDSNet (segmentation only) (ours) | 73.34 | 84.13 | 0.30 |
| SDSNet without fusion (ours) | 73.44 | 84.59 | 0.38 |
| SDSNet with fusion (ours) | 74.92 | 85.63 | 0.35 |

Fig. 2. Qualitative results on stereo pairs. From top to bottom: the left input image, the corresponding right image in the stereo pair, the predicted depth map, and the segmentation result from SDSNet.

Generalization

To validate the generalization ability of our network, an additional experiment used the model trained on the Dsia dataset and tested it directly on the Dpor dataset, without retraining. Table III presents the results of SDSNet and Monodepth2 in this experiment using the SSIM index. Overall, SDSNet with fusion of the segmentation and depth estimation branches achieved a higher SSIM index, confirming that SDSNet generalizes well across different datasets while achieving competitive performance compared to recent state-of-the-art methods.

TABLE III. SSIM Scores on the Dpor Test Set.

| Method | Mean SSIM | Std. SSIM |
| --- | --- | --- |
| Monodepth2 [26] | 76.67 | 0.047 |
| SDSNet with fusion (ours) | 77.53 | 0.041 |

VI. CONCLUSIONS

In this work we have presented SDSNet, a joint learning network that can simultaneously segment surgical tools and estimate the depth of each pixel. The proposed fusion network achieved state-of-the-art performance in both tasks. Moreover, the framework does not require any depth labels or manual segmentation ground truth, which makes it well suited to large-scale in vivo video processing, where per-pixel depth maps and manual segmentation labels are difficult to obtain.

ACKNOWLEDGMENT

This work was supported by the UK National Institute for Health Research (NIHR) Invention for Innovation Award NIHR200035, the Cancer Research UK Imperial Centre, the Royal Society (UF140290) and the NIHR Imperial Biomedical Research Centre.

References

  • [1] Zhang D, Xiao B, Huang B, Zhang L, Liu J, Yang G-Z. A self-adaptive motion scaling framework for surgical robot remote control. RAL. 2018.
  • [2] Zhang S, Ye M, Gras G, Leibrandt K, Marcus HJ, Yang G-Z. Vision-based deformation recovery for intraoperative force estimation of tool–tissue interaction for neurosurgery. International Journal of Computer Assisted Radiology and Surgery. 2016. doi: 10.1007/s11548-016-1361-z.
  • [3] Huang B, Tsai Y-Y, Cartucho J, Vyas K, Tuch D, Giannarou S, Elson DS. Tracking and visualization of the sensing area for a tethered laparoscopic gamma probe. IJCARS. 2020. doi: 10.1007/s11548-020-02205-z.
  • [4] Ye M, Johns E, Handa A, Zhang L, Pratt P, Yang G-Z. Self-supervised siamese learning on stereo image pairs for depth estimation in robotic surgery. arXiv preprint arXiv:1705.08260. 2017.
  • [5] Zhou T, Brown M, Snavely N, Lowe DG. Unsupervised learning of depth and ego-motion from video. CVPR. 2017.
  • [6] Liu D, Wei Y, Jiang T, Wang Y, Miao R, Shan F, Li Z. Unsupervised surgical instrument segmentation via anchor generation and semantic diffusion. MICCAI. 2020.
  • [7] Zhan J, Cartucho J, Giannarou S. Autonomous tissue scanning under free-form motion for intraoperative tissue characterisation. ICRA. 2020.
  • [8] Zevallos N, Srivatsan RA, Salman H, Li L, Qian J, Saxena S, Xu M, Patath K, Choset H. A surgical system for automatic registration, stiffness mapping and dynamic image overlay. ISMR. 2018.
  • [9] Hu J, Ozay M, Zhang Y, Okatani T. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. WACV. 2019.
  • [10] Liu X, Sinha A, Ishii M, Hager GD, Reiter A, Taylor RH, Unberath M. Dense depth estimation in monocular endoscopy with self-supervised learning methods. TMI. 2019. doi: 10.1109/TMI.2019.2950936.
  • [11] Vijayanarasimhan S, Ricco S, Schmid C, Sukthankar R, Fragkiadaki K. SfM-Net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804. 2017.
  • [12] Godard C, Mac Aodha O, Brostow GJ. Unsupervised monocular depth estimation with left-right consistency. CVPR. 2017.
  • [13] Speidel S, Delles M, Gutt C, Dillmann R. Tracking of instruments in minimally invasive surgery for surgical skill analysis. International Workshop on Medical Imaging and Virtual Reality. 2006.
  • [14] Pezzementi Z, Voros S, Hager GD. Articulated object tracking by rendering consistent appearance parts. ICRA. 2009.
  • [15] García-Peraza-Herrera LC, Li W, Gruijthuijsen C, Devreker A, Attilakos G, Deprest J, Vander Poorten E, Stoyanov D, Vercauteren T, Ourselin S. Real-time segmentation of non-rigid surgical tools based on deep learning and tracking. International Workshop on Computer-Assisted and Robotic Endoscopy. 2016.
  • [16] Song X, Zhao X, Hu H, Fang L. EdgeStereo: A context integrated residual pyramid network for stereo matching. ACCV. 2018.
  • [17] Liu F, Jonmohamadi Y, Maicas G, Pandey AK, Carneiro G. Self-supervised depth estimation to regularise semantic segmentation in knee arthroscopy. MICCAI. 2020.
  • [18] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. MICCAI. 2015.
  • [19] Jaderberg M, Simonyan K, Zisserman A, Kavukcuoglu K. Spatial transformer networks. arXiv preprint arXiv:1506.02025. 2015.
  • [20] Shvets AA, Rakhlin A, Kalinin AA, Iglovikov VI. Automatic instrument segmentation in robot-assisted surgery using deep learning. ICMLA. 2018.
  • [21] MICCAI 2017 endoscopic vision challenge: Robotic instrument segmentation sub-challenge. 2017. [Online]. Available: https://endovissub2017-roboticinstrumentsegmentation.grand-challenge.org/Data/
  • [22] Garg R, BG VK, Carneiro G, Reid I. Unsupervised CNN for single view depth estimation: Geometry to the rescue. ECCV. 2016.
  • [23] Zhao H, Gallo O, Frosio I, Kautz J. Loss functions for neural networks for image processing. arXiv preprint arXiv:1511.08861. 2015.
  • [24] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. CVPR. 2015. doi: 10.1109/TPAMI.2016.2572683.
  • [25] Mountney P, Stoyanov D, Yang G-Z. Three-dimensional tissue deformation recovery and tracking. SPM. 2010.
  • [26] Godard C, Mac Aodha O, Firman M, Brostow GJ. Digging into self-supervised monocular depth estimation. ICCV. 2019.
  • [27] Yamaguchi K, McAllester D, Urtasun R. Efficient joint segmentation, occlusion labeling, stereo and flow estimation. ECCV. 2014.
  • [28] Geiger A, Roser M, Urtasun R. Efficient large-scale stereo matching. ACCV. 2010.
  • [29] Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A. Automatic differentiation in PyTorch. 2017.
  • [30] Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.
