Author manuscript; available in PMC 2022 Sep 21.
Published in final edited form as: IEEE Trans Med Robot Bionics. 2022 May;4(2):331–334. doi: 10.1109/TMRB.2022.3170206

Self-supervised Monocular Depth Estimation with 3D Displacement Module for Laparoscopic Images

Chi Xu 1, Baoru Huang 1, Daniel S Elson 1
PMCID: PMC7613618  EMSID: EMS153973  PMID: 36148138

Abstract

We present a novel self-supervised training framework with a 3D displacement (3DD) module for accurately estimating per-pixel depth maps from single laparoscopic images. Several recent self-supervised monocular depth estimation models have achieved good results on the KITTI dataset under the hypothesis that the camera moves while the objects are stationary. However, this hypothesis is often reversed in the surgical setting (the laparoscope is stationary while the surgical instruments and tissues move). Therefore, a 3DD module is proposed to establish the relation between frames instead of ego-motion estimation. In the 3DD module, a convolutional neural network (CNN) analyses source and target frames to predict the 3D displacement of the 3D point cloud from the target frame to the source frame in camera coordinates. Since it is difficult to constrain the depth displacement from two 2D images, a novel depth consistency module is proposed to maintain consistency between the displacement-updated depth and the model-estimated depth, constraining the 3D displacement effectively. Our method achieves strong monocular depth estimation performance on the Hamlyn surgical dataset and on acquired ground truth depth maps, outperforming the monodepth, monodepth2 and packnet models.

Keywords: Deep learning, self-supervised learning, CNN, 3D displacement, monocular depth estimation

I. Introduction

MINIMALLY invasive surgery (MIS) is widely applied in general surgery because it reduces trauma for patients [1]–[3]. Compared with traditional open surgery, MIS provides visualization of in vivo environments via laparoscopic vision. Since 2D laparoscopic images lack the depth information that natural 3D human vision provides for perception and decision making, it may be useful to estimate accurate per-pixel depth maps from these images to reconstruct tissue surfaces and internal scenes precisely. This would not only provide the surgeon with a realistic surgical experience but also allow other image guidance technologies to be seamlessly incorporated into the procedure. Although depth can also be estimated from stereo laparoscope images using various methods, stereo laparoscopes are only available for certain procedures and locations, and monocular endoscopes remain more popular [4].

Many monocular depth estimation methods have been proposed, including monocular feature-based methods (e.g. Structure from Motion (SfM) [5]), supervised learning and self-supervised learning. Monocular feature-based methods use conventional feature extractors [6], [7] and feature matching to infer the ego-motion matrix between frames and the depth map, but effective feature matching is difficult [4], [8] in some laparoscopic images because of the low number of texture features.

Deep learning algorithms may be more robust in in vivo environments, and self-supervised learning models have become popular because ground truth depth maps are challenging to acquire in real-world settings, especially in MIS. Self-supervised approaches use synthetic target images as the supervisory signal to train the depth estimation model. Stereo training and monocular training are the two existing frameworks for self-supervised monocular depth estimation. Monocular training uses an additional network to predict the ego-motion matrix between frames and synthesizes target images from adjacent frames under the hypothesis of a moving camera and a static scene [4], [8]–[12]. Stereo training exploits the geometric relation between rectified stereo images to infer a dense stereo disparity map, whereby the left/right image can be reconstructed by horizontally shifting pixels of the right/left image [13], [14]. In surgery, the laparoscope is often static while the scene is dynamic (moving surgical instruments and deforming tissues), leading to a near-identity ego-motion matrix and many stationary pixels. It is therefore necessary to introduce a new module that establishes a relationship between frames in monocular training while using the stereo view synthesis of stereo training for image synthesis.

In this paper, we propose a new self-supervised learning framework for monocular depth estimation in laparoscopic imaging. Three contributions are made: (1) a 3-branch Siamese network was designed to enhance the interaction between adjacent frames during training, improving the performance of the depth estimation model; (2) a 3DD module was formulated to estimate the per-pixel 3D displacement map of 3D point clouds between adjacent frames, establishing a novel relationship between adjacent frames; this module replaces the conventional ego-motion module and matches the surgical scenario well; (3) the depth consistency loss and monocular appearance loss were used to train the 3DD network (3DD-Net).

II. Methods

The overall training framework is depicted in Fig. 1 and several key ideas are introduced in the following sections together with the three loss functions for training the depth estimation model and 3DD-Net.

Fig. 1. Framework architecture.

The ResNet-18 [15] encoder is pre-trained. The dark blue arrow indicates bilinear interpolation from multi-scale outputs to the original scale. The coloured lines indicate the correspondence between outputs and loss functions (red for l_ap, blue for l_d, green for l_s).

Framework Architecture

A 3-branch Siamese network, composed of three identical weight-sharing auto-encoder networks corresponding to the target frame (I_t) and the source frames (I_{t'}: I_{t-1} and I_{t+1}) respectively, was used to predict dense depth maps (simplified in Fig. 1). At test time, a single auto-encoder from the 3-branch Siamese network was used. The 3DD-Net took a stack of I_t and I_{t'} as input and produced a 3-channel tensor with the same spatial dimensions as the input. This output is called the 3D displacement map and describes the displacement in the x, y and z directions of each 3D point between two adjacent frames. To generate a self-supervisory signal, the 3DD module (described in the next section) and view synthesis were used to describe the 3D displacement of the 3D point cloud between adjacent frames and to reconstruct the target frame, respectively. Finally, three loss functions were used to train the depth estimation model and the 3DD-Net (details below).
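As an illustration of this architecture (not the authors' released code), a minimal PyTorch sketch is given below. The class names, the toy decoder and the 192 × 384 input size (chosen to be divisible by 32) are our assumptions; only the weight sharing across the three branches and the 3-channel displacement output follow the description above.

import torch
import torch.nn as nn
import torchvision.models as models

class DepthNet(nn.Module):
    """Toy ResNet-18 encoder-decoder standing in for the shared auto-encoder."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18()
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])   # (B, 512, H/32, W/32)
        self.decoder = nn.Sequential(
            nn.Conv2d(512, 64, 3, padding=1), nn.ELU(),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, 3, padding=1), nn.Sigmoid())                # per-pixel disparity/depth

    def forward(self, img):
        return self.decoder(self.encoder(img))

class DisplacementNet(nn.Module):
    """3DD-Net sketch: stacked target + source frames -> (dx, dy, dz) per pixel."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ELU(),
            nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, target, source):
        return self.net(torch.cat([target, source], dim=1))

depth_net, dd_net = DepthNet(), DisplacementNet()
frames = [torch.rand(2, 3, 192, 384) for _ in range(3)]     # I_{t-1}, I_t, I_{t+1}
depths = [depth_net(f) for f in frames]                      # one shared network = Siamese branches
disp_map = dd_net(frames[1], frames[2])                       # (B, 3, 192, 384) 3D displacement map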

3D Displacement Module:

The inputs to the 3DD module were the predicted depth map of the source frame (D_{t'}), the 3D point cloud of the target frame (P_t) and the 3DD map (DM_{t→t'}). The 3DD module not only modified the 3D point cloud for stereo view synthesis, but also generated a depth consistency loss and a monocular appearance loss that enable the 3DD-Net to learn and constrain the 3D displacement. As shown in Fig. 2, DM_{t→t'} moved the 3D point cloud from the target frame to the source frame, yielding a depth map from displacement (D̄_{t'}). The sampled depth map (D̃_{t'}) was generated by the following formula:

\tilde{D}_{t'} = D_{t'} \left\langle \mathrm{proj}\left(P_t,\, DM_{t \to t'},\, K\right) \right\rangle \qquad (1)
Fig. 2. The 3DD module architecture.

The orange and purple lines represent the inputs and outputs respectively.

Here D_{t'} is the predicted depth map of the source frame; proj(·) is the 2D projection function that generates the 2D sampling coordinates; K is the pre-computed camera intrinsic matrix; ⟨·⟩ is the sampling operation. D̄_{t'} and D̃_{t'} are compared to limit the displacement in the z direction and maintain depth consistency between adjacent frames. The 2D sampling coordinates were also applied to monocular view synthesis, and the synthetic image constrained the 2D displacement in the image plane (x and y dimensions) through the monocular appearance loss.
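A minimal PyTorch sketch of the warp in Eq. (1) is given below, assuming depth tensors of shape (B, 1, H, W) and displacement maps of shape (B, 3, H, W); the helper names (backproject, project_to_grid) and the use of grid_sample for the sampling operation ⟨·⟩ are our assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift per-pixel depth to a 3D point cloud P_t in camera coordinates, shape (B, 3, H*W)."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()      # (3, H, W)
    rays = K_inv @ pix.view(3, -1)                                        # (3, H*W)
    return depth.view(b, 1, -1) * rays.unsqueeze(0)

def project_to_grid(points, K, h, w):
    """Project 3D points with K and normalise pixel coordinates to [-1, 1] for grid_sample."""
    cam = K.unsqueeze(0) @ points                                         # (B, 3, H*W)
    uv = cam[:, :2] / cam[:, 2:3].clamp(min=1e-6)
    grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1,
                        2 * uv[:, 1] / (h - 1) - 1], dim=-1)              # (B, H*W, 2)
    return grid.view(-1, h, w, 2)

def displacement_module(D_src, P_t, disp_map, K):
    """Eq. (1): move P_t by the 3DD map, read off D_bar, and sample D_tilde from the source depth."""
    b, _, h, w = D_src.shape
    P_moved = P_t + disp_map.view(b, 3, -1)          # point cloud moved to the source frame
    D_bar = P_moved[:, 2].view(b, 1, h, w)           # depth implied by the displacement
    grid = project_to_grid(P_moved, K, h, w)
    D_tilde = F.grid_sample(D_src, grid, align_corners=True)   # sampled source depth
    return D_bar, D_tilde, grid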

View Synthesis:

In our framework, there are two view synthesis processes: monocular and stereo. Monocular view synthesis reconstructs images in the same camera coordinates, and the reconstructed images are used to constrain the displacement in the x and y dimensions. Because the laparoscope remains stationary, most pixels captured by the camera show no disparity between adjacent frames, which makes depth difficult to learn from the appearance loss during training and results in a large number of infinite depth predictions during testing. To address this, stereo view synthesis sampled pixel values from the other image of the stereo pair during training. The two view synthesis operations are described in formulas (2) and (3).

I_{t' \to t, m} = I_{t', m} \left\langle \mathrm{proj}\left(P_t,\, DM_{t \to t'},\, K\right) \right\rangle \qquad (2)
I_{t' \to t, s} = I_{t', s} \left\langle \mathrm{proj}\left(P_t,\, DM_{t \to t'},\, T_s,\, K\right) \right\rangle \qquad (3)

where I_{t'→t,m} and I_{t'→t,s} are the target images reconstructed by monocular and stereo view synthesis respectively; I_{t',m} and I_{t',s} are the captured source images for monocular and stereo view synthesis respectively; and T_s is the pre-computed extrinsic matrix, which changes the coordinates of the 3D point cloud for stereo view synthesis (details in Fig. 2).
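A corresponding sketch of the two warps in Eqs. (2) and (3) is shown below, again using grid_sample for the sampling operation; treating T_s as a 4 × 4 homogeneous transform and the helper name normalised_grid are our assumptions.

import torch
import torch.nn.functional as F

def normalised_grid(points, K, h, w):
    """Project 3D points with K and normalise pixel coordinates to [-1, 1] for grid_sample."""
    cam = K.unsqueeze(0) @ points                               # (B, 3, H*W)
    uv = cam[:, :2] / cam[:, 2:3].clamp(min=1e-6)
    grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1,
                        2 * uv[:, 1] / (h - 1) - 1], dim=-1)
    return grid.view(-1, h, w, 2)

def synthesize_monocular(I_src, P_t, disp_map, K, h, w):
    """Eq. (2): warp the temporal source image using the displaced point cloud."""
    b = I_src.shape[0]
    grid = normalised_grid(P_t + disp_map.view(b, 3, -1), K, h, w)
    return F.grid_sample(I_src, grid, align_corners=True)

def synthesize_stereo(I_stereo, P_t, disp_map, K, T_s, h, w):
    """Eq. (3): additionally apply the stereo extrinsic T_s (4x4) before projecting."""
    b = I_stereo.shape[0]
    P = P_t + disp_map.view(b, 3, -1)
    P_h = torch.cat([P, torch.ones_like(P[:, :1])], dim=1)      # homogeneous coordinates (B, 4, N)
    P_s = (T_s.unsqueeze(0) @ P_h)[:, :3]                        # into the other stereo camera's frame
    grid = normalised_grid(P_s, K, h, w)
    return F.grid_sample(I_stereo, grid, align_corners=True)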

View-field Mask:

For stereo view synthesis, pixels from the leftmost region of the left image were not sampled because they were out of view for the right camera. The appearance loss from such regions causes degradation [10] and must be masked. The view-field mask also prevented the 3DD-Net from generating abnormal displacement in the z direction. Therefore, a mask excluding such regions was generated from the 2D sampling coordinates of stereo view synthesis [9], [10], [16]. Once the depth estimation model predicted the depth map of the input frame (D_t), P_t was generated by multiplying by K^{-1}. The reference coordinates of P_t were then changed from the target frame to the source frame by T_s. Finally, the 2D sampling coordinates were computed by multiplying by K. In these coordinates, effective pixels lie between -1 and 1, and other values represent pixels outside the field of view. Therefore, the mask (M) was generated by the following formulae:

\mathrm{coord} = \mathrm{proj}\left(D_t,\, K^{-1},\, T_s,\, K\right) \qquad (4)
M = \begin{cases} 1 & \text{if } \mathrm{coord}(i, j, :) \in [-1, 1] \\ 0 & \text{otherwise} \end{cases} \qquad (5)
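A minimal sketch of the mask in Eqs. (4) and (5), assuming the stereo sampling coordinates have already been normalised to [-1, 1] as required by grid_sample:

import torch

def view_field_mask(grid):
    """grid: (B, H, W, 2) normalised stereo sampling coordinates; returns (B, 1, H, W) mask."""
    inside = (grid >= -1.0) & (grid <= 1.0)        # per coordinate component
    mask = inside.all(dim=-1)                       # both u and v must be in range
    return mask.unsqueeze(1).float()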

Loss Function

In this section, we describe the loss functions used to retain depth consistency between adjacent frames and to constrain the 3D displacement efficiently.

Appearance Loss:

The appearance loss was composed of a monocular and a stereo appearance loss. Both combined the SSIM loss [17] and the L1 loss [18] in a fixed proportion [13]. The monocular appearance loss compared the target image with the image reconstructed by monocular view synthesis, constraining the displacement in the x and y dimensions. Inspired by [11], the per-pixel minimum loss was applied to the monocular appearance loss to better handle occlusions caused by moving surgical instruments. The stereo appearance loss compared the target and reconstructed images, enabling the depth estimation model to learn depth from frames and stereo pairs. In this framework, Î_{t'→t} and I_t were images containing only the overlapping regions of the stereo pair (masked by M), and α was set to 0.85.

l_{ap}^{s}(t') = \frac{\alpha}{2}\left(1 - \mathrm{SSIM}\left(\hat{I}_{t' \to t, s},\, I_t\right)\right) + (1 - \alpha)\left\lVert \hat{I}_{t' \to t, s} - I_t \right\rVert_1 \qquad (6)
l_{ap}^{m}(t') = \min_{t'}\left[\frac{\alpha}{2}\left(1 - \mathrm{SSIM}\left(\hat{I}_{t' \to t, m},\, I_t\right)\right) + (1 - \alpha)\left\lVert \hat{I}_{t' \to t, m} - I_t \right\rVert_1\right] \qquad (7)
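The sketch below illustrates Eqs. (6) and (7) with α = 0.85; the 3 × 3 average-pooling SSIM follows common self-supervised depth practice and, like the omission of the mask M and of the final mean reduction, is our assumption rather than the paper's exact implementation.

import torch
import torch.nn.functional as F

def dssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Per-pixel structural dissimilarity (1 - SSIM) / 2 using 3x3 average pooling."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sig_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sig_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sig_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sig_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sig_x + sig_y + C2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric(pred, target, alpha=0.85):
    """Eq. (6) per pixel: alpha/2 * (1 - SSIM) + (1 - alpha) * L1."""
    l1 = (pred - target).abs().mean(1, keepdim=True)
    return alpha * dssim(pred, target).mean(1, keepdim=True) + (1 - alpha) * l1

def monocular_appearance_loss(preds, target):
    """Eq. (7): per-pixel minimum over the reconstructions from t-1 and t+1."""
    losses = torch.stack([photometric(p, target) for p in preds], dim=0)
    return losses.min(dim=0).values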

Depth Consistency Loss:

Inspired by Godard et al.'s left-right consistency [13], we propose a depth consistency loss to constrain the displacement in the z dimension and maintain depth consistency between frames. To balance the contributions of the depth consistency loss and the appearance loss, a normalized loss was used [8].

l_d(t') = \frac{\sum\left(M\left(\bar{D}_{t'} - \tilde{D}_{t'}\right)^2\right)}{\sum\left(M\left(\bar{D}_{t'}^{\,2} + \tilde{D}_{t'}^{\,2}\right)\right)} \qquad (8)
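A direct sketch of Eq. (8), where D̄ is the displacement-implied depth, D̃ the sampled source depth and M the view-field mask; the small epsilon for numerical stability is our addition.

import torch

def depth_consistency_loss(D_bar, D_tilde, M, eps=1e-7):
    """Normalised squared-difference consistency between the two depth estimates, Eq. (8)."""
    num = torch.sum(M * (D_bar - D_tilde) ** 2)
    den = torch.sum(M * (D_bar ** 2 + D_tilde ** 2)) + eps
    return num / den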

Edge-aware Smoothness Loss:

The edge-aware smoothness loss is widely used in depth estimation loss functions to suppress noisy depth values while preserving values at image edges.

l_s = \left|\partial_x d_t^{*}\right| e^{-\left|\partial_x I_t\right|} + \left|\partial_y d_t^{*}\right| e^{-\left|\partial_y I_t\right|} \qquad (9)
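A sketch of Eq. (9); normalising the disparity by its mean (d*) follows common practice in self-supervised depth estimation and is our assumption here.

import torch

def edge_aware_smoothness(disp, img):
    """Eq. (9): disparity gradients weighted down where the image itself has strong edges."""
    d = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)              # mean-normalised d*
    dx_d = (d[:, :, :, 1:] - d[:, :, :, :-1]).abs()
    dy_d = (d[:, :, 1:, :] - d[:, :, :-1, :]).abs()
    dx_i = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()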

Overall Loss:

The overall loss is shown in equation (10), where F is the set of source frames {t-1, t+1, s}. Because the depth consistency loss was added to the loss function, the weight λ for the smoothness loss was increased to 0.002 to retain its contribution.

L = \frac{\sum_{t' \in F} l_{ap}^{s}(t') + l_{ap}^{m}}{4} + \frac{\sum_{t' \in \{t-1,\, t+1\}} l_d(t')}{2} + \lambda\, l_s \qquad (10)
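One reading of Eq. (10) as code, assuming the division by 4 averages the three stereo appearance terms (one per source frame in F) plus the single per-pixel-minimum monocular term; this grouping is our interpretation of the formula, with λ = 0.002 as stated above.

def total_loss(stereo_losses, mono_loss, consistency_losses, smooth_loss, lam=0.002):
    """stereo_losses: [l_ap^s(t') for t' in F]; consistency_losses: [l_d(t-1), l_d(t+1)]."""
    appearance = (sum(stereo_losses) + mono_loss) / 4.0
    consistency = sum(consistency_losses) / 2.0
    return appearance + consistency + lam * smooth_loss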

III. Experimental Setup and Results

Experiment Setup

The Hamlyn surgical dataset [19] was used to train and evaluate the models. It contains 192 × 382 rectified stereo laparoscopic image pairs, with 34240 pairs for training and 7191 pairs for validation. The stereo camera had the same intrinsic matrix for all laparoscopic images, and because the stereo pairs were rectified, the extrinsic matrix between the stereo cameras was a horizontal translation with a fixed baseline. Furthermore, to make the experiments more convincing, 100 laparoscopic images with ground truth depth maps acquired by projected gray-code structured lighting were collected and analysed, as shown in Fig. 4 [20].

Fig. 4. Ground truth depth maps acquired with a da Vinci (Intuitive Inc.) stereo laparoscope and a projected gray-code structured light pattern [20].

The model was implemented in PyTorch [21]. For training, the Adam optimizer was used for 15 epochs with a batch size of 12 and a learning rate of 0.0001. Training took 12 hours on a single 16 GB NVIDIA Tesla P100. The monodepth [13], monodepth2 [11] and packnet [12] models were implemented for comparison. Because monodepth2 achieved the same performance under monocular and stereo training, only stereo training was applied to monodepth2 [11] and packnet [12] to make the comparison fair.
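A minimal training-loop sketch matching the stated hyper-parameters (Adam, 15 epochs, batch size 12, learning rate 1e-4); the dataset, batch format and loss callable are placeholders, not the authors' code.

import torch
from torch.utils.data import DataLoader

def train(depth_net, dd_net, dataset, compute_loss, device="cuda"):
    loader = DataLoader(dataset, batch_size=12, shuffle=True, num_workers=4)
    params = list(depth_net.parameters()) + list(dd_net.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    for epoch in range(15):
        for batch in loader:
            # assumes each batch is a dict of tensors (frames, stereo pair, intrinsics, ...)
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = compute_loss(depth_net, dd_net, batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()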

Comparison Study

In this section, three existing self-supervised models were implemented for comparison. SSIM on the Hamlyn surgical dataset [19] and the standard metrics on the acquired ground truth depth maps were taken as criteria for evaluating the predicted depth maps, as shown in Table I. A qualitative comparison between models is shown in Fig. 3.

Table I. Evaluation Based on SSIM and Ground Truth.

Method μ_SSIM σ_SSIM μ_ABE σ_ABE Abs Rel Sq Rel RMSE RMSE log δ < 1.25 δ < 1.25² δ < 1.25³ Params
Monodepth 0.5843 0.1140 5.741 3.631 0.295 95.195 56.887 0.320 0.589 0.875 0.962 20M
PackNet 0.7380 0.0698 3.893 1.584 0.194 3.857 14.507 0.311 0.660 0.922 0.975 120M
Monodepth2 0.7199 0.0826 3.152 1.009 0.160 2.281 11.345 0.206 0.730 0.976 0.997 14M
Ours 0.7421 0.0641 2.684 0.913 0.136 1.758 9.829 0.165 0.818 0.991 1.000 14M
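For reference, the error and accuracy metrics in Table I (Abs Rel, Sq Rel, RMSE, RMSE log and the δ thresholds) are the standard monocular depth evaluation measures and can be computed as in the sketch below; any scale alignment of the predictions is omitted and left as an assumption.

import numpy as np

def depth_metrics(gt, pred):
    """Standard depth metrics on pixels with valid (positive) ground truth."""
    valid = gt > 0
    gt, pred = gt[valid], pred[valid]
    thresh = np.maximum(gt / pred, pred / gt)
    d1, d2, d3 = [(thresh < 1.25 ** k).mean() for k in (1, 2, 3)]
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log, d1, d2, d3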

Fig. 3. Qualitative comparison between our method, packnet [12], monodepth2 [11] and monodepth [13]. The first column contains example test images; the other columns show the corresponding disparity maps.

For testing, we selected one branch of the 3-branch Siamese network as the depth estimation model. Compared with the other models, our model performed better and had the fewest parameters. The qualitative comparison shows that artifacts were significantly reduced by our model, including border artifacts (appearing in regions of the source frames that are not visible in both images) and texture-copy artifacts (caused by texture incorrectly transferred from the input images). The border artifacts were mainly removed by view-field masking (Fig. 5) and the texture-copy artifacts were reduced by the proposed depth consistency loss (Fig. 3).

Fig. 5. The effect of view-field masking is shown in red boxes.

Ablation Study

To study the contributions of the 3DD module and the depth consistency loss (DCL), an ablation study on the acquired ground truth depth maps was conducted, with monodepth2 as the baseline model. As shown in Table II, training the baseline model with the 3DD module improved performance, and replacing the baseline with the 3-branch Siamese network trained with the DCL yielded a significant further improvement.

Table II. Ablation Study.

Method 3DD Module DCL Abs Rel Sq Rel RMSE
Baseline – – 0.160 2.281 11.345
Baseline ✓ – 0.154 2.137 10.992
Siamese-net ✓ ✓ 0.136 1.758 9.829

IV. Conclusion

We proposed a novel self-supervised framework for monocular depth estimation, achieving state-of-the-art performance both on the Hamlyn surgical dataset [19] and on a newly acquired dataset with ground truth. We replaced the conventional single auto-encoder network with a 3-branch Siamese network for training, reinforcing the interaction between adjacent frames. The 3DD module further improved model performance via the depth consistency and monocular appearance losses.

Acknowledgments

This work was carried out with support from the UK National Institute for Health Research (NIHR) Invention for Innovation Award NIHR200035, the Cancer Research UK Imperial Centre and the NIHR Imperial Biomedical Research Centre.

Contributor Information

Chi Xu, Email: chi.xu20@imperial.ac.uk.

Baoru Huang, Email: baoru.huang18@imperial.ac.uk.

Daniel S. Elson, Email: daniel.elson@imperial.ac.uk.

References

[1] Zhang K. Minimally invasive surgery. Endoscopy. 2002.
[2] Zhang V, Melis M, Amato B, Bianco T, Rocca A, Amato M, Quarto G, Benassai G. Minimally invasive radioguided parathyroid surgery: A literature review. IJS. 2016. doi: 10.1016/j.ijsu.2015.12.037.
[3] Westebring-van der Putten EP, Goossens RH, Jakimowicz JJ, Dankelman J. Haptics in minimally invasive surgery – a review. Minimally Invasive Therapy & Allied Technologies. 2008. doi: 10.1080/13645700701820242.
[4] Zhang L, Li X, Yang S, Ding S, Jolfaei A, Zheng X. Unsupervised learning-based continuous depth and motion estimation with monocular endoscopy for virtual reality minimally invasive surgery. TII. 2020.
[5] Zhang S, Sinha A, Reiter A, Ishii M, Gallia GL, Taylor RH, Hager GD. Evaluation and stability analysis of video-based navigation system for functional endoscopic sinus surgery on in vivo clinical data. TMI. 2018. doi: 10.1109/TMI.2018.2833868.
[6] Lowe DG. Distinctive image features from scale-invariant keypoints. IJCV. 2004.
[7] Rublee E, Rabaud V, Konolige K, Bradski G. ORB: An efficient alternative to SIFT or SURF. ICCV, 2011.
[8] Liu X, Sinha A, Ishii M, Hager GD, Reiter A, Taylor RH, Unberath M. Dense depth estimation in monocular endoscopy with self-supervised learning methods. TMI. 2019. doi: 10.1109/TMI.2019.2950936.
[9] Zhou T, Brown M, Snavely N, Lowe DG. Unsupervised learning of depth and ego-motion from video. CVPR, 2017, pp. 1851–1858.
[10] Mahjourian R, Wicke M, Angelova A. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. CVPR, 2018.
[11] Godard C, Mac Aodha O, Firman M, Brostow GJ. Digging into self-supervised monocular depth estimation. ICCV, 2019.
[12] Guizilini V, Ambrus R, Pillai S, Raventos A, Gaidon A. 3D packing for self-supervised monocular depth estimation. CVPR, 2020.
[13] Godard C, Mac Aodha O, Brostow GJ. Unsupervised monocular depth estimation with left-right consistency. CVPR, 2017.
[14] Huang B, Zheng J-Q, Giannarou S, Elson DS. H-Net: Unsupervised attention-based stereo depth estimation leveraging epipolar geometry. arXiv preprint arXiv:2104.11288, 2021.
[15] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. CVPR, 2016.
[16] Garg R, Bg VK, Carneiro G, Reid I. Unsupervised CNN for single view depth estimation: Geometry to the rescue. ECCV, 2016.
[17] Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. TIP. 2004. doi: 10.1109/tip.2003.819861.
[18] Zhao H, Gallo O, Frosio I, Kautz J. Is L2 a good loss function for neural networks for image processing? arXiv preprint arXiv:1511.08861, 2015.
[19] Ye M, Johns E, Handa A, Zhang L, Pratt P, Yang G-Z. Self-supervised siamese learning on stereo image pairs for depth estimation in robotic surgery. arXiv preprint arXiv:1705.08260, 2017.
[20] Scharstein D, Szeliski R. High-accuracy stereo depth maps using structured light. CVPR, 2003.
[21] Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A. Automatic differentiation in PyTorch. 2017.
