Published in final edited form as: Proc SPIE Int Soc Opt Eng. 2022 Apr 4;12034:120341U. doi: 10.1117/12.2610778

A Weakly Supervised Learning Approach for Surgical Instrument Segmentation from Laparoscopic Video Sequences

Zixin Yang a, Richard Simon b, Cristian Linte a,b

Abstract

Fully supervised learning approaches for surgical instrument segmentation from video images usually require a time-consuming process of generating accurate ground truth segmentation masks. We propose an alternative way of labeling surgical instruments for binary segmentation that commences with rough, scribble-like annotations of the surgical instruments made with a disc-shaped brush. We then present a framework that starts with a graph-model-based method for generating initial segmentation labels from the user-annotated paint-brush scribbles and then proceeds with a deep learning model that learns from these noisy initial segmentation labels. Experiments conducted on the 2017 MICCAI EndoVis Robotic Instrument Segmentation Challenge show that the proposed framework achieves a 76.82% IoU and an 85.70% Dice score on binary instrument segmentation. Based on these metrics, the proposed method outperforms other weakly supervised techniques and achieves performance close to that of fully supervised networks, while eliminating the need for ground truth segmentation masks.

Keywords: Weakly supervised segmentation, learning with noisy labels, surgical instrument segmentation

1. INTRODUCTION

Supervised deep learning approaches have produced impressive results in surgical instrument segmentation. However, densely annotating a large number of video frames pixel by pixel is time-consuming and costly. Alternatively, weakly supervised approaches use limited or imprecisely labeled data for training. The use of weak labels is intended to reduce the effort required for hand-labeling data.

In this study, we propose an alternative way of annotating surgical instruments for semantic segmentation (Fig. 1(b)). We evaluate whether we can start from such annotated hard constraints to generate initial segmentation labels (Fig. 1(c)), which then serve as weak labels used to train a deep learning network that yields binary segmentation predictions of the surgical tools (Fig. 1(d)). Such simple, scribble-like paint-brush strokes are a much more attractive form of user input than traditional, accurate ground truth annotations, and pose a significant benefit when accurate ground truth annotations are not available, which is often the case. If this weakly supervised approach can provide sufficiently accurate segmentation results, it will be an attractive alternative to deep learning-based segmentation methods1-4 that rely on accurate segmentation masks.

Figure 1. Weakly supervised surgical instrument segmentation based on an input image (a) and a given weak label (b). The figure panels describe the following: a) input image; b) hard constraints created by a disc-shaped brush stroke (thick gray trace), whereas the thin white trace indicates the trajectory described by the center of the disc-shaped brush (i.e., the center-line of the thick gray brush stroke); c) initial segmentation label of the surgical instrument generated based on the weak label using graph cut and super-pixel classification; d) surgical instrument segmentation mask predicted by our proposed weakly supervised framework.

Our main contributions are as follows: 1) we propose an alternative way of annotating surgical instruments that is simple, straightforward, and provides rich information; 2) we propose a weakly supervised segmentation framework consisting of initial segmentation label generation followed by learning from these noisy initial segmentation labels; and 3) we demonstrate that the proposed method outperforms other state-of-the-art weakly supervised methods and exhibits performance close to that of fully supervised methods.

2. METHODOLOGY

We begin by introducing our alternative approach for partially annotating images (Fig. 1(b)). Subsequently, we describe our segmentation method, which first uses super-pixels5 and graph cut6 to generate initial segmentation labels (Fig. 1(c)) from the annotated hard constraints (Fig. 1(b)), and then trains a convolutional neural network (CNN) on these noisy initial segmentation labels (weak labels) in two training stages: a warm-up stage and a stable stage (Fig. 2).

Figure 2. Illustration of the proposed method comprising two stages: a warm-up stage and a stable stage. In the first stage, the network is warmed up by training on the pseudo labels. After γp epochs, the network is stabilized using self-generated moving average predictions filtered with the hard constraints. The moving average calculation begins after γm epochs (γm < γp).

2.1. Hard Constraints

Our proposed method relies on hard constraints (partial annotations) of surgical instruments that can be easily generated with "paint-brush"-like strokes along the surgical instruments using a disc-shaped brush, as seen in Fig. 1(b). The brush roughly moves along the center-lines of the instruments until the brushed areas encompass the region containing the surgical instruments; a disc-shaped brush of the same size is used to cover all surgical instruments within a frame. We argue that identifying such hard constraints of foreground objects with our proposed approach takes significantly less time than the traditional approach of scribbling foreground and background regions separately, which also requires labeling more of the background. Brush trajectories annotate foreground pixels (i.e., the surgical instruments), while the remaining un-brushed regions denote background pixels. Lastly, brushed pixels that lie outside the brush trajectories are marked as unknown.
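
To make the three resulting label regions concrete, the sketch below shows one possible encoding of a paint-brush annotation as a per-pixel constraint map; the label codes and the function name are illustrative and not part of the paper.

```python
import numpy as np

# Label codes used throughout this sketch (illustrative, not from the paper).
BACKGROUND, FOREGROUND, UNKNOWN = 0, 1, 255

def encode_hard_constraints(trajectory_mask, brushed_mask):
    """Build a per-pixel hard-constraint map from a paint-brush annotation.

    trajectory_mask : bool array, True on the brush center-line (thin white trace).
    brushed_mask    : bool array, True everywhere the disc-shaped brush touched.
    """
    labels = np.full(brushed_mask.shape, BACKGROUND, dtype=np.uint8)
    labels[brushed_mask] = UNKNOWN          # brushed but off the center-line: unknown
    labels[trajectory_mask] = FOREGROUND    # brush trajectory: definitely instrument
    return labels
```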

2.2. Initial Segmentation Label Generation from Hard Constraints

Directly training a deep neural network (DNN) on the hard constraints may lead to poor segmentation results, as the foreground objects have not been fully identified. Motivated by Lin et al.,7 we apply a graph cut6 method to the super-pixels generated by simple linear iterative clustering (SLIC)5 to produce the initial segmentation labels subsequently used to train a DNN. Each super-pixel is represented by a vertex in the graph, and the similarity between two super-pixels is represented by an edge. We use hand-crafted features based on cumulative normalized color histograms to represent the attributes of the super-pixels. The histograms are built on the L and A channels of the CIELAB color space and the S channel of the HSV color space, using n bins for each channel, and the normalized histograms are concatenated to form the feature vector. The Kullback–Leibler divergence (KLDiv) is used to measure the similarity between these hand-crafted features. The graph cut method6 assigns prediction labels to the super-pixels of unknown regions by measuring their similarities to the foreground and background super-pixels.
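
As a rough illustration of the feature extraction step, the following sketch builds SLIC super-pixels and per-super-pixel cumulative normalized histograms over the L and A channels of CIELAB and the S channel of HSV, and compares them with a symmetric KL divergence. The symmetrization, the normalization details, and the function names are assumptions, not the paper's exact implementation.

```python
import numpy as np
from skimage.color import rgb2hsv, rgb2lab
from skimage.segmentation import slic

def superpixel_features(image_rgb, n_segments=2000, n_bins=20):
    """Per-superpixel cumulative normalized histograms over L, A (CIELAB) and S (HSV).

    Returns the SLIC label map and an (n_superpixels, 3 * n_bins) feature matrix.
    """
    segments = slic(image_rgb, n_segments=n_segments, start_label=0)
    lab, hsv = rgb2lab(image_rgb), rgb2hsv(image_rgb)
    # Channels used for the hand-crafted features: L, A from CIELAB and S from HSV.
    channels = [lab[..., 0], lab[..., 1], hsv[..., 1]]
    ranges = [(ch.min(), ch.max()) for ch in channels]

    features = []
    for sp in range(segments.max() + 1):
        mask = segments == sp
        feat = []
        for ch, rng in zip(channels, ranges):
            hist, _ = np.histogram(ch[mask], bins=n_bins, range=rng)
            cum = np.cumsum(hist).astype(float)
            feat.append(cum / (cum[-1] + 1e-8))   # cumulative, normalized to [0, 1]
        features.append(np.concatenate(feat))
    return segments, np.asarray(features)

def kl_similarity(f1, f2, eps=1e-8):
    """Symmetric KL divergence between two non-negative feature vectors."""
    p, q = f1 + eps, f2 + eps
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```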

2.3. Learning from Noisy Initial Segmentation Labels

Initial segmentation labels generated from the graph-based methods provide rich knowledge for training a deep learning network; however, the noise associated with the initial segmentation labels impedes learning, as incorrect labels lead to inconsistent predictions of the network over different training epochs. To handle this limitation, we divide our training process into a warm-up stage and a stabilization stage.

Warm-up stage.

We use the standard binary cross-entropy loss to train the network on the initial segmentation labels for γp epochs:

L_1 = -\frac{1}{N} \sum_{i}^{N} \left( P_i \log(p_i) + (1 - P_i) \log(1 - p_i) \right), \quad (1)

where N is the number of pixels in a training image, p_i is the predicted probability at pixel i, and P_i is the initial segmentation label at pixel i. In the beginning, DNNs learn from simple samples and eventually adapt to more difficult ones during training.8 Over multiple training epochs, the network's predictions are therefore likely to remain constant on correctly labeled pixels and to be inconsistent, or strongly oscillating, on incorrectly labeled pixels. After γm epochs (γm < γp), we begin to calculate moving averages of the predictions, which are used to filter out incorrectly labeled pixels, and continue until the end of the stable stage:

MA(x, j) = \alpha \, p_j + (1 - \alpha) \, MA(x, j-1), \quad (2)

whereby the moving average MA(x, j) at the current epoch j for training images x is calculated by combining the current prediction p_j and the previous average MA(x, j − 1) with momentum α. The MA, which contains less label noise, is then used in the subsequent stable stage.
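
A minimal sketch of the warm-up loss (Eq. 1) and the moving-average update (Eq. 2) is shown below, assuming PyTorch and a network that outputs single-channel logits; the function names and the use of the logits-based BCE are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def warmup_step(model, images, initial_labels, optimizer):
    """One warm-up update on the noisy initial segmentation labels (Eq. 1).

    initial_labels: float tensor of shape (B, 1, H, W) with values in {0, 1}.
    """
    logits = model(images)                                      # (B, 1, H, W)
    loss = F.binary_cross_entropy_with_logits(logits, initial_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

@torch.no_grad()
def update_moving_average(ma, model, images, alpha=0.2):
    """Exponential moving average of per-pixel predictions (Eq. 2).

    ma is the running average MA(x, j-1) for this batch of training images;
    it is initialized with the first prediction once epoch gamma_m is reached.
    """
    p_j = torch.sigmoid(model(images))
    return p_j if ma is None else alpha * p_j + (1.0 - alpha) * ma
```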

Stabilization stage.

We utilize the moving-average predictions filtered with the hard constraints, P_MA, to train the network with binary cross-entropy:

L_2 = -\frac{1}{N} \sum_{i}^{N} \left( P_{MA_i} \log(p_i) + (1 - P_{MA_i}) \log(1 - p_i) \right). \quad (3)
P_{MA} = \left[ \left( MA(x, j-1) \, M_{bg} \right) M_{fg} > \tau \right]. \quad (4)

Here, Eq. 4 forces the pre-computed MA to be aligned with the annotated hard constraints. M_bg is the background mask derived from the hard constraints, in which background pixels are labeled 0 and all other pixels are labeled 1. M_fg is the foreground mask derived from the hard constraints, in which foreground pixels are labeled 1 and all other pixels are labeled 0. [·] is the Iverson bracket, taking the value 1 if the statement inside the bracket is true and 0 otherwise, and τ is the ensemble threshold.
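
The sketch below illustrates one plausible reading of the stabilization stage (Eqs. 3 and 4), assuming PyTorch: annotated background pixels are forced to 0, annotated foreground pixels are forced to 1, and the result is binarized with the ensemble threshold τ before being used as the training target. The exact way MA is combined with M_bg and M_fg in Eq. 4 is interpreted here and should be treated as an assumption.

```python
import torch
import torch.nn.functional as F

def filter_with_hard_constraints(ma, m_bg, m_fg, tau=0.5):
    """Align the moving-average prediction with the hard constraints (Eq. 4, one reading).

    m_bg: 0 on annotated background, 1 elsewhere.  m_fg: 1 on annotated foreground, 0 elsewhere.
    Annotated background is suppressed, annotated foreground is forced above the threshold.
    """
    aligned = torch.maximum(ma * m_bg, m_fg)
    return (aligned > tau).float()            # Iverson bracket: 1 if true, else 0

def stable_step(model, images, ma, m_bg, m_fg, optimizer, tau=0.5):
    """One stabilization update on the filtered pseudo labels P_MA (Eq. 3)."""
    p_ma = filter_with_hard_constraints(ma, m_bg, m_fg, tau)
    logits = model(images)
    loss = F.binary_cross_entropy_with_logits(logits, p_ma)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```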

2.4. Dataset and Evaluation Metrics

We evaluate our methods on the 2017 MICCAI EndoVis Robotic Instrument Segmentation Challenge dataset (EndoVis 2017).9 Our methods focus on the binary instrument segmentation task, i.e., separating instruments from the background. For a fair and straightforward comparison, we follow the prior evaluation convention.2 More specifically, we employ 4-fold cross-validation with the same partitions of the released training data (8 × 225 frames) and use intersection-over-union (IoU) and the Dice coefficient as evaluation metrics.
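
For reference, a minimal computation of the two evaluation metrics on a single pair of binary masks might look as follows (NumPy-based sketch; not the challenge's official evaluation code).

```python
import numpy as np

def iou_and_dice(pred, gt, eps=1e-8):
    """Binary IoU and Dice for a predicted mask vs. a ground truth mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / (union + eps)
    dice = 2.0 * inter / (pred.sum() + gt.sum() + eps)
    return iou, dice
```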

2.5. Implementation

For the purpose of this study, instead of generating the hard constraints by manually scribbling paint-brush strokes on the video frames, we automatically generated the hard constraints (i.e., synthetic paint-brush strokes) from the ground truth segmentation masks by adding simulated noise. We first compute the distance-based medial axis skeleton of each ground truth mask, then mimic manual scribbling by perturbing the skeleton with sinusoidal noise to obtain the brush trajectories:

y' = y + A_1 \sin(\omega_1 x) + A_2 \sin(\omega_2 x), \quad (5)

where x and y are the horizontal and vertical locations, respectively, along the brush trajectories, and y′ denotes the new vertical locations after adding the sine-wave noise. A1, A2 ∈ [0, 15] and ω1, ω2 ∈ [0, π/128] are random variables drawn independently for each image. Finally, the trajectories are dilated with a disk kernel of size 20. The maximum distance value, added to a random value between 140 and 150 pixels, is used to define the size of the disc-shaped brush strokes.
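
A sketch of this synthetic annotation procedure is given below, using scikit-image for the medial axis and dilation. The noise and dilation parameters follow the values reported above; interpreting the reported "size" values as dilation radii, and the coordinate handling, are assumptions.

```python
import numpy as np
from skimage.morphology import binary_dilation, disk, medial_axis

def synthetic_brush_stroke(gt_mask, rng=np.random.default_rng()):
    """Simulate a paint-brush annotation from a ground truth instrument mask (Eq. 5)."""
    skeleton, distance = medial_axis(gt_mask, return_distance=True)

    # Perturb the skeleton vertically with two random sine waves.
    ys, xs = np.nonzero(skeleton)
    a1, a2 = rng.uniform(0, 15, size=2)
    w1, w2 = rng.uniform(0, np.pi / 128, size=2)
    ys_noisy = np.clip(ys + a1 * np.sin(w1 * xs) + a2 * np.sin(w2 * xs),
                       0, gt_mask.shape[0] - 1).astype(int)

    thin = np.zeros_like(gt_mask, dtype=bool)
    thin[ys_noisy, xs] = True

    # Thin white trace of Fig. 1(b): noisy trajectory dilated with a disk of size 20.
    trajectory = binary_dilation(thin, disk(20))

    # Thick gray brushed region: disc-shaped brush whose size is the maximum
    # distance value plus a random offset between 140 and 150 pixels.
    radius = int(distance.max() + rng.uniform(140, 150))
    brushed = binary_dilation(thin, disk(radius))
    return trajectory, brushed
```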

Our network is based on DeeplabV3plus10 with a ResNet50 encoder and is trained with the Adam optimizer. The hyper-parameters used for initial segmentation label generation are: number of bins n = 20; number of super-pixels per image 2000. The hyper-parameters used for training are: image size 512 × 512; momentum α = 0.2; ensemble threshold τ = 0.5; epoch at which the MA calculation starts γm = 20; epoch at which the stable stage starts γp = 30; learning rate 10−3; batch size 8. Data augmentation includes random shifting, scaling, horizontal flip, vertical flip, and diagonal flip.
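
A possible training setup matching these settings is sketched below, assuming the segmentation_models_pytorch and albumentations packages; the paper specifies only the architecture, optimizer, and hyper-parameter values, not these libraries.

```python
import albumentations as A
import segmentation_models_pytorch as smp
import torch

# DeeplabV3plus with a ResNet50 encoder and a single-channel (binary) output.
model = smp.DeepLabV3Plus(encoder_name="resnet50", encoder_weights="imagenet",
                          in_channels=3, classes=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Augmentations approximating those listed above.
train_transform = A.Compose([
    A.Resize(512, 512),
    A.ShiftScaleRotate(rotate_limit=0, p=0.5),   # random shifting and scaling
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.Transpose(p=0.5),                          # stands in for the diagonal flip
])
```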

3. EVALUATION AND RESULTS

Quantitative and qualitative comparison results are shown in Table 1 and Figure 3, respectively. We compare our method with three fully supervised methods2,11,12 and two weakly supervised segmentation methods.13,14 Regularized Loss (rLoss)13 and Weakly Supervised Salient Object Detection (WSOD)14 do not rely on initial segmentation labels to train DNNs; instead, we train both of them directly on our hard constraints. rLoss13 integrates DenseCRF into the loss function. Zhang et al.14 add an auxiliary edge detection network to the segmentation network and use a gated structure-aware loss to encourage the structure of the prediction to be similar to that of the object in the image.

Table 1.

Quantitative comparison results in terms of IoU and Dice (mean ± std). Results of the previous fully supervised methods are as reported in Jin et al.11 Our network is DeeplabV3plus10 with a ResNet50 encoder. Statistical significance between the proposed method and the other weakly supervised methods is indicated by * (p < 0.05).

Type                   Method                           IoU (%)          Dice (%)
Fully Supervised       U-net12                          75.44 ± 18.18    84.37 ± 14.58
Fully Supervised       TernausNet2                      83.60 ± 15.83    90.01 ± 12.50
Fully Supervised       MF-TAPNet11                      87.56 ± 16.24    93.37 ± 12.93
Fully Supervised       DeeplabV3plus10                  83.84 ± 14.33    90.42 ± 10.25
Initial Segmentation   Proposed Weak Label Generation   78.89 ± 9.16     87.85 ± 6.90
Weakly Supervised      * rLoss13                        63.94 ± 12.87    77.16 ± 10.98
Weakly Supervised      * WSOD14                         53.49 ± 15.13    68.33 ± 14.07
Weakly Supervised      Proposed Method                  76.82 ± 16.01    85.70 ± 13.41

Figure 3. Qualitative comparison between our proposed method and other weakly supervised methods. rLoss and WSOD take hard constraints as input, while our proposed method takes initial labels as input.

The proposed framework achieves a 76.82% IoU and an 85.70% Dice score, which is comparable to the fully supervised U-net12 and outperforms the other weakly supervised segmentation methods. rLoss13 and WSOD14 failed to find the correct boundaries of the instruments (Figure 3). They were originally designed for scribbles of foreground and background, in which foreground-labeled and background-labeled pixels are more balanced than in our proposed weak labels, and they did not generalize well to this type of label. The differences between the IoU errors of the proposed method and those of the other state-of-the-art weakly supervised methods are statistically significant (t-test, p < 0.05).

4. CONCLUSION

Deep learning methods are powerful, but they require accurate ground truth for training; when such ground truth is not available, results tend to be inaccurate, as deep learning methods over-fit to inaccurate training labels. Hence, there is a need for segmentation methods that remain robust without relying on highly accurate ground truth labels, which are often unavailable. Alternative segmentation methods that can operate on noisy initial segmentation labels instead of expert manual ground truth annotations are therefore attractive. Here we described a method that uses weak labels to train a CNN and demonstrated improved performance over other weakly supervised methods and comparable performance to fully supervised methods. Moreover, the weak labels do not require expert manual annotations; rather, they consist of noisy initial segmentations obtained using a graph cut segmentation approach applied to rough paint-brush strokes scribbled by any user to define the hard constraints (foreground vs. background). The proposed framework outperformed other state-of-the-art weakly supervised methods and exhibited performance close to several fully supervised methods.

5. ACKNOWLEDGEMENTS

Research reported in this publication was supported in part by the National Institute of General Medical Sciences of the National Institutes of Health under Award No. R35GM128877.

REFERENCES

  • [1] Hasan SK and Linte CA, "U-netplus: A modified encoder-decoder u-net architecture for semantic and instance segmentation of surgical instruments from laparoscopic images," in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 7205–7211, IEEE (2019).
  • [2] Shvets AA, Rakhlin A, Kalinin AA, et al., "Automatic instrument segmentation in robot-assisted surgery using deep learning," in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 624–628, IEEE (2018).
  • [3] Qin F, Li Y, Su Y-H, et al., "Surgical instrument segmentation for endoscopic vision with data fusion of cnn prediction and kinematic pose," in 2019 International Conference on Robotics and Automation (ICRA), 9821–9827, IEEE (2019).
  • [4] Ni Z-L, Bian G-B, Wang G-A, et al., "Barnet: Bilinear attention network with adaptive receptive field for surgical instrument segmentation," arXiv preprint arXiv:2001.07093 (2020).
  • [5] Achanta R, Shaji A, Smith K, et al., "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Transactions on Pattern Analysis and Machine Intelligence 34(11), 2274–2282 (2012).
  • [6] Boykov Y and Kolmogorov V, "An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision," IEEE Transactions on Pattern Analysis and Machine Intelligence 26(9), 1124–1137 (2004).
  • [7] Lin D, Dai J, Jia J, et al., "ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3159–3167 (2016).
  • [8] Nguyen DT, Dax M, Mummadi CK, et al., "DeepUSPS: Deep robust unsupervised saliency prediction with self-supervision," arXiv preprint arXiv:1909.13055 (2019).
  • [9] Allan M, Shvets A, Kurmann T, et al., "2017 robotic instrument segmentation challenge," arXiv preprint arXiv:1902.06426 (2019).
  • [10] Chen L-C, Zhu Y, Papandreou G, et al., "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 801–818 (2018).
  • [11] Jin Y, Cheng K, Dou Q, et al., "Incorporating temporal prior from motion flow for instrument segmentation in minimally invasive surgery video," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 440–448, Springer (2019).
  • [12] Ronneberger O, Fischer P, and Brox T, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 234–241, Springer (2015).
  • [13] Tang M, Perazzi F, Djelouah A, et al., "On regularized losses for weakly-supervised cnn segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 507–522 (2018).
  • [14] Zhang J, Yu X, Li A, et al., "Weakly-supervised salient object detection via scribble annotations," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12546–12555 (2020).
