Author manuscript; available in PMC: 2023 Oct 1.
Published in final edited form as: Int J Comput Assist Radiol Surg. 2022 May 30;17(10):1801–1811. doi: 10.1007/s11548-022-02681-5

Video-based assessment of intraoperative surgical skill

Sanchit Hira 1, Digvijay Singh 2, Tae Soo Kim 2,3, Shobhit Gupta 4, Gregory Hager 1,2,3, Shameema Sikder 1,3,5, S Swaroop Vedula 3,*
PMCID: PMC10323985  NIHMSID: NIHMS1905171  PMID: 35635639

Abstract

Purpose:

Surgeons’ skill in the operating room is a major determinant of patient outcomes. Assessment of surgeons’ skill is necessary to improve patient outcomes and quality of care through surgical training and coaching. Methods for video-based assessment of surgical skill can provide objective and efficient tools for surgeons. Our work introduces a new method based on attention mechanisms and provides a comprehensive comparative analysis of state-of-the-art methods for video-based assessment of surgical skill in the operating room.

Methods:

Using a dataset of 99 videos of capsulorhexis, a critical step in cataract surgery, we evaluated image feature-based methods and two deep learning methods to assess skill using RGB videos. In the first method, we predict instrument tips as keypoints, and predict surgical skill using temporal convolutional neural networks. In the second method, we propose a frame-wise encoder (2D convolutional neural network) followed by a temporal model (recurrent neural network), both of which are augmented by visual attention mechanisms. We computed the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and predictive values through 5-fold cross-validation.

Results:

To classify a binary skill label (expert vs. novice), the AUC estimates for image feature-based methods ranged from 0.49 (95% confidence interval; CI=0.37 to 0.60) to 0.76 (95% CI=0.66 to 0.85). None of these methods yielded consistently high sensitivity and specificity. For the deep learning methods, the AUC was 0.79 (95% CI=0.70 to 0.88) using keypoints alone, and 0.78 (95% CI=0.69 to 0.88) and 0.75 (95% CI=0.65 to 0.85) with and without attention mechanisms, respectively.

Conclusion:

Deep learning methods are necessary for video-based assessment of surgical skill in the operating room. Attention mechanisms improved discrimination ability of the network. Our findings should be evaluated for external validity in other datasets.

Keywords: video-based assessment, surgical skill, deep learning, cataract surgery

1. Introduction

Surgeons’ skill in the operating room affects patient outcomes [5]. Interventions to optimize surgeons’ skill can improve the quality of patient care. Assessment of skill is a cornerstone of any intervention to improve it. Surgical skill assessment also serves other purposes throughout surgeons’ careers, including in-training evaluations, surgical coaching, certification, re-certification, and credentialing of surgeons, as well as end-of-career decisions [18].

Traditionally, surgical skill assessment was based upon direct observation of surgeons in the operating room [29], which is subjective and unreliable. Unlike direct observation, video-based assessment (VBA) enables asynchronous objective evaluation of surgeons’ skill, provides surgeons with formative assessments, and enables surgical coaching, among other applications [28, 24].

Currently, VBA of surgical skill in the operating room is obtained from ratings either by peer surgeons or by crowd raters (i.e., crowd-sourcing) [24]. Manual VBA of surgical skill is subjective and unreliable. Crowd-sourced skill ratings correlate with expert ratings [16, 19, 23], but some studies show low predictive value for these ratings [1, 8]. Consequently, despite the efficiency with which skill ratings can be obtained through crowd-sourcing, its role in routine assessments of surgical skill is not established. Furthermore, VBA by peers relies upon access to experienced raters and is therefore inefficient for routine use.

Surgical data science methods can provide objective and accurate VBA of surgical skill [18]. Previously, we developed and evaluated a temporal convolutional neural network (TCN) to assess skill using manually annotated instrument tips in video images [13]. Our objective in this study is to develop and validate methods for assessment of surgical skill directly from videos of the surgical field. Using a dataset of 99 videos, we evaluated algorithms to assess surgical skill with a uniform cross-validation setup and evaluation metrics. To our knowledge, this is the largest dataset in the literature for VBA of skill in the operating room. The contributions of our work include:

  1. A comprehensive evaluation of feature-based methods for VBA of intraoperative surgical skill;

  2. Methods for surgical skill assessment directly from RGB videos;

  3. A novel architecture for a deep learning method augmented with attention to analyze surgical videos, towards explaining the predictions.

2. Methods

2.1. Feature based approaches

We evaluate five approaches to obtain features from videos, which we then analyze with different classifiers, including support vector machines (SVMs) with linear and radial basis function kernels, logistic regression, and a multilayer perceptron (MLP).

2.1.1. Detectors

Following [15], we use spatiotemporal interest points (STIPs) to identify regions in images that are key to developing useful feature representations.

2.1.2. Descriptors

We compute three different descriptors for each STIP: histogram of oriented gradients (HoG) to encode local spatial information in image patches [7], histogram of optical flow (HoF) to describe localized flow in videos [7], and motion boundary histograms (MBHs) to encode localized temporal information in X and Y components of the differential flow [35].

2.1.3. Features

Bag of Words (BoW)

We cluster descriptors of STIPs in a video using a k-means algorithm to obtain a visual feature vocabulary, and use TF-IDF to compute the BoW feature [25].
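As a concrete illustration, the BoW feature could be computed along the following lines. This is a minimal sketch using scikit-learn, assuming STIP descriptors are available as NumPy arrays; the vocabulary size and all variable names are illustrative, not taken from our implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfTransformer

def bow_features(descriptors_per_video, n_words=50, seed=0):
    """TF-IDF weighted bag-of-words features from STIP descriptors.

    descriptors_per_video: list of (n_i, d) arrays, one per video.
    Returns an (n_videos, n_words) feature matrix.
    """
    # Learn the visual vocabulary from all descriptors pooled together.
    all_desc = np.vstack(descriptors_per_video)
    kmeans = KMeans(n_clusters=n_words, random_state=seed).fit(all_desc)

    # Histogram of visual-word counts for each video.
    counts = np.zeros((len(descriptors_per_video), n_words))
    for i, desc in enumerate(descriptors_per_video):
        words = kmeans.predict(desc)
        counts[i] = np.bincount(words, minlength=n_words)

    # TF-IDF re-weighting of the raw counts.
    return TfidfTransformer().fit_transform(counts).toarray()
```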

Augmented Bag of Words (Aug. BoW)

To address the lack of temporal information in BoW, we create a temporal vocabulary by quantizing the time between STIPs into N bins. We concatenate the visual and temporal vocabularies and compute n-grams. The n-grams thus encode a sequence of events together with the sum of their durations. This representation is useful for surgical movements that take relatively long periods of time and continue over a number of frames. We use TF-IDF to represent the n-grams [4]; a sketch of the interspersed variant follows.
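The sketch below illustrates the interspersed encoding under stated assumptions: the quantile-based bin edges, the "v"/"t" token prefixes, and the n-gram handling are illustrative choices, not the exact encoding of [4].

```python
import numpy as np
from collections import Counter

def interspersed_ngrams(visual_words, times, n_time_bins=5, n=3):
    """Interleave visual words with quantized inter-STIP time gaps and count
    n-grams over the combined sequence (interspersed encoding).

    visual_words: cluster label of each STIP, in temporal order.
    times: timestamp of each STIP, in the same order.
    """
    gaps = np.diff(times)
    # Temporal vocabulary: quantize gaps into n_time_bins bins by quantiles.
    edges = np.quantile(gaps, np.linspace(0, 1, n_time_bins + 1)[1:-1])
    gap_words = np.digitize(gaps, edges)

    # Interleaved sequence: v_0, t_0, v_1, t_1, ..., v_{N-1}
    seq = []
    for i, v in enumerate(visual_words):
        seq.append(f"v{v}")
        if i < len(gap_words):
            seq.append(f"t{gap_words[i]}")

    # n-gram counts; these would then be mapped onto a fixed n-gram
    # vocabulary and TF-IDF weighted, as for the plain BoW feature.
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
```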

Discrete Fourier Transform (DFT) / Discrete Cosine Transform (DCT)

DFT and DCT extract motion information, i.e., the frequencies of the different surgical action cluster categories [33]. A transformation matrix or motion class matrix (MCM) is created from the clusters of the visual vocabulary learned with k-means: for a video with N frames, a K×N MCM is created whose entry (k, n) counts how many interest points in the n-th frame belong to cluster k, for k=1, 2, …, K. The DFT and DCT of this matrix are then calculated, where each entry of the resulting K×N matrix represents the n-th frequency coefficient of the k-th cluster. This matrix measures the repetitiveness of the different surgical actions in the videos. Since higher frequencies result from noisy or abrupt movements, only the lowest D frequencies are used, giving a K×D matrix. This matrix is flattened into a K·D-dimensional feature vector.
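A minimal sketch of the MCM construction and its frequency features, assuming per-frame cluster labels are already available; the function name and argument layout are illustrative.

```python
import numpy as np
from scipy.fft import dct, fft

def frequency_features(frame_clusters, K, D=50):
    """Build the K x N motion class matrix (MCM) and keep the D lowest
    DCT / DFT coefficients of each cluster's time series.

    frame_clusters: list of length N; entry n holds the cluster labels of
    all interest points detected in frame n.
    """
    N = len(frame_clusters)
    mcm = np.zeros((K, N))
    for n, labels in enumerate(frame_clusters):
        for k in labels:
            mcm[k, n] += 1            # interest-point counts per cluster, per frame

    dct_coeff = dct(mcm, axis=1)[:, :D]            # K x D, low frequencies only
    dft_coeff = np.abs(fft(mcm, axis=1))[:, :D]    # K x D
    return np.concatenate([dct_coeff.ravel(), dft_coeff.ravel()])
```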

Sequential Motion Textures (SMT)

In SMT, spatiotemporal information is encoded in the form of gray-level co-occurrence matrices (GLCMs) derived from frame kernel matrices of the MCM [26]; these represent the affinity of each element of a matrix with respect to all other elements. The frame kernel matrices are obtained by splitting the MCM into time windows of a specified width (W), which allows temporal information to be considered over a fixed time interval according to the length of the videos. This is followed by applying a radial basis function kernel and shifting the domain to grayscale. Texture patterns are then extracted from these matrices using GLCMs for different gray levels Ng, which encode both spatial relationships and motion dynamics in the surgical videos. After calculating the averaged and normalized GLCMs for the windows, 20 standard texture features are selected using sequential forward feature selection (SFFS); this results in W×20 features for each video [20].
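A sketch of the per-window texture computation, assuming the MCM from the previous step; the four GLCM properties shown here merely stand in for the 20 texture features selected with SFFS, and the default window size, kernel width, and gray levels are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from skimage.feature import graycomatrix, graycoprops

def smt_window_textures(mcm, window=8, gamma=1e-5, levels=16):
    """For each time window of the MCM, build a frame kernel (affinity)
    matrix with an RBF kernel, quantize it to gray levels, and extract
    GLCM texture properties."""
    features = []
    for start in range(0, mcm.shape[1] - window + 1, window):
        frames = mcm[:, start:start + window].T          # window x K
        kernel = rbf_kernel(frames, gamma=gamma)         # window x window affinities
        span = kernel.max() - kernel.min() + 1e-12
        gray = np.uint8((levels - 1) * (kernel - kernel.min()) / span)
        glcm = graycomatrix(gray, distances=[1], angles=[0],
                            levels=levels, normed=True)
        features += [graycoprops(glcm, prop).item()
                     for prop in ("contrast", "homogeneity", "energy", "correlation")]
    return np.asarray(features)
```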

Approximate Entropy (ApEn) / Cross Approximate Entropy (XApEn)

ApEn and XApEn construct entropy-based features that measure the amount of entropy in a given time-series input [34]. These features are expected to recognize predictable and regular patterns in time series, which in turn leads to better skill assessment from video-based descriptors and accelerometer data [34]. For ApEn, each time series is split into embedding vectors (time windows) according to a time delay, and for each embedding dimension the frequency of repeatable patterns is calculated by summing Heaviside functions of the L-infinity norm distances between embedding vectors of the MCM time series [34]. XApEn is calculated similarly to ApEn, except that the L-infinity norm distance is calculated between embedding vectors for all possible cluster pairs.
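A minimal NumPy sketch of ApEn for a single time series (e.g., one row of the MCM); XApEn would replace the self-comparison with distances between embedding vectors of two different cluster time series. Default parameter values are illustrative.

```python
import numpy as np

def approximate_entropy(x, m=2, r_coeff=0.2, tau=1):
    """Approximate entropy of a 1-D time series.
    m: embedding dimension, tau: time delay, tolerance r = r_coeff * std(x)."""
    x = np.asarray(x, dtype=float)
    r = r_coeff * np.std(x)

    def phi(m):
        # Embedding vectors x_i = [x(i), x(i+tau), ..., x(i+(m-1)tau)]
        n_vec = len(x) - (m - 1) * tau
        emb = np.array([x[i:i + m * tau:tau] for i in range(n_vec)])
        # Fraction of neighbours within tolerance r under the L-infinity norm
        # (Heaviside of r minus the Chebyshev distance).
        dist = np.max(np.abs(emb[:, None, :] - emb[None, :, :]), axis=-1)
        c = (dist <= r).mean(axis=1)
        return np.mean(np.log(c + 1e-12))

    return phi(m) - phi(m + 1)
```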

2.2. Tool detection approach for skill assessment

Prior work shows that tool motion data is useful for surgical skill assessment [13], and tooltip annotations in video images can be obtained through crowd-sourcing. To avoid the need for tooltip annotation, we learn to detect tooltips as keypoints of objects in video and analyze the predicted keypoints using a TCN (KP-TCN).

2.2.1. Detection of surgical instrument tips

We use a TCN described in [13]. We include an additional stage to infer tooltip locations directly from images. We model surgical tool tip locations as keypoints of objects. Let $X = \{x_1, x_2, \ldots, x_N\}$ be a video with $N$ images. We assume that annotations of tooltip locations $y_n = \{p_1^n, \ldots, p_K^n\}$ exist for each frame $x_n$, where $p_k^n \in \mathbb{R}^2$ is the pixel location of the $k$-th keypoint in frame $n$. We learn a keypoint detector $\phi$ that minimizes the following objective:

$$\{q_1^n, \ldots, q_K^n\} = \phi(x_n), \qquad \mathcal{L}_{\text{keypoints}} = \sum_{n=1}^{N} \sum_{k=1}^{K} d(q_k^n, p_k^n)$$ (1)

where $d$ is a distance between keypoints and $q_k^n \in \mathbb{R}^2$ is the keypoint prediction given $x_n$. In this work, we use a convolutional neural network based keypoint detector for $\phi$ and train it in an end-to-end manner.
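A toy PyTorch sketch of the objective in Eq. (1). The small regressor below is only a stand-in for $\phi$ (our implementation uses HR-Net, Sect. 4.2), and its layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class KeypointRegressor(nn.Module):
    """Toy stand-in for the detector phi: image -> K (x, y) keypoint pairs."""
    def __init__(self, n_keypoints=2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 2 * n_keypoints),
        )

    def forward(self, x):                                   # x: (B, 3, H, W)
        return self.backbone(x).view(x.shape[0], -1, 2)     # (B, K, 2)

def keypoint_loss(pred, target):
    """Eq. (1): sum over frames and keypoints of a distance d(q, p),
    here the Euclidean distance."""
    return torch.linalg.vector_norm(pred - target, dim=-1).sum()
```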

2.2.2. TCNs

Using $\phi$, we encode a video $X = \{x_1, x_2, \ldots, x_N\}$ as a time series of keypoint predictions $Z$ such that:

$$Z = \{z_1, z_2, \ldots, z_N\} \in \mathbb{R}^{N \times 2K}$$ (2)

where $z_n = \phi(x_n) \in \mathbb{R}^{2K}$ is the vector of predicted pixel locations of the $K$ keypoints in frame $x_n$.

We calculate tooltip velocities as the difference in predicted keypoint locations between successive frames as in [13].

The final input to the TCN is represented as $Z'$ such that

$$Z' = \{\delta z_1, \delta z_2, \ldots, \delta z_{N-1}\}$$ (3)

where $\delta z_n = z_{n+1} - z_n$ is the tooltip velocity at frame $n$.

We follow the TCN design proposed in [13]. For the $l$-th layer of the TCN with $F_l$ 1-D convolutional filters, the output activation can be written as

$$h_l = \sigma(W_l * h_{l-1})$$ (4)

where $h_l$ is the output of the $l$-th layer, $W_l$ are the convolutional filter weights of the $l$-th layer, $\sigma(\cdot)$ is a non-linearity (ReLU followed by a batch-normalization operation), and $*$ denotes convolution. We train the TCN in an end-to-end manner using standard cross-entropy loss.
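The temporal model of Eqs. (2)–(4) can be sketched in PyTorch as follows; the number of layers, filter widths, and pooling are illustrative and not the configuration reported in [13].

```python
import torch
import torch.nn as nn

class SkillTCN(nn.Module):
    """Keypoint sequence (B, N, 2K) -> velocities -> 1-D convolutions ->
    skill logits. Layer sizes are illustrative."""
    def __init__(self, n_keypoints=2, channels=(32, 64), n_classes=2):
        super().__init__()
        layers, in_ch = [], 2 * n_keypoints
        for out_ch in channels:
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size=9, padding=4),
                       nn.ReLU(),
                       nn.BatchNorm1d(out_ch),
                       nn.MaxPool1d(2)]
            in_ch = out_ch
        self.encoder = nn.Sequential(*layers)
        self.classifier = nn.Linear(channels[-1], n_classes)

    def forward(self, keypoints):                            # (B, N, 2K)
        velocity = keypoints[:, 1:] - keypoints[:, :-1]      # Eq. (3)
        h = self.encoder(velocity.transpose(1, 2))           # (B, C, T)
        return self.classifier(h.mean(dim=-1))               # temporal pooling
```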

2.3. Dual attention network (ATT)

The TCN omits important contextual information such as visual changes in anatomy and instrument/anatomy interactions. To address this limitation, we present an attention-based method for VBA directly from RGB videos. Past studies have shown that attention mechanisms generate better context vectors that localize to the parts of the input relevant to the task [3]. We hypothesize that including attention in our model enables us to localize time frames and locations in the video that are useful for classification. Our architecture includes a feature extractor and an LSTM module, both equipped with separate but dependent spatial and temporal attention mechanisms, as explained below.

2.3.1. Video Representation

First, a ResNet encoder [11] is used to extract features from a given sequence of RGB frames. Then, an LSTM cell [12] operates on those features to generate the final classification. We augment both the ResNet encoder and the LSTM cell with attention [30, 31].

For the spatial domain, the ResNet returns D feature maps of size H×W for each image in a video. The output of the feature extractor is represented by

$$V = \{F_1, F_2, \ldots, F_N\}, \qquad V \in \mathbb{R}^{N \times D \times H \times W}$$ (5)

where $F_n \in \mathbb{R}^{D \times H \times W}$ is a spatial feature and $N$ is the number of frames in the video. The frame features in turn consist of a $D$-dimensional positional activation map, represented as follows:

$$F_n = \{a_n^1, a_n^2, \ldots, a_n^L\}, \qquad a_n^i \in \mathbb{R}^D$$ (6)

where $D$ is the dimension of the feature vector for each position in the image, and $L = H \times W$ is the number of positions in the image.

2.3.2. Spatial and Temporal Attention

The frame feature vectors $F_n$ are passed sequentially through the LSTM cell, and both the hidden state of the LSTM cell and the frame feature vectors are used to compute the attention weights as follows:

$$e_n^i = f_{att}(F_n, h_{n-1}) = f_{att}(\{a_n^1, a_n^2, \ldots, a_n^L\}, h_{n-1}), \qquad \alpha_n^i = \frac{\exp(e_n^i)}{\sum_{j=1}^{L} \exp(e_n^j)}$$ (7)

where $\alpha_n^i$ are the positional attention weights and $f_{att}$ is the spatial attention module, defined as:

$$att_f = F_n \times W_f, \qquad att_l = h_{n-1} \times W_h, \qquad f_{att}(F_n, h_{n-1}) = \mathrm{ReLU}(att_f + att_l) \times W_c$$ (8)

where $W_f$, $W_h$, and $W_c$ represent linear layers. Once the attention weights $\alpha_n^i$ are computed, the final context vector $z_n$ is computed as follows:

$$z_n = \sum_{i=1}^{L} \alpha_n^i a_n^i$$ (9)

The attention-weighted frame encoding vectors are then used to compute the attended feature vector from the LSTM outputs, as shown in Eq. (10).

$$Z = \{z_1, z_2, \ldots, z_N\}, \qquad M = h_N \times \tanh(Z), \qquad \beta = \mathrm{softmax}(M), \qquad r = H\beta$$ (10)

where $z_i$ is the spatial-attention-weighted frame encoding for frame $i$, $M$ are the soft alignment scores, $\beta$ are the temporal attention weights, $H$ is the combined LSTM output for all frames, and $r$ is the final temporal-attention-weighted encoding for the entire video.

Finally, the attended feature encoding is input to a linear classification layer that generates the final classification. We use cross-entropy loss and train the models for 1000 epochs with a step decay on plateau using stochastic gradient descent with an initial learning rate of 1e − 2.
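The spatial and temporal attention of Eqs. (7)–(10) can be sketched as a single PyTorch module operating on pre-extracted ResNet feature maps; the hidden sizes and the exact parameterization of the alignment scores are illustrative assumptions, not our released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttention(nn.Module):
    """Spatial attention over per-frame feature maps, an LSTM over the
    attended frame vectors, and temporal attention over the LSTM outputs."""
    def __init__(self, d=512, hidden=256, n_classes=2):
        super().__init__()
        self.w_f = nn.Linear(d, hidden)       # W_f in Eq. (8)
        self.w_h = nn.Linear(hidden, hidden)  # W_h
        self.w_c = nn.Linear(hidden, 1)       # W_c
        self.lstm = nn.LSTMCell(d, hidden)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, feats):                 # feats: (B, N, L, D), L = H*W
        B, N, L, D = feats.shape
        h = feats.new_zeros(B, self.lstm.hidden_size)
        c = feats.new_zeros(B, self.lstm.hidden_size)
        hs = []
        for n in range(N):
            a = feats[:, n]                                            # (B, L, D)
            e = self.w_c(torch.relu(self.w_f(a) + self.w_h(h).unsqueeze(1)))
            alpha = F.softmax(e, dim=1)                                # Eq. (7)
            z = (alpha * a).sum(dim=1)                                 # Eq. (9)
            h, c = self.lstm(z, (h, c))
            hs.append(h)
        H = torch.stack(hs, dim=1)                                     # (B, N, hidden)
        m = torch.tanh(H) @ h.unsqueeze(-1)                            # alignment vs. h_N
        beta = F.softmax(m, dim=1)                                     # Eq. (10)
        r = (beta * H).sum(dim=1)                                      # attended video encoding
        return self.classifier(r)
```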

3. Dataset

We used a dataset of 99 videos of capsulorhexis (a critical step in cataract surgery), identical to that used in [13]. The Johns Hopkins Medical Institutions Institutional Review Board approved this study. A faculty surgeon operated in 28 instances and a trainee surgeon operated in 71 instances. We processed videos captured from the operating microscope to a resolution of 640×480 at 59 frames per second.

One expert surgeon assigned a rating for skill using the International Council of Ophthalmology’s Ophthalmology Surgical Competency Assessment Rubric for phacoemulsification (ICO-OSCAR:phaco) [9]. ICO-OSCAR:phaco includes two items to assess skill in capsulorhexis, corresponding to two surgical goals in this step: commencement of flap & follow-through (CF), and capsulorhexis formation and completion (RF). Each item is rated on a scale of 2–5.

4. Experiments

We performed two experiments: predicting a binary skill class (expert / novice), and predicting a score on CF and RF. For the binary skill class label, videos assigned a score of 5 on at least one of the two capsulorhexis items in ICO-OSCAR:phaco and a score of at least 4 on the other item were labeled expert; otherwise, the videos were labeled novice. The 3-class labels for CF and RF were derived from the item scores of 2, 3, 4, or 5; we collapsed scores of 2 and 3 into one class for our analysis. We used the same five-fold cross-validation setup for both experiments (Table 1). Holding each fold out in turn for testing, we iteratively used one of the remaining four folds for validation and trained on the other three folds. We used the model configuration and hyperparameters for which accuracy on the validation fold was highest.
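This nested five-fold procedure can be made explicit with a small helper; fold assignments are assumed to be given per video, and the function below (an illustrative name) only enumerates the train/validation/test splits. With five folds it enumerates 20 splits; the configuration with the best validation accuracy is then evaluated on the corresponding held-out test fold.

```python
import numpy as np

def nested_cv_splits(fold_ids):
    """Yield (train, val, test) index arrays: each fold is held out once for
    testing; each remaining fold serves once as the validation fold while the
    other three folds are used for training."""
    fold_ids = np.asarray(fold_ids)
    folds = np.unique(fold_ids)
    for test_fold in folds:
        for val_fold in folds[folds != test_fold]:
            yield (np.where(~np.isin(fold_ids, [test_fold, val_fold]))[0],
                   np.where(fold_ids == val_fold)[0],
                   np.where(fold_ids == test_fold)[0])
```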

Table 1.

Distribution of groundtruth across cross-validation folds.

Fold                 1    2    3    4    5

Binary Skill Class
  Expert            10   10   10   10   12
  Novice             9    9    9    9   11

CF
  CF-2/3             4    1    3    4    2
  CF-4               8    9    6    6   10
  CF-5               7    9   10    9   11

RF
  RF-2/3             6    4    5    5    3
  RF-4               3    6    6    6   12
  RF-5              10    9    8    8    8

CF = Commencement of flap & follow-through;

RF = Rhexis formation and completion

4.1. Implementation

4.1.1. Feature based approaches

To compute STIPs, we use windows of 6 seconds with a 2 second overlap on each side. The frame rate for input videos is 59 fps, resulting in 520 frames for each window. We use a 3×3×3 Gaussian kernel with spatial variance of 4 and temporal variance of 8 for the convolution. We use a Gaussian kernel with variance 1 to smooth the response function. We used the top 1000 STIPs.

For HoGs, we extracted a 9×9×9 patch centered on each STIP, split it into 9 windows, and computed 8 gradient orientations, resulting in a 72-dimensional feature vector. For HoFs, we used a 17×17×17 image patch centered on each STIP, computed optical flow within the patch, and split it into 9 windows with 9 gradient orientations to obtain an 81-dimensional vector. For MBHs, we used 17×17 patches separately in the X and Y components of the optical flow, split each into 25 windows, and computed 8 gradient orientations (HoGs); this gives a 400-dimensional vector after concatenating the vectors for the X and Y components. We concatenated all three descriptors, resulting in a 513-dimensional feature for each video.

For k-means in Aug. BoW, we evaluated K=25, 50, and 100, using Euclidean and Mahalanobis distances. We computed n-grams using interspersed, cumulative, or pyramid encoding, with n=3 and 5. In interspersed encoding, the temporal information is the time between events (or cluster labels), encoded within the sequence of cluster labels occurring in the videos; this is useful for short surgical movements that take small amounts of time and are independent across frames. Cumulative encoding encodes temporal information over a (user-specified) sequence of events and sums it up. Pyramid encoding creates l-grams for all l from 1 to n; this can be interpreted as breaking down surgical movements and representing them at different levels. We used 5 bins to create the temporal vocabulary, following previous recommendations [4, 32].

For DFT and DCT, the frequencies for each cluster time series in the MCM are calculated, and the 50 lowest frequencies are selected from each to ignore noisy frequencies. This gives a K×50 vector for each of DFT and DCT, resulting in a 2×K×50-dimensional vector. The top 30 features are then selected using SFFS, yielding a final 30-dimensional frequency feature vector. SFFS works by evaluating the effectiveness of each individual feature using a classifier and selecting the top n (user-specified) features.

For SMT, we analyzed multiple frame kernel matrix variances including 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, and 1e-3. We evaluated multiple window sizes (W) of 2, 4, 8, and 16. We also analyzed multiple gray levels (Ng) of 8, 16, 32, 64, 128, and 256.

For entropy-based methods, we combined features from ApEn and XApEn. The dimensionality of the ApEn features is r·K and that of the XApEn features is r·K·(K−1)/2, where K is the number of clusters; the ApEn and XApEn features are concatenated to produce a single feature vector. We fixed τ at 1 (following [32]) and let m take values of 1 and 2. r_coeff takes values 0.1, 0.15, 0.2, and 0.25, where r = r_coeff·σ and σ is the standard deviation of the time series.

We implemented logistic regression with L2 penalty. We chose hyperparameters for the classifiers, including regularization for SVMs and logistic regression, and the number of hidden layers in the MLP, using the validation fold.

4.2. Deep learning approaches

All deep learning approaches are implemented using the PyTorch [22] framework and trained using K80 GPUs.

For tool detection, we trained HR-Net [27] using the tool tip annotations of [13]. We use standard image augmentations during training, which include horizontal flips, shift-scale-rotate, and brightness and contrast adjustments. We follow the optimization proposed in [27] and use a weighted focal loss [17] with a positive weight of 3 and a 1-cycle learning rate schedule with an initial learning rate of 1e-2. We used the Adam optimizer for training.

For TCNs, we use the Adam optimizer with a learning rate of 1e-3 and train the models for 50 epochs using standard cross-entropy loss [14]. To train the dual attention network, we sample video snippets of 64 frames from each video, with a stride of 4 frames. The image frames are normalized, and flip and rotate data augmentations are applied. The models are trained for 100 epochs with the Adam optimizer, with an initial learning rate of 1e-2 that is lowered by a factor of 10 as the validation loss plateaus. We used standard cross-entropy loss.

At test time using dual attention networks, a video is represented as clips of 64 frames sampled 4 frames apart to be consistent with training. The clips overlap by 32 frames. Predictions are made individually on all the clips and averaged to produce the final output.
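A sketch of this clip-level inference, assuming frames are stored as a (T, C, H, W) tensor and the model returns class logits for a batch of clips; the helper name and parameter defaults are illustrative.

```python
import torch

@torch.no_grad()
def predict_video(model, frames, clip_len=64, stride=4, hop=32):
    """Sample clips of `clip_len` frames taken `stride` frames apart, with
    consecutive clips overlapping by `hop` sampled frames, and average the
    per-clip softmax outputs."""
    span = clip_len * stride                     # raw frames spanned by one clip
    starts = range(0, max(len(frames) - span, 0) + 1, hop * stride)
    probs = []
    for s in starts:
        clip = frames[s:s + span:stride]         # (clip_len, C, H, W)
        logits = model(clip.unsqueeze(0))        # add batch dimension
        probs.append(torch.softmax(logits, dim=-1))
    return torch.cat(probs).mean(dim=0)          # averaged class probabilities
```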

We also evaluated a network trained without attention (No ATT), using the same implementation described above for the dual attention network.

4.3. Statistical Analyses

To evaluate algorithm performance in each experiment, we computed the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive and negative predictive values (PPV, NPV), and accuracy (both micro- and macro-accuracy [13]). We used the bootstrap to compute 95% confidence intervals (CI) for AUC and computed the Wilson interval for the remaining measures [2]. For algorithms predicting a score on each item for capsulorhexis in ICO-OSCAR:phaco, we used the method described in [10] to estimate AUC and the bootstrap to compute 95% CIs.
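The percentile bootstrap for the AUC confidence interval could look like the following sketch; the number of resamples and the handling of single-class resamples are illustrative choices, not necessarily those used in our analysis.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate of AUC with a percentile bootstrap (1 - alpha) CI."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        if len(np.unique(y_true[idx])) < 2:
            continue                                      # skip single-class resamples
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    point = roc_auc_score(y_true, y_score)
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return point, (lo, hi)
```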

5. Results

Feature based approaches

To predict an expert/novice label, findings using a linear SVM classifier are shown in Figure 1, Figure 2, and Table 2. Findings for other classifiers are shown in Table 2 in the Supplementary Material.

Fig. 1. ROC plots for interest-point based methods. Numbers on plots are AUC and 95% confidence intervals.

Fig. 2. Predictive performance of interest-point based methods.

Table 2.

Estimates of accuracy of deep learning methods for scores on individual items in ICO-OSCAR:phaco for capsulorhexis (95% confidence intervals in parentheses).

Item   AUC                   Micro accuracy        Macro accuracy
KP: predicted keypoints analyzed with a TCN
  CF   0.67 (0.59 to 0.72)   0.65 (0.55 to 0.73)   0.76 (0.67 to 0.84)
  RF   0.72 (0.63 to 0.78)   0.60 (0.50 to 0.69)   0.73 (0.64 to 0.81)
ATT: neural network with attention mechanisms
  CF   0.71 (0.64 to 0.77)   0.64 (0.54 to 0.72)   0.76 (0.67 to 0.83)
  RF   0.65 (0.58 to 0.70)   0.48 (0.39 to 0.58)   0.66 (0.56 to 0.74)

KP = predicted keypoints analyzed with a temporal convolutional neural network (TCN); ATT = neural network with attention mechanisms. CF = Commencement of flap & follow-through; RF = Rhexis formation and completion.

None of the features enabled consistently high measures of performance for any of the classifiers. The estimates of AUC were 0.75 or greater for some features and classifiers, suggesting no meaningful difference from the deep learning models. However, classifier performance was not uniform across positively labeled (sensitivity) and negatively labeled instances (specificity). The utility of the classifiers is limited by low values of sensitivity or specificity.

Among the features we computed, the AUC across classifiers was consistently low (0.49 to 0.54) for SMT, and higher for Aug. BoW (0.64 to 0.75) and ApEn (0.67 to 0.73). Furthermore, none of the features yielded a classifier with uniform sensitivity and specificity. These findings suggest that the features we evaluated carry limited information for surgical skill assessment using intraoperative videos.

To predict a 3-class label for CF and RF using a linear SVM, estimates of AUC for all the feature based methods were consistent with the null value, although estimates of micro- and macro-accuracy, sensitivity, and specificity indicate that algorithm performance may have been affected by the class imbalance in our dataset (Tables 3 and 4 in Supplementary Material).

Deep learning methods

To predict an expert / novice label, the attention-based network had a high AUC, in addition to higher sensitivity and specificity than the other two deep learning methods (Figures 3 and 4). KP had a similar AUC to the attention-based network, but lower sensitivity and specificity.

Fig. 3. ROC plots for deep learning methods. Numbers on plots are AUC and 95% confidence intervals.

Fig. 4. Predictive performance of deep learning methods.

For predicting a 3-class label for CF and RF, estimates of AUC were lower than that for expert / novice label prediction (Table 2). In addition, these methods had higher sensitivity and lower specificity for labels indicating better skill, and lower sensitivity and higher specificity for labels indicating poor skill (Table 3). This observation, together with estimates of micro- and macro-accuracy suggest that performance of deep learning methods was affected by class imbalance in CF and RF in our dataset.

Table 3.

Estimates of performance measures for scores on individual items in ICO-OSCAR:phaco for capsulorhexis (95% confidence intervals in parentheses).

Item Sensitivity Specificity PPV NPV
KP: predicted keypoints analyzed with a TCN
CF = 2/3 0.00 (0.00 to 0.22) 0.99 (0.94 to 1.00) 0.00 (0.00 to 0.95) 0.86 (0.77 to 0.91)
CF = 4 0.64 (0.48 to 0.77) 0.72 (0.59 to 0.81) 0.60 (0.44 to 0.73) 0.75 (0.63 to 0.85)
CF = 5 0.85 (0.72 to 0.92) 0.68 (0.55 to 0.79) 0.70 (0.57 to 0.80) 0.84 (0.70 to 0.92)
RF = 2/3 0.35 (0.19 to 0.55) 0.93 (0.86 to 0.97) 0.62 (0.36 to 0.82) 0.83 (0.73 to 0.89)
RF = 4 0.45 (0.30 to 0.62) 0.80 (0.69 to 0.88) 0.54 (0.36 to 0.70) 0.75 (0.63 to 0.83)
RF = 5 0.84 (0.70 to 0.92) 0.61 (0.48 to 0.72) 0.62 (0.49 to 0.73) 0.83 (0.69 to 0.91)
ATT: neural network with attention mechanisms
CF = 2/3 0.07 (0.00 to 0.31) 0.99 (0.94 to 1.00) 0.50 (0.03 to 0.97) 0.87 (0.78 to 0.92)
CF = 4 0.62 (0.46 to 0.75) 0.73 (0.61 to 0.83) 0.60 (0.45 to 0.74) 0.75 (0.62 to 0.84)
CF = 5 0.83 (0.69 to 0.91) 0.64 (0.51 to 0.76) 0.67 (0.54 to 0.78) 0.81 (0.67 to 0.90)
RF = 2/3 0.17 (0.07 to 0.37) 0.87 (0.77 to 0.93) 0.29 (0.12 to 0.55) 0.78 (0.68 to 0.85)
RF = 4 0.36 (0.22 to 0.53) 0.71 (0.59 to 0.81) 0.39 (0.24 to 0.56) 0.69 (0.57 to 0.79)
RF = 5 0.74 (0.60 to 0.85) 0.61 (0.48 to 0.72) 0.59 (0.46 to 0.71) 0.76 (0.61 to 0.86)

KP = predicted keypoints analyzed with a temporal convolutional neural network (TCN); ATT = neural network with attention mechanisms. CF = Commencement of flap & follow-through; RF = Rhexis formation and completion; PPV = positive predictive value; NPV = negative predictive value.

6. Discussion

Our work is a comprehensive evaluation of state-of-the-art methods for VBA of surgical skill in the operating room. Our findings show that deep learning methods perform better than most feature based methods in terms of AUC. Furthermore, a network using attention mechanisms had the most desirable performance measures for VBA of skill directly from RGB videos of the surgical field. Even when compared with a network trained using precise manual annotations of instrument tips [13], we observed higher sensitivity (0.843 vs. 0.824) and specificity (0.75 vs. 0.71) with the attention-based network. These findings indicate that it is useful to analyze the entire context in the surgical field to assess skill instead of analyzing instrument motion alone.

Our findings from the analysis of predicted instrument tips (KP) in this study reinforce our prior observation that instrument motion in capsulorhexis can be used to discriminate surgical skill [13]. However, performance of the algorithm using predicted keypoints was lower than that obtained using precise manual annotations (AUC 0.79 vs. 0.86). While it is likely that larger datasets can improve accuracy of the predicted keypoints, and subsequently of skill assessment, future research should consider methods to encode additional context in the surgical field that is not limited to instrument motion.

Among the feature based methods, Aug. BoW, a simple method to analyze overall temporal information, appeared to yield a better linear SVM classifier than the other methods. Aug. BoW encodes overall temporal information at either a low or a high level depending on the type of encoding. On the other hand, DFT/DCT, SMT, and ApEn extract specific types of temporal features. This specificity limits their utility in the complex real-world environment of the operating room, as opposed to the controlled simulated environments in which these methods were developed [32, 34]. However, none of the features led to a classifier that was as accurate as the deep learning methods with consistent sensitivity and specificity. Furthermore, the large variances in estimates of discrimination measures observed in our study for the feature based methods indicate that more data may be necessary to learn a useful classifier with these high-dimensional features. Even if larger datasets result in more accurate feature based methods, there is little evidence to suggest that they will generalize to new datasets and be useful for surgical skill assessment in the operating room. Moreover, deep learning methods may yield more discriminative classifiers when larger datasets are available.

Performance of algorithms for VBA should be evaluated in the context of the intended application. VBA of surgical skill in the operating room has multiple applications, each with different stakes or consequences. Surgical skill is associated with patient outcomes, therefore, interventions to improve surgeons’ skill can advance quality of care. Skill assessment is key to training surgeons. It supports deliberate practice, and provides summative evaluation at the end of rotations, at the end of each year of training, etc. Some high-stakes applications of surgical skill assessment include certification and re-certification of surgeons for independent practice, determination of operating privileges, and end of career decisions. Not all applications of VBA of surgical skill require the same algorithm performance profile. For example, applications with significant consequences of a false positive, such as certification of surgeons, may require higher specificity (and NPV) than sensitivity (and PPV). In fact, deep learning algorithms in our study, if shown to be externally valid, may not have sufficient specificity for high-stakes assessments, but they may be useful for routine training curricula.

Our study has a few limitations, besides a limited amount of data, that provide directions for future research. Our analyses did not account for multiple videos from the same surgeons, thus, we may have overestimated algorithm performance. Imbalance in distribution of groundtruth labels may have adversely affected algorithm performance. We did not analyze sensitivity of algorithm performance to video resolution. Capsulorhexis is performed through microscopic surgical actions, therefore, videos with a high resolution may enable analysis of granular data patterns. To utilize the learned attention maps for an interpretable assessment of skill and generation of actionable feedback for the surgeon, a qualitative analysis of the usability, feasibility, and correctness of attention maps is necessary. Though our approach uses attention maps to implicitly localize relevant parts of the video without verification, a structured analysis of the acquired attention maps is an interesting direction for future research.

7. Conclusion

Deep learning methods are necessary for VBA of surgical skill in the operating room. While our findings show internal validity of deep learning methods for VBA of surgical skill in capsulorhexis, testing them in additional datasets is necessary to establish their external validity.

Supplementary Material

Supplementary material

Declarations

Acknowledgments: Dr. Austin Reiter mentored this work in its early stages.

Funding: Drs. Vedula, Sikder, and Hager are supported by a grant from the National Institutes of Health, U.S.A.; NIH 1R01EY033065. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Footnotes

Competing Interests: None.

References

  • [1].Aghdasi Nava, Bly Randall, W White Lee, Hannaford Blake, Moe Kris, and S Lendvay Thomas. Crowd-sourced assessment of surgical skills in cricothyrotomy procedure. Journal of surgical research, 196(2):302–306, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [2].Agresti Alan. Categorical data analysis, volume 482. John Wiley & Sons, 2003. [Google Scholar]
  • [3].Bahdanau Dzmitry, Cho Kyunghyun, and Bengio Yoshua. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, 2015. [Google Scholar]
  • [4].Bettadapura Vinay, Schindler Grant, Plötz Thomas, and Essa Irfan. Augmenting bag-of-words: Data-driven discovery of temporal and structural information for activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2619–2626, 2013. [Google Scholar]
  • [5].Birkmeyer John D, Finks Jonathan F, O’Reilly Amanda, Oerline Mary, Carlin Arthur M, Nunn Andre R, Dimick Justin, Banerjee Mousumi, and Birkmeyer Nancy JO. Surgical skill and complication rates after bariatric surgery. New England Journal of Medicine, 369(15):1434–1442, 2013. [DOI] [PubMed] [Google Scholar]
  • [6].Carroll Noel and Richardson Ita. Software-as-a-medical device: demystifying connected health regulations. Journal of Systems and Information Technology, 2016. [Google Scholar]
  • [7].Dalal Navneet, Triggs Bill, and Schmid Cordelia. Human detection using oriented histograms of flow and appearance. In European conference on computer vision, pages 428–441. Springer, 2006. [Google Scholar]
  • [8].Deal Shanley B, Stefanidis Dimitrios, Telem Dana, Fanelli Robert D, McDonald Marian, Ujiki Michael, Brunt L Michael, and Alseidi Adnan A. Evaluation of crowd-sourced assessment of the critical view of safety in laparoscopic cholecystectomy. Surgical endoscopy, 31(12):5094–5100, 2017. [DOI] [PubMed] [Google Scholar]
  • [9].Golnik C, Beaver Hilary, Gauba Vinod, Lee Andrew G, Mayorga Eduardo, Palis Gabriela, and Saleh George M. Development of a new valid, reliable, and internationally applicable assessment tool of residents’ competence in ophthalmic surgery (an american ophthalmological society thesis). Transactions of the American Ophthalmological Society, 111:24, 2013. [PMC free article] [PubMed] [Google Scholar]
  • [10].Hand David J and Till Robert J. A simple generalisation of the area under the roc curve for multiple class classification problems. Machine learning, 45(2):171–186, 2001. [Google Scholar]
  • [11].He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [Google Scholar]
  • [12].Hochreiter Sepp and Schmidhuber Jürgen. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. [DOI] [PubMed] [Google Scholar]
  • [13].Kim Tae Soo, O’Brien Molly, Zafar Sidra, Hager Gregory D, Sikder Shameema, and Vedula S Swaroop. Objective assessment of intraoperative technical skill in capsulorhexis using videos of cataract surgery. International journal of computer assisted radiology and surgery, 14(6):1097–1105, 2019. [DOI] [PubMed] [Google Scholar]
  • [14].Kingma Diederik and Ba Jimmy. Adam: A method for stochastic optimization. International Conference on Learning Representations, 12 2014. [Google Scholar]
  • [15].Laptev Ivan. On space-time interest points. International journal of computer vision, 64(2–3):107–123, 2005. [Google Scholar]
  • [16].Lendvay Thomas S, White Lee, and Kowalewski Timothy. Crowdsourcing to assess surgical skill. JAMA surgery, 150(11):1086–1087, 2015. [DOI] [PubMed] [Google Scholar]
  • [17].Lin T, Goyal P, Girshick R, He K, and Dollár P. Focal loss for dense object detection. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, 2017. [Google Scholar]
  • [18].Maier-Hein L, Vedula SS, Speidel S, Navab N, Kikinis R, Park A, Eisenmann M, Feussner H, Forestier G, Giannarou S, Hashizume M, Katic D, Kenngott H, Kranzfelder M, Malpani A, März K, Neumuth T, Padoy N, Pugh C, Schoch N, Stoyanov D, Taylor R, Wagner M, Hager GD, and Jannin P. Surgical data science for next-generation interventions. Nature Biomedical Engineering, 1(9):691–696, 2017. [DOI] [PubMed] [Google Scholar]
  • [19].Malpani Anand, Vedula S Swaroop, Chen Chi Chiung Grace, and Hager Gregory D. A study of crowdsourced segment-level surgical skill assessment using pairwise rankings. International journal of computer assisted radiology and surgery, 10(9):1435–1447, 2015. [DOI] [PubMed] [Google Scholar]
  • [20].Marcano-Cedeno A, Quintanilla-Domínguez J, Cortina-Januchs MG, and Andina D. Feature selection using sequential forward selection and classification applying artificial metaplasticity neural network. In IECON 2010–36th annual conference on IEEE industrial electronics society, pages 2845–2850. IEEE, 2010. [Google Scholar]
  • [21].Pandey VA, Wolfe JHN, Black SA, Cairols M, Liapis CD, and Bergqvist D. Self-assessment of technical skill in surgery: the need for expert feedback. The Annals of The Royal College of Surgeons of England, 90(4):286–290, 2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [22].Paszke Adam, Gross Sam, Massa Francisco, Lerer Adam, Bradbury James, Chanan Gregory, Killeen Trevor, Lin Zeming, Gimelshein Natalia, Antiga Luca, Desmaison Alban, Kopf Andreas, Yang Edward, DeVito Zachary, Raison Martin, Tejani Alykhan, Chilamkurthy Sasank, Steiner Benoit, Fang Lu, Bai Junjie, and Chintala Soumith. Pytorch: An imperative style, high-performance deep learning library. In Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, and Garnett R, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019. [Google Scholar]
  • [23].Powers MK, Boonjindasup A, Pinsky M, Dorsey P, Maddox M, Su LM, Gettman M, Sundaram CP, Castle EP, Lee JY, and Lee BR. Crowd-sourcing assessment of surgeon dissection of renal artery and vein during robotic partial nephrectomy: a novel approach for quantitative assessment of surgical performance. Journal of endourology, 30(4):447–452, 2016. [DOI] [PubMed] [Google Scholar]
  • [24].Pugh Carla, Hashimoto Daniel A, and Korndorffer James R Jr. The what? how? and who? of video based assessment. The American Journal of Surgery, 2020. [DOI] [PubMed] [Google Scholar]
  • [25].Robertson Stephen. Understanding inverse document frequency: on theoretical arguments for idf. Journal of documentation, 2004. [Google Scholar]
  • [26].Sharma Yachna. Surgical skill assessment using motion texture analysis. PhD thesis, Georgia Institute of Technology, 2014. [Google Scholar]
  • [27].Sun Ke, Xiao Bin, Liu Dong, and Wang Jingdong. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5693–5703, 2019. [Google Scholar]
  • [28].Valanci-Aroesty Sofia, Alhassan Noura, Feldman Liane S, Landry Tara, Mastropietro Victoria, Fiore Julio Jr, Lee Lawrence, Fried Gerald M, and Mueller Carmen L. Implementation and effectiveness of coaching for surgeons in practice–a mixed studies systematic review. Journal of Surgical Education, 2020. [DOI] [PubMed] [Google Scholar]
  • [29].Vedula S Swaroop, Ishii Masaru, and Hager Gregory D. Objective assessment of surgical technical skill and competency in the operating room. Annual review of biomedical engineering, 19:301–325, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [30].Xu Kelvin, Ba Jimmy, Kiros Ryan, Cho Kyunghyun, Courville Aaron, Salakhudinov Ruslan, Zemel Rich, and Bengio Yoshua. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015. [Google Scholar]
  • [31].Zhou Peng, Shi Wei, Tian Jun, Qi Zhenyu, Li Bingchen, Hao Hongwei, and Xu Bo. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 207–212, 2016. [Google Scholar]
  • [32].Zia Aneeq and Essa Irfan. Automated surgical skill assessment in rmis training. International Journal of Computer Assisted Radiology and Surgery, 13(5):731–739, 2018. [DOI] [PubMed] [Google Scholar]
  • [33].Zia Aneeq, Sharma Yachna, Bettadapura Vinay, Sarin Eric L, Clements Mark A, and Essa Irfan. Automated assessment of surgical skills using frequency analysis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 430–438. Springer, 2015. [Google Scholar]
  • [34].Zia Aneeq, Sharma Yachna, Bettadapura Vinay, Sarin Eric L, and Essa Irfan. Video and accelerometer-based motion analysis for automated surgical skills assessment. International journal of computer assisted radiology and surgery, 13(3):443–455, 2018. [DOI] [PubMed] [Google Scholar]
  • [35].Zia Aneeq, Sharma Yachna, Bettadapura Vinay, Sarin Eric L, Ploetz Thomas, Clements Mark A, and Essa Irfan. Automated video-based assessment of surgical skills for training and evaluation in medical schools. International journal of computer assisted radiology and surgery, 11(9):1623–1636, 2016. [DOI] [PubMed] [Google Scholar]
