Published in final edited form as: Curr Probl Surg. 2022 Feb 10;59(6):101125. doi: 10.1016/j.cpsurg.2022.101125

Artificial Intelligence in Surgery: A Research Team Perspective

Hossein Mohamadipanah 1, Calvin Perumalla 2, Su Yang 3, Brett Wise 4, LaDonna Kearse 5, Cassidi Goll 6, Anna Witt 7, James R Korndorffer Jr 8, Carla Pugh 9

ARTIFICIAL INTELLIGENCE IN SURGERY

Why Do We Talk About Artificial Intelligence?

The rise of artificial intelligence (AI) applications in the surgical domain necessitates familiarity of the medical community with the concepts and applications of AI. In this monograph, we present a comprehensive introduction to AI, including its history, general concepts, and uses, as well as specific examples in surgery.

Definition of Artificial Intelligence

AI is defined as “the theory and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages.”1 AI is not a single entity, but rather a collection of computer systems (algorithms) designed for different applications. In this monograph, we focus on the common applications of AI in the surgical domain and elaborate on underlying concepts and terminologies.

Biological Inspiration of Artificial Neural Networks

To develop artificial, human-like intelligence, the major stream of research is based on the creation of computer systems inspired by human brain physiology.2 Neurons, the fundamental units of the brain, are essential for receiving signals and transferring information to other cells. A schematic drawing of a neuron consisting of 3 main sections—dendrites, a cell body, and an axon—is shown in Figure 1. The cell body sums the incoming signals and passes the result on to another neuron if it exceeds a threshold value. The strengthening and weakening of synaptic junctions (ie, between a dendrite and the axon) is known to play a significant role in the process of human learning. The same concept is applied when considering artificial neural networks (NNs).

Figure 1. Schematic drawing of a biological neuron (top). Artificial neuron where W1 and W2 represent synaptic junctions (bottom).

Figure 1 also depicts an artificial neuron in which W1 and W2 are called “weights,” where the value of these weights represents the strengthening and weakening of synaptic junctions. In an artificial cell body, the input values X1 and X2 are each multiplied first by their corresponding weights (W1 and W2) and then summed and passed to a transfer function. In the biological sense, a transfer function should only pass a signal after a threshold is reached, whereas in artificial neurons many different types of transfer functions have been used.
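To make the arithmetic concrete, the following minimal Python sketch implements the artificial neuron of Figure 1 with a simple step transfer function; the input values, weights, and threshold are illustrative choices, not values from the text.

```python
import numpy as np

def artificial_neuron(x, w, threshold=0.5):
    """Weighted sum of inputs passed through a step transfer function."""
    z = np.dot(x, w)                  # X1*W1 + X2*W2 + ...
    return 1.0 if z > threshold else 0.0

# Illustrative values: two inputs and two synaptic weights
x = np.array([0.8, 0.3])              # inputs X1, X2
w = np.array([0.9, 0.2])              # weights W1, W2
print(artificial_neuron(x, w))        # 1.0, since 0.8*0.9 + 0.3*0.2 = 0.78 > 0.5
```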

The human brain contains an estimated 100 billion neurons, each with an average of 10,000 connections.3 Each of these neurons receives input at its dendrites from surrounding neurons. In a similar manner, many interconnected artificial neurons are called artificial NNs. A typical NN in which artificial neurons are organized in layers (layers 1 to 5) is shown in Figure 2. Each artificial neuron receives input from the output of all neurons in the previous layer.

Figure 2. Neural network learning process.

The Learning Process: Nuts and Bolts of Neural Networks

The concept of “learning” within NNs also has some parallels to its biological counterpart. Just as a biological neuron may strengthen or weaken its synaptic connections to improve functionality, a NN adjusts its weights according to observed data points. The learning process is depicted in Figure 2. For each input, the network makes a prediction of the desired output. The error of that prediction is calculated with respect to the expected output, and then the weights are adjusted to minimize the error. This process is called “backpropagation.” After the weights are adjusted, the network is ready to stand alone as a function approximator that can predict outputs for unseen inputs.
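A minimal sketch of one such weight adjustment, assuming a single linear neuron and a squared-error measure; the sample values and learning rate are illustrative. Full backpropagation repeats this kind of update layer by layer across the whole network.

```python
import numpy as np

# One illustrative learning step for a single linear neuron:
# predict, measure the error, and nudge the weights to reduce that error.
x = np.array([1.0, 2.0])      # input sample
y_true = 3.0                  # expected output
w = np.array([0.5, 0.5])      # current weights
lr = 0.1                      # learning rate

y_pred = np.dot(w, x)         # forward pass: prediction (1.5)
error = y_pred - y_true       # prediction error (-1.5)
grad = error * x              # gradient of squared error wrt weights
w -= lr * grad                # weight adjustment (gradient descent)
print(y_pred, w)              # 1.5, updated weights [0.65, 0.8]
```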

The field of study that develops AI algorithms that learn by adjusting weights is called “machine learning” (ML). It has been found that a higher number of layers in a NN leads to greater capability in modeling complex functions. These networks are called “deep neural networks” (DNNs), where “deep” refers to the many layers of the network. The field of developing deep neural network algorithms is called “deep learning” (DL). In addition, the term “computer vision” (CV) refers to the field of study that focuses on gaining a high-level understanding from digital images and videos. The relationship between these 4 terms is shown in Figure 3. In general, the field of DL is a subset of ML, which itself is a subset of AI. Additionally, CV can be considered a subset of AI, overlapping with ML and DL.

Figure 3. The relationship between artificial intelligence (AI), machine learning (ML), deep learning (DL), and computer vision (CV).

Data Division in Learning Processes

To gain a better understanding of the learning process of the NN presented in Figure 2, let us assume that our intention is to predict housing prices based on 2 criteria: square feet (X1) and the number of bedrooms (X2). The process of using a dataset for learning is shown in Figure 4. First, the dataset is randomly divided into 2 main portions: 1) a “training set” and 2) a “testing set.” The training dataset is used to adjust the weights of the network (ie, backpropagation). Then, the weights are frozen, and the network is tested on the “testing set” data. The comparison of the network output for the test data with the expected output is reported as the performance of the network.

Figure 4. Data division in the learning process of neural networks. The training data are used to minimize prediction error by optimizing weights and testing data are used to report performance of the algorithm.

As shown in Figure 4, for each input data sample (eg, X1 [square feet] = 4800, X2 [number of bedrooms] = 5), there is a corresponding output sample (eg, price = $870k) in the dataset. These types of datasets are called “labeled data,” and ML algorithms that use labeled datasets are called “supervised learning.”
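As a sketch of this data division, the following Python (NumPy) snippet splits a small, entirely made-up housing dataset into training and testing portions; the 75/25 split ratio and all values are illustrative.

```python
import numpy as np

# Illustrative labeled dataset: [square feet, bedrooms] -> price in $k
X = np.array([[4800, 5], [1200, 2], [2500, 3], [3100, 4],
              [1800, 2], [4100, 4], [900, 1], [2700, 3]], dtype=float)
y = np.array([870, 210, 450, 560, 300, 720, 150, 480], dtype=float)

# Randomly divide into a training set (75%) and a testing set (25%)
rng = np.random.default_rng(seed=0)
idx = rng.permutation(len(X))
split = int(0.75 * len(X))
train_idx, test_idx = idx[:split], idx[split:]

X_train, y_train = X[train_idx], y[train_idx]   # used to adjust the weights
X_test, y_test = X[test_idx], y[test_idx]       # held out to report performance
```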

Image Classification

AI techniques have been successful in categorizing images into classes. Categorization of images is something humans do subconsciously, and it falls into the realm of pattern recognition. One example lies in the field of radiology, where radiologists can look at multiple x-ray images and classify them as normal or presenting with pneumonia. In the AI community, this human-centered x-ray classification process can be modeled with “image classification,” a process in which the input to an algorithm is an image and the goal is to classify the image based on its contents.

Digital Image

Each digital image consists of small building blocks called pixels. For example, a captured full high definition (HD) image consists of 1920×1080 pixels. The more pixels used to represent an image, the higher the resolution. A digital color image is shown in Figure 5, in which any color is a combination of 3 main colors: red, green, and blue. Therefore, each color image consists of 3 color channels, where each channel provides the intensities of its corresponding red, green, or blue color. The average of these 3 color channels provides a grayscale image. A grayscale image is composed of shades of gray in which the intensity of each pixel is represented by a value between 0 and 255 (0 represents pure black and 255 represents pure white). A full HD digital image (1920×1080 pixels) in grayscale format can be represented by a matrix of numbers of size 1920×1080, where each value is a number between 0 and 255 (Figure 5).
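The following short NumPy sketch mirrors this description: it treats an image as an array of pixel intensities, separates the 3 color channels, and averages them into a grayscale matrix. The random pixel values stand in for a real photograph.

```python
import numpy as np

# A full HD color image: 1080 rows x 1920 columns x 3 channels (R, G, B),
# each pixel intensity an integer between 0 and 255.
rgb = np.random.randint(0, 256, size=(1080, 1920, 3), dtype=np.uint8)

red, green, blue = rgb[..., 0], rgb[..., 1], rgb[..., 2]

# Grayscale as the average of the three channels, as described in the text
gray = rgb.mean(axis=2).astype(np.uint8)
print(gray.shape)   # (1080, 1920): a single matrix of values in 0..255
```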

Figure 5. Color image and its corresponding red, green, and blue channels, as well as its grayscale image in which the intensity of each pixel is represented by a number between 0 and 255.

Image Classification Challenge

One naive solution for image classification is to feed all pixel values of an image as the inputs to a multilayer NN (Figure 2). For example, if we want to classify full HD images, we end up having 1920 × 1080 = 2073600 inputs (X1, X2, …, X2073600). An increased number of inputs increases the number of weights required. Although, in theory, it is possible to train such a network (finding optimum weights), it fails in practice mainly due to extreme computational expense. In fact, the solution to this challenging problem initiated the new era of AI.
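A quick back-of-the-envelope calculation shows the scale of the problem; the first-layer size of 1000 neurons is an illustrative assumption.

```python
inputs = 1920 * 1080              # one weight per pixel per first-layer neuron
hidden_neurons = 1000             # illustrative first-layer size
weights_needed = inputs * hidden_neurons
print(inputs)                     # 2,073,600 inputs
print(weights_needed)             # ~2.07 billion weights in the first layer alone
```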

Start of Artificial Intelligence Revolution

In 2012, the AI community was presented with a research challenge to find the best solution to classify a dataset consisting of 1.2 million hand-annotated images belonging to 1000 different classes of objects, such as container ships, mites, leopards, and cherries (ImageNet Large-Scale Visual Recognition Challenge [ILSVRC-2012]).4 A research team from the University of Toronto, led by Professor Geoffrey Hinton, presented a remarkable solution5 showing far superior results to traditional approaches by introducing a different type of neuron connection in a NN, called convolutional neural networks (CNNs), which is elaborated below.

Convolutional Neural Network

The structure of the network shown in Figure 2 is also known as a “fully connected network,” where in the first layer, each neuron is connected to all inputs, and similarly, in any other layer, each neuron is connected to all the neurons in the previous layer. Indeed, a fully connected network tries to find all possible correlations between all inputs (X1, X2, and X3).

In a naive solution of providing all pixels as the inputs (X1, X2, …, X2073600) to a fully connected network, the network tries to find all correlations between all pixels in an image. But do we really need to find all correlations between all pixels? The answer is no. For example, let us consider Figure 6, where there are 2 tools in the surgical scenario: a grasper on the left and a bipolar on the right. The pixels related to the grasper are highly correlated with each other. Similarly, the pixels related to the bipolar are also highly correlated with each other. However, the pixels of the grasper are not correlated with the pixels of the bipolar. The idea to solve this problem is to develop a NN that emphasizes the correlation of spatially close pixels and downplays the correlation between spatially distant pixels.

Figure 6. The concept behind convolutional neural networks.

To achieve this, weights are organized in matrix format and swept over the image, as illustrated in Figure 7, where we have 16 weights (W1, W2, ..., W16) arranged in a matrix. When the weight matrix visits a location on the image, it covers a 4 by 4-pixel square area where each weight value gets multiplied by the pixel value it covers, and the results are then summed and passed to a transfer function. In this example, the inputs X1, X2, …, X16 are 16 pixel values (each a number between 0 and 255) and the weights are W1, W2, …, W16. Therefore, X1 gets multiplied by W1, X2 gets multiplied by W2, and so on, until X16 gets multiplied by W16; the products are then summed and passed to a transfer function. This process is called “convolution” in the DL community. NNs that use this technique are called CNNs. Due to the low number of required weights in these structures, training these networks is manageable.
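A minimal sketch of this convolution sweep, with explicit loops for clarity (real DL libraries implement the same idea far more efficiently); the 8×8 image and random weights are illustrative, and stride 1 with no padding is assumed.

```python
import numpy as np

def convolve2d(image, kernel):
    """Sweep a weight matrix over the image: multiply, sum, move on."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):                # vertical sweep
        for j in range(out_w):            # horizontal sweep
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # X1*W1 + ... + X16*W16
    return out

image = np.random.randint(0, 256, size=(8, 8)).astype(float)
weights = np.random.randn(4, 4)           # the 16 weights W1..W16
feature_map = convolve2d(image, weights)
print(feature_map.shape)                  # (5, 5)
```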

Figure 7. Convolution process in which a weight matrix sweeps over all locations of an image in a systematic approach of moving in horizontal and vertical directions.

Figure 8 illustrates a generic architecture of a CNN, including multiple convolutional layers followed by a small-sized NN.6 To get a better understanding of the network function, the weight matrices of a network trained to classify human faces are visualized. As shown, in the earlier layers of the network the weight matrices learn basic features such as lines and circles, while in the later layers the weight matrices learn high-level features such as face structures. The extracted high-level features are then used to classify objects using a small-sized, fully connected NN.

Figure 8. Generic convolutional neural network (CNN) architecture and feature levels.

Let us consider one of the low-level features in Figure 8 (surrounded by a yellow box), which is responsible for extracting information about vertical lines. As this window sweeps over all locations of the image, it can extract all information about vertical lines at any location of the image. As a result, a small-sized weight matrix can extract all information about vertical lines in an image. Therefore, CNNs require a lower number of weights to be trained than fully connected NNs.

In summary, there are 2 main reasons that are known to be the foundation for the great success of CNNs over fully connected networks in image classification: 1) sparsity of connections (using a sweeping window of weights that emphasizes spatially close pixels while downplaying spatially distant pixels) and 2) parameter sharing (a low-level feature [eg, vertical line] can extract information from all over the image). Therefore, CNNs use fewer weights, and training them is manageable.

As mentioned above, fully connected networks fail to model image data. However, they are still useful when the dimensionality of the input is low. In Figure 8, after 5 layers of convolution, the dimensionality of the data is reduced and a small fully connected network is employed. In other words, fully connected networks cannot be used to model image data independently; rather, they serve as a subpart of CNN structures. Fully connected networks also have wide applications as independent models when the dimensionality of the inputs is low, such as modeling kinematic data of a surgical robot for manipulation and control.7

Image Classification in Medicine

One well-known application of CNNs is in classifying medical images. At Stanford University, an AI team led by Dr. Andrew Ng designed and implemented a 121-layer CNN called CheXNet. This network takes a chest radiograph as input and outputs the probability of pneumonia along with a heatmap of areas where pneumonia is thought to be present (Figure 9). Dr. Ng and his team have shown that their algorithm performs comparably to human radiologists. Their performance evaluation is based on a metric called the F1-Score, ranging from 0 to 1, with 1 indicating the highest level of accuracy on a dataset (further explained in the Metrics of AI Algorithm Performance section below). The resulting F1-Score of their algorithm was 0.435, while the average F1-Score of 4 radiologists was 0.387.8

Figure 9. Chest x-ray image is provided to the convolutional neural network (CNN), which then outputs the probability of pneumonia and a heatmap of locations (red area) that most indicate pneumonia.

Object Detection

In the previous section on image classification, we discussed how objects in images are recognized by AI algorithms. The underlying assumption was that there is only one object in the image. However, many real-world images contain multiple objects at different locations. For instance, in Figure 10, 3 different tools exist in a surgical scenario at different locations. To solve this problem, typical “object detection” algorithms (variations of CNNs) are used to perform 2 main tasks: 1) spatially localize the object of interest by generating a bounding box around the object in the image and 2) recognize the content inside the bounding box based on predefined categories. This process is illustrated in Figure 10, where 3 tools in the scenario were localized by bounding boxes and labeled accordingly by the algorithm (probability of belonging to each class, shown in a white tag, ranging from 0 to 1).9

Figure 10. Object detection: tool types are detected and bounding boxes show the spatial location of each tool.

You Only Look Once (YOLO) Algorithm

Based on the CNN structure, an object detection algorithm called “You Only Look Once” (YOLO) has been designed.10 More advanced versions of this algorithm have also been developed.11,12 For example, YOLO-v4 is the fourth version, which has shown superiority over former object detection algorithms due to its computational efficiency.13 YOLO algorithms can provide object detection results in real time. The term “real-time” is defined as providing predictions faster than 30 frames per second, at which point the human eye cannot notice any glitch in a streaming video. In other words, it is possible to feed a YOLO algorithm with video from a webcam and, in a real-time stream, have all objects localized with rectangular bounding boxes around them and their classes recognized and automatically labeled by the algorithm (see: https://www.youtube.com/watch?v=-tC7dmQIf_U).
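The real-time criterion can be checked by timing the per-frame prediction. In the sketch below, run_detector is a hypothetical stand-in for a trained YOLO model’s forward pass, and its output format is an assumption for illustration.

```python
import time
import numpy as np

def run_detector(frame):
    """Hypothetical stand-in for a trained YOLO model's forward pass."""
    return [("grasper", 0.91, (40, 60, 200, 180))]   # (class, confidence, box)

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)    # a dummy video frame

# Time the detector over many frames and compare against the 30 FPS threshold
start = time.perf_counter()
n_frames = 100
for _ in range(n_frames):
    detections = run_detector(frame)
elapsed = time.perf_counter() - start

fps = n_frames / elapsed
print(f"{fps:.1f} FPS; real-time" if fps >= 30 else f"{fps:.1f} FPS; too slow")
```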

Surgical Object Detection

In 2016, the University of Strasbourg (Research Group Computational Analysis and Modeling of Medical Activities) provided a video dataset of laparoscopic cholecystectomies for public use.14 The Strasbourg dataset was then annotated for the location and type of tools by a team led by Dr. Serena Yeung at Stanford University.9 The object detection algorithm used by Dr. Yeung provided bounding boxes for 7 different objects: grasper, bipolar, hook, scissors, clipper, irrigator, and specimen bag. Figure 10 shows a snapshot of their algorithm as it demonstrates the recognition and localization of 2 graspers and a hook. This research team showed that the continuous localization information provided by the algorithm can potentially indicate economy of motion for surgeons and provide useful information for surgical skill assessment.9

Video Classification

In video analysis, an action performed by a surgeon or a group of surgeons over a period of time is called an “activity.” Figure 11 depicts 6 frames of the activity of placing the gallbladder into the Endocatch bag, or “packing,” during cholecystectomy.14 Information about the activity of packing exists in the combination of all 6 frames rather than in each frame individually. In other words, this is a temporal problem in which the correlation between these sequential frames is the underlying information that defines an activity. As such, an algorithm to model temporal relationships needs to consider the sequence of frames instead of every single frame individually.

Figure 11. Packing activity in gallbladder removal is represented by 6 sequential frames.

Recurrent Neural Networks

“Recurrent Neural Networks” (RNNs) have been developed to tackle temporal problems. Figure 12 shows the concept behind the RNN algorithm, where the input from each time point (eg, x<1>) is fed to a NN, and information is then shared between these NNs using information-sharing weights (“Ws” in Figure 12) that are learned during the training process.
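A minimal NumPy sketch of this recurrence; note that the same weights are reused at every time step, which is the sharing the figure depicts. All dimensions and values are illustrative.

```python
import numpy as np

# Illustrative sizes: 4-dimensional inputs, 8-dimensional hidden state
Wx = np.random.randn(8, 4) * 0.1    # input-to-hidden weights (shared)
Wh = np.random.randn(8, 8) * 0.1    # hidden-to-hidden "sharing" weights (shared)
h = np.zeros(8)                     # hidden state carries information forward

sequence = [np.random.randn(4) for _ in range(6)]   # x<1> ... x<6>
for x_t in sequence:
    # The same Wx and Wh are applied at every time step
    h = np.tanh(Wx @ x_t + Wh @ h)
print(h.shape)   # (8,): a summary of the whole sequence
```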

Figure 12. Recurrent neural network (RNN) concept.

Surgical Video Classification

A combination of CNN and RNN algorithms has been used to understand surgical video content by recognizing activities inside a video. Figure 13 illustrates this concept, which is referred to as “video classification.” In 2019, Hashimoto and colleagues presented a DL approach to identify operative steps in laparoscopic sleeve gastrectomy (LSG).15 Their dataset included intraoperative videos from a total of 88 LSG cases. The videos were reviewed and manually segmented by board-certified surgeons into 7 phases: 1) port placement, 2) liver retraction, 3) liver biopsy, 4) gastrocolic ligament dissection, 5) stapling of the stomach, 6) bagging the specimen, and 7) final inspection of the staple line. In this approach, each frame is first introduced into a CNN for the purpose of extracting high-level features (like the high-level features in Figure 8), and then the high-level features are fed into a RNN to consider temporal aspects of the frames by learning and sharing information weights (“Ws”). Finally, the algorithm outputs the probabilities of belonging to each of the 7 phases of the procedure. This method achieved 82% accuracy in the classification of the surgical videos into the 7 phases. The model utilized a type of CNN algorithm called ResNet-18,16 and a type of RNN algorithm called “long short-term memory” (LSTM).17 Training DL algorithms requires a large number of data examples, and one practice to mitigate this challenge is to break down each surgical activity video clip into smaller clips. For example, instead of considering one activity video clip (eg, packing) that has a duration of 1 minute from beginning to end, we can break it down into thirty 2-second clips. In this way, we end up with 30 video clips of packing instead of one.
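The following PyTorch sketch shows the general CNN+RNN pattern described above (per-frame ResNet-18 features fed to an LSTM, ending in 7 phase outputs). The hidden size, clip length, and other details are illustrative assumptions, not the authors’ exact model, and a recent torchvision is assumed.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VideoPhaseClassifier(nn.Module):
    """CNN extracts per-frame features; LSTM models their temporal order."""
    def __init__(self, num_phases=7, hidden_size=256):
        super().__init__()
        resnet = models.resnet18(weights=None)        # per-frame feature extractor
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # drop classifier
        self.lstm = nn.LSTM(512, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_phases)

    def forward(self, clips):                         # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))         # (b*t, 512, 1, 1)
        feats = feats.flatten(1).view(b, t, 512)      # back to (b, t, 512)
        out, _ = self.lstm(feats)                     # share information across time
        return self.head(out[:, -1])                  # logits over the 7 phases

model = VideoPhaseClassifier()
clip = torch.randn(2, 16, 3, 224, 224)                # 2 clips of 16 frames each
print(model(clip).shape)                              # torch.Size([2, 7])
```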

Figure 13. Concept of video classification algorithm; combination of a convolutional neural network (CNN) and a recurrent neural network (RNN).

In a similar fashion, Jin and colleagues used a CNN+RNN model to perform video classification of a cholecystectomy dataset in 2018.14 The videos were segmented into 8 phases: 1) trocar placement, 2) preparation, 3) Calot triangle dissection, 4) clipping and cutting, 5) gallbladder dissection, 6) gallbladder packing, 7) cleaning and coagulation, and 8) gallbladder retraction.18 This group used a different type of CNN algorithm called ResNet-50,16 and used LSTM for its RNN algorithm. The authors later improved the accuracy of their algorithm in 2020 by incorporating an object detection algorithm into their previously tested CNN+RNN model.19

In another, larger-scale project, a total of 1051 laparoscopic cholecystectomy videos were annotated for additional findings beyond surgical instruments and procedure steps by an industry-based AI system. The annotations included disease severity, achievement of critical view of safety components, and presence of intraoperative events. The industry-based AI system also identified surgical phases of the procedure using a proprietary CNN+LSTM-based algorithm. With the help of a streamlined, easy-to-use software interface, this AI system was shown to facilitate faster review of videos by surgeon raters, as the annotated and segmented videos were reviewed by surgeons at a rate of 50 videos per hour. When comparing surgeon assessment of the videos with the computer-generated AI assessment, there was agreement between surgeons and AI in identifying critical view of safety components in more than 75% of the cases. For intraoperative events, there was agreement between the surgeon and the AI system in 99% of the cases.20

Metrics of AI Algorithm Performance

As shown in Figure 4, the “testing” data are used for evaluating the performance of an AI algorithm. This evaluation is based on the comparison of the network output versus the expected output. The expected output of an algorithm (used for evaluation on the testing data) is also called the “ground truth.” Multiple metrics have been developed to perform comparisons between a network output and ground truth. In this section, we present the most common metrics used in the evaluation of classification problems (image/video) and object detection problems.

Classification Performance Metrics

Performance metrics for classification are presented in Figure 14. All metrics in this figure range between 0 and 1, with higher numbers indicating better algorithm performance. To clarify the meaning of these metrics, let us consider the x-ray image classification shown in Figure 9. Here, the algorithm categorizes incoming images as having pneumonia or as normal. To evaluate the performance of this algorithm, assume we present a “testing” dataset of 1000 x-ray images in which we have already predetermined that 100 images show pneumonia (positive) and 900 images are normal (negative). This is the “ground truth.”

Figure 14. Classification metrics of performance.

Accuracy metric.

The “accuracy” metric is defined as the ratio of “total correct predictions” to “total predictions” (Figure 14). If we assume that the result of an algorithm on the “testing” data (network output) is as presented in Figure 15, we end up with 0.92 when we calculate accuracy. This might look very good, but is it? The answer is no. If you look at the false negative (FN) block in Figure 15, you see the value 80, which means 80 people (out of 100 total) with pneumonia (positive) are recognized as negative (normal) by the algorithm! Therefore, accuracy is not the proper metric for performance evaluation of classification problems. The following paragraphs give more context to this multilayered problem.

Figure 15. Poor performance indicated by low recall. TN, true negative; TP, true positive; FN, false negative; FP, false positive.

Recall.

To further investigate the problem with using the “accuracy” metric, we can try the metric “recall,” which is defined as the ratio of “true positive” to “total ground truth positive.” In Figure 15, the value of recall is calculated as 0.20, which also indicates poor classification performance of this algorithm. In other words, recall calculates how many of the ground truth positives (patients who in fact have pneumonia) have been classified as pneumonia-positive by the algorithm. In this example, the existing FNs (classifying a sick person as normal) could be detrimental; thus recall is the more informative metric.

Precision.

Now let us consider the algorithm output example shown in Figure 16. In this case, there are no FNs, all patients with pneumonia are recognized as sick by the algorithm, and we achieve a perfect recall score of 1. However, we have 80 false positives. This means 80 healthy people are identified as having pneumonia: the algorithm is highly sensitive but has low precision. Such algorithmic mistakes could lead to unnecessary courses of treatment. To address this issue, we utilize another metric called “precision.” Precision is defined as the ratio of “true positive” to “total predicted positive.” If we calculate this ratio in Figure 16, we end up with 0.56, which also indicates poor performance.

Figure 16. Poor performance indicated by low precision. TN, true negative; TP, true positive; FN, false negative; FP, false positive.

In both the Figure 15 and Figure 16 examples, we demonstrated that the accuracy metric is incapable of indicating FN or false positive problems while showing a high value of 0.92. That is the main reason the accuracy metric should not be used for classification problems, especially if the data are unbalanced (having different amounts of data in different groups). In this example, we have 100 patients in the pneumonia group and 900 in the normal group (unbalanced testing data), and accuracy failed to provide a good evaluation.

Figure 15 and Figure 16 also demonstrate 2 extreme cases. In Figure 15, precision was high and recall was low, and in Figure 16 it was the reverse. In fact, depending on the subject of study and the importance of avoiding FNs or false positives, researchers can decide how to weight recall or precision in their research. This is the most important take-home message when evaluating and determining which types of algorithms and metrics to use. In general, the performance of an algorithm should be good on both the recall and precision metrics. Therefore, in many applications, a specific type of average of these 2 metrics is reported, the so-called “F1-Score” (Figure 14).
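The small sketch below reproduces the numbers from these examples; for Figure 15, the true negative and false positive counts (900 and 0) are inferred from the reported accuracy of 0.92 and the class totals.

```python
def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)           # of all ground truth positives, how many found
    precision = tp / (tp + fp)        # of all predicted positives, how many real
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, recall, precision, f1

# Figure 15: 80 pneumonia cases missed (FN), no false alarms
print(metrics(tp=20, fp=0, fn=80, tn=900))    # accuracy 0.92, recall 0.20
# Figure 16: no missed cases, but 80 healthy people flagged (FP)
print(metrics(tp=100, fp=80, fn=0, tn=820))   # accuracy 0.92, precision ~0.56
```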

Object Detection Metric

As described in the section on Object Detection, the goal of an object detection algorithm is to perform 2 main tasks: 1) spatially localize the object of interest by generating a bounding box around the object in the image and 2) recognize the content inside the bounding box based on predefined classes.

Intersection over union (IoU).

To evaluate performance on the first task of an object detection algorithm, the IoU metric can be employed. Figure 17 shows a bipolar energy tool where there is overlap between the expected output (ground truth) annotated by a human and the network output (predicted) by the algorithm. The IoU metric calculates the ratio of the intersection to the union. In the case where there is no overlap between the “ground truth” and “predicted” boxes, the metric value is 0, and when there is perfect overlap, it is 1. To decide on correct or incorrect localization, we need to define a threshold for IoU. The most common thresholds used in the literature are 0.5, 0.75, and 0.95.
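A minimal sketch of the IoU computation for axis-aligned boxes; the box coordinates are illustrative.

```python
def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2). Returns intersection over union in [0, 1]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

ground_truth = (50, 50, 150, 150)            # annotated by a human
predicted = (60, 60, 160, 160)               # network output
print(iou(ground_truth, predicted) >= 0.5)   # True: correct at the 0.5 threshold
```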

Figure 17. The intersection over union (IoU) metric calculates the overlap between ground truth and network output.

Mean average precision.

As shown in Figure 17, the network not only identifies the class of an object (eg, bipolar) but also provides a number between 0 and 1 that indicates the confidence of belonging to that class (eg, confidence = 0.87). To perform an overall evaluation of all bounding boxes identified by the algorithm, selecting different confidence thresholds leads to different values of precision and recall.

Figure 18 illustrates an example of plotting different values of precision and recall for different confidence levels. In a conservative approach, if we select a very high confidence (eg, confidence = 0.97), we end up with high precision and low recall. On the other hand, in a relaxed approach, if we select a very low confidence (eg, confidence = 0.5), we end up with low precision and high recall. The area under the precision-recall curve is called “average precision” (AP), ranging between 0 and 1, where a higher value indicates better algorithm performance. The average of this metric over all classes of objects is called “mean average precision” (mAP).
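A sketch of this calculation with illustrative precision-recall points; the trapezoidal rule used here is a simple approximation of the area, whereas benchmark evaluations typically use interpolated variants.

```python
import numpy as np

# Illustrative (recall, precision) pairs obtained by sweeping the confidence
# threshold from conservative (high) to relaxed (low)
recall    = np.array([0.10, 0.30, 0.50, 0.70, 0.90])
precision = np.array([0.98, 0.92, 0.85, 0.70, 0.45])

# Average precision approximated as area under the precision-recall curve
ap = np.trapz(precision, recall)
print(round(ap, 3))

# Mean average precision: average the AP over all object classes
ap_per_class = {"grasper": 0.71, "bipolar": 0.64, "hook": 0.69}   # illustrative
map_score = sum(ap_per_class.values()) / len(ap_per_class)
print(round(map_score, 3))   # 0.68
```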

Figure 18. Average precision (AP) is defined as the area under the precision-recall curve depicted for different confidence levels.

AI Workload

Overall, many factors influence the pace of research. To provide insight into the time required, Table 1 exemplifies 2 cases: 1) an object detection problem with a dataset of 1000 images annotated for the location and type of surgical tools, and 2) a video classification problem with a dataset of 100 procedural videos.

Table 1.

Estimated time commitment for conducting specific research steps.

Object Detection (using 1000 labelled tools)
Research Step	Time Commitment
Data preparation	2–3 weeks
AI development and implementation	1–6 months for new algorithms; 1–6 weeks using existing algorithms
Training computational cost	1–2 days using transfer learning; 1–2 weeks without transfer learning
Testing computational cost	A few minutes for the entire testing data; real-time for a single testing sample
Video Classification (using 100 videos of procedures)
Research Step	Time Commitment
Data preparation (labelling takes the most effort)	1–2 months
AI development and implementation	3–9 months for new algorithms; 2–8 weeks using existing algorithms
Training computational cost	2–3 days using transfer learning; 2–3 weeks without transfer learning
Testing computational cost	1–2 hours for the entire testing data; a few seconds for a single testing sample

AI, artificial intelligence.

ARTIFICIAL INTELLIGENCE IN SURGICAL RESEARCH: A CASE STUDY FROM THE T.E.C.I. CENTER

As we work towards improving and perfecting AI applications in surgery, there continues to be a critical need for surgical performance data that can serve as a point of departure for collaborators. In this section, we provide an example of a synchronized, multisource surgical performance database that was used as the basis for multiple research collaborations geared towards exploring the utility of AI for assessing surgical performance. First, we explain the study design and then present the role of each collaborator.

Data Collection Venue and Process

In 2019, researchers from the T.E.C.I. Center at the Stanford University Department of Surgery formed a collaboration with the leadership of the American College of Surgeons (ACS). The overarching goal was to jumpstart the conversation on digital surgical performance data and how it may be used in research and clinical practice. The inaugural event took place in the exhibit hall at the 2019 Clinical Congress in San Francisco, California. Wearable technology was used in a simulated surgical environment to capture surgical performance data and to empirically investigate the relationship of surgical decision-making and technique to procedural outcomes. A total of 255 attendees, including 40 resident physicians, 201 attending surgeons, and 14 retired surgeons, wore a variety of sensors while repairing a segment of porcine intestine with a standardized injury. After arriving at the exhibit booth, participants completed a 1-page demographic survey and were guided to one of the 10 simulated OR tables (Figure 19).

Figure 19. Ten data collection stations at the 2019 American College of Surgeons Clinical Congress.

Surgeons were asked to run the bowel to look for enterotomies and perform a repair. Two web cameras collected videos of each surgeon’s simulated repair. One camera provided a top-down view of the whole operative field, and another was used for focused capture of the surgeon’s hand movements. Six motion monitoring sensors were used (on the thumb, index finger, and wrist of each hand) to track hand location and movements over time (Figure 20).

Figure 20. Participant performing bowel repair on the simulated bowel enterotomy scenario with motion sensors affixed to his hands under sterile surgical gloves.

Surgical Decision Pathways

Preliminary findings from the ACS-Stanford University Surgical Metrics Project have identified decision-making strategies and technical approaches that significantly affect procedure outcomes (bowel leak). Human annotation of the video database was performed by 2 researchers. From the 156 participants who identified both bowel injuries, 55 unique operative decision paths were identified. Figure 21 provides a high-level diagram illustrating how the data and operative steps were stratified. To minimize the human effort needed to fully map and understand the variety of decisions made by the participants, our goal was to enlist AI researchers to develop algorithms that could automatically detect the operative decisions.

Figure 21. Illustration of the operative workflow from visual inspection to multiple decision pathways that lead to an operative outcome.

Building Successful AI Collaborations

It can take several years to build the ideas and infrastructure to properly analyze data using AI. An excellent option for people who have data is to find collaborators with expertise in AI. From our experience in vetting AI research teams, many AI researchers have an interest in surgical data, and each team has different interests and preferred approaches to data analysis. Given the complexity of our data, it made the most sense for us to explore a variety of different collaborations at once to allow an opportunity to cross-compare the outcomes of different analysis paths. Issues like shared communication, workflow, and analysis plans all vary between research teams. In this section, we present our collaboration experience with 3 different groups. In each example, we share the steps taken to form these relationships, how the relationship evolved over time, and a summary of the analysis conducted in each collaboration.

Overview of the Analysis and Role of Each Team

The goal of AI analysis of the Surgical Metrics Project dataset is to automate output information that human observers could otherwise provide (Figure 22). To develop such a system (called core AI herein), different teams focused on developing different parts of the system including medical knowledge, object detection, and video classification (Figure 22).

Figure 22. Overview of the artificial intelligence (AI) system for American College of Surgeons data analysis and the role of collaborators.

Data Use Agreement

Collaborations involving shared datasets require a data use agreement (DUA), which is a formal contract that sets rules for the exchange of data between 2 or more parties. A DUA documents who can have access to the shared data, what sort of manipulation or analysis can be run on the data, as well as publishing rights and other permissions or restrictions. A DUA is an important step in maintaining the safety and confidentiality of data and is required when a dataset includes protected health information, personally identifiable information, or moderate/high-risk data. Additionally, a DUA is often required for data crossing international borders.

Johns Hopkins University Computational Interaction and Robotics Laboratory

Relationship

Gregory D. Hager, PhD, is the Mandell Bellmore Professor of Computer Science and Founding Director of the Malone Center for Engineering in Healthcare at Johns Hopkins University (JHU). He is also the director of the Computational Interaction and Robotics Laboratory (CIRL) in the JHU Department of Computer Science. During his 22-year tenure at JHU, Dr. Hager’s research has revolved around vision-based robotics, time-series analysis of image data, and medical applications of image analysis.21 Dr. Hager also developed the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS), a dataset of surgical activities consisting of kinematic and video data from surgeons using the da Vinci Surgical System.22 Dr. Hager is also well known for his pioneering work on surgical ontology, in his case, the “language of surgery,” which seeks to model the steps of surgical procedures and evaluate surgical skill from operative data. In addition to his contributions to the development of advanced imaging methods for image guidance and the quantification of human performance with interventional systems, Dr. Hager has co-founded 2 startup companies dedicated to the betterment of medical device usage: Clear Guide Medical enables doctors and technicians to perform more accurate ultrasound-guided procedures, and Ready Robotics is dedicated to making industrial robots easier to use.

In 2015, Dr. Pugh was invited for a visiting professorship at JHU with the Department of Surgery and the Department of Engineering, where she became acquainted with Dr. Hager and his research team. They reconnected soon afterward in 2016 at the Surgical Data Science Meeting in Heidelberg, Germany, where Dr. Pugh presented her work on using motion and video data to quantify surgical performance. With aligned research interests, kindred entrepreneurial spirits, and shared determination, a synergistic relationship was cultivated, transcending state lines and time zones. Dr. Pugh and Dr. Hager fostered a cross-country collaboration and were recently granted a $2.5 million NIH-R01 grant award (1 R01 DK123445–01).

Nature of the T.E.C.I.-CIRL Collaboration

The T.E.C.I. Center (Stanford University School of Medicine) and CIRL (JHU) are 2 research labs made up of clinicians, engineers, and researchers who are experts in their fields. Communication has been the key to traversing the landscape of this multidisciplinary collaboration and maintaining the right focus and research deliverables. In addition to weekly meetings, members from both laboratories share and review current findings, give updates on tasks, and provide thoughts on future directions. During the days between meetings, laboratory members from both groups communicate to discuss future work and goals. The workflow usually consists of CIRL, the computer science researchers, asking for the T.E.C.I. Center’s medical expertise to streamline efforts towards an AI solution based on automated feedback. Since both collaborators put their resources into the project, they are fully committed.

Strict Data Sharing Requirements

To protect the surgical performance database, the T.E.C.I. Center needed to ensure all researchers on both teams had the appropriate training on how to handle human subjects’ data and how to follow institutional review board (IRB) policies. These policies included: 1) de-identifying the data and 2) securely sharing the data. To de-identify the video data, T.E.C.I. Center researchers cropped out faces and blurred conference nametags. Data were sent using Stanford’s secure Box infrastructure. The time required to authorize data sharing can lead to halts in research and can be frustrating. To mitigate this hurdle, data transfer protocol discussions should begin as soon as possible so that when the project begins, there is no delay in analysis.

Time and Team Management Around Human Dependent Video Annotation

When training an AI model, annotation of the videos is a pivotal stepping stone, as it provides the ground truth information that is input to the model. The amount of time needed to annotate videos can vary depending on the number of annotation features needed and the length of the procedure. In the case of the Surgical Metrics Project, the videos ranged from 15 to 20 minutes each. Annotation features consist of the start and end time points of specific surgical tasks and maneuvers, tool usage for both assistant and participant, and the various active or inactive states of tools. Each annotation is done by a human reviewer, and annotating a single participant’s video can take 2 to 4 hours to complete. With more than 100 participants, that is more than 300 hours of work to complete the annotations. If you had 3 reviewers working 6 hours a day solely on annotation, it would take about half a month to complete all the annotations. Annotation is also a very tedious and meticulous process and requires a well-defined protocol for consistency. For instance, the end of a cutting activity using a pair of Metzenbaum scissors has been defined for the annotators as “when they observe that the two blades of a scissors meet.” Defining a good protocol mitigates some inconsistencies in data annotations. Waiting multiple weeks for annotations to be completed is a lot of time spent not knowing if the AI model will even work on your dataset. One way to use this time efficiently is to agree on a rough roadmap for the data analysis and modeling (eg, define and clarify successful outcomes, discuss potential pitfalls with the modeling strategies, and agree on alternate solutions). Splitting the work across both teams is important, as it improves team communication by making both groups clarify and agree on the intended outcome.

CIRL Data Analysis

The CIRL team is engaged in programming AI systems for: 1) identifying the type of surgical maneuvers (such as suturing, cutting, knot tying) and 2) identifying suturing strategies (interrupted, running sutures). Two types of data are being leveraged to achieve these goals: 1) motion data (as described in the “Data Collection Venue and Process” section above) and 2) instrument usage data. The instrument usage data are derived from the videos as follows: the videos are annotated by reviewers, and for every second that an instrument is in the surgical field, its activity state (how it is being used) is noted. Five possible states are usually noted for each instrument: 1) surgeon active—active in the surgeon’s hand; 2) surgeon inactive—inactive in the surgeon’s hand (ie, the instrument is docked in the surgeon’s hand but not being actively used); 3) assistant active—active in the assistant’s hand; 4) assistant inactive—inactive in the assistant’s hand; and 5) surgical field inactive—not present in the frame but used. These different states are shown in Figure 23.

Figure 23. Instrument usage in 5 different states.

Algorithm

An LSTM algorithm is trained on the motion and instrument usage data to identify the following surgical maneuver classes: suture throws, knot ties, thread cuts, and background. The fourth class indicates that none of the first 3 actions is being performed. Furthermore, a variant of the LSTM is used to detect the suturing strategy: 1) interrupted suture repair—a suture in which each stitch is tied separately, and 2) running suture repair—one continuous suture is used to close the wound, alternating edges as the repair progresses.
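A hedged PyTorch sketch of this kind of LSTM classifier over fused motion and instrument-usage features; the feature dimension, hidden size, and window length are assumptions for illustration, not CIRL’s actual model.

```python
import torch
import torch.nn as nn

MANEUVERS = ["suture throw", "knot tie", "thread cut", "background"]

class ManeuverLSTM(nn.Module):
    """LSTM over per-second motion + instrument-usage features."""
    def __init__(self, feature_dim=24, hidden_size=64):
        super().__init__()
        # feature_dim is illustrative: eg, 6 sensors x 3 coordinates of motion
        # data plus encoded instrument usage states
        self.lstm = nn.LSTM(feature_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, len(MANEUVERS))

    def forward(self, x):              # x: (batch, time, feature_dim)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])   # logits over the 4 maneuver classes

model = ManeuverLSTM()
window = torch.randn(1, 30, 24)        # 30 time steps of fused features
pred = model(window).argmax(dim=1)
print(MANEUVERS[pred.item()])
```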

Results and Next Steps

The model was able to identify surgical maneuvers with an accuracy of 79%. A precision of 56% was achieved in identifying the suturing strategy. Future steps include incorporating the video data to improve the performance of these models.

Stanford Medical AI and Computer Vision Lab (MARVL)

Relationship

Dr. Serena Yeung is an Assistant Professor of Biomedical Data Science at Stanford University. Through mutual connections, members of the T.E.C.I. Center first met Dr. Yeung in 2019. Currently, Dr. Yeung is the Director of the Stanford Medical AI and Computer Vision Lab (MARVL), where her research focuses on CV, ML, and DL in healthcare applications.23 Prior to accepting her position at Stanford, Dr. Yeung worked at Facebook on the Facebook AI collaborative from 2016 to 2018 and then pursued a project at Google working on Google Cloud AI from 2018 to 2019. Dr. Yeung is also affiliated with the Stanford Clinical Excellence Research Center. Dr. Pugh and Dr. Yeung first met when working together with fellow colleagues interested in the use of AI for analyzing surgical video. Dr. Pugh and Dr. Yeung then met at the T.E.C.I. Center to discuss future endeavors. From there, they detailed the capabilities of their laboratories and discussed areas in which they could provide and combine their expertise. At the end of these meetings, a mutually beneficial project was agreed upon for both the T.E.C.I. Center and Stanford MARVL.

Nature of the T.E.C.I.-MARVL Collaboration

As a new laboratory focused on medical AI, the Stanford MARVL laboratory drew the interest of many collaborators and had several well-established and funded projects. Compared to their other projects, committing resources and time to the T.E.C.I. collaboration would require careful planning. To ensure success, the T.E.C.I. Center led the initial discussion on data sharing, general project knowledge, and timelines, and agreed to prepare the data based on goals established across both teams. After completing the data preparations, MARVL was able to allocate their resources more strategically.

The Strategic Step-by-Step Approach

In the first iteration of data analysis, MARVL provided an undergraduate student to lead the data analysis, as other laboratory personnel were already allocated to funded projects. For a variety of reasons (eg, student classes and other commitments), this process was notably slower and was supported and facilitated by team meetings with senior researchers and laboratory principal investigators. Of note, the academic calendar also had a significant impact on the availability of the undergraduate student, making it difficult to achieve a consistent and efficient stream of communication and commitment to the project. The T.E.C.I. Center continued to lead the collaborative effort, knowing that these were the steps that needed to be taken to help develop a long-term working relationship supported by funding. In the end, there were major successes in data sharing and project goals, and eventually MARVL had the resources to provide a more senior researcher who had minimal curricular restraints and more experience as an independent researcher. These attributes greatly accelerated team workflows and data analysis.

Managing Expected Deliverables

Working with MARVL required the T.E.C.I. Center to provide 2 pillars of support: data security and data management. The first pillar, data security, required the T.E.C.I. Center to be knowledgeable in the data security protocols set in place for their datasets. This meant that administrative tasks relating to data security, such as adding personnel to an IRB, authorizing researchers in training, or organizing the distribution of data, were the T.E.C.I. Center’s responsibility. The second pillar, data management, required the researchers of the T.E.C.I. Center to make sure files were properly formatted and met the requirements of MARVL. Data management also consisted of data annotation to capture the key features MARVL would be interested in investigating. Similar to CIRL, the workload to finish the annotation of one participant could take hours to complete. Table 1 provides a summary of the time commitments required for some of the research steps. This is extremely important information for new groups seeking to build research collaborations in AI. Lastly, data management required validation and authentication of each data stream to avoid corruption in the analysis.

Stanford MARVL Analysis

Before working together with the T.E.C.I. Center, MARVL trained a model to identify hand anatomy landmarks (eg, wrist and finger joint locations), surgical tool detection (eg, forceps, needle drivers), as well as surgical activities (eg, suture tie, cut) from a library of surgical videos. Although the surgical procedures, tools, and activities of these videos varied from the T.E.C.I. Center’s data, the parameters learned from training this model could be useful as a baseline. The model would need to be retrained using the T.E.C.I. Center’s video data to achieve a separate, but related, learning task.

Data

The videos collected during the Surgical Metrics Project from the simulated enterotomy procedure were scanned to identify times in the video when the surgeon’s activity matched the surgical maneuvers of interest. The start and stop timestamps of these activities were noted. There were 3 such maneuvers or classes: 1) suture throw, 2) knot tie, and 3) thread cut. The videos were divided to extract clips corresponding to these classes.

Algorithm

The approach taken by the MARVL group involves feeding these video clips to a variant of CNNs to detect the surgeon’s maneuvers.

Results and Next Steps

Figure 24 shows the pattern of maneuvers for each type of suture repair. Each row represents a procedure, and the horizontal axis represents time. Each block of color indicates the duration of time in which 1 of the 3 maneuvers was being performed. Green indicates suture throws, blue indicates knot ties, red indicates thread cuts, and white indicates “other” background information (none of the 3 maneuvers is being performed). The difference in patterns for the 2 types of repairs is visually perceivable. Specifically, it can be observed that in the interrupted suture repair the 3 maneuvers (suture throw, knot tie, and thread cut) happen consecutively and repeatedly. In the case of the running suture repair, several suture throws are followed by several knot ties and thread cuts. These are the types of patterns that the newly developed deep NN can recognize and provide to the core AI algorithm.

Figure 24. Pattern of maneuvers used in interrupted suture repair versus running suture repair showing the increased presence of knot tying (blue) and suture cutting (red) for the interrupted repair.

These findings are complementary to the CIRL approach, as they confirm that a pattern exists for different technical approaches and that the patterns can be quantified and differentiated.

Technion University

Relationship

Shlomi Laufer, PhD, was a postdoctoral researcher with a background in biomedical engineering in Dr. Carla Pugh’s research laboratory at the University of Wisconsin-Madison from 2012 to 2016. Dr. Laufer began his research with Dr. Pugh by collaborating on the augmentation of a previously developed sensorized clinical breast examination simulator, funded by an NIH R01 grant and built to capture and quantify clinical expertise. Later, Dr. Laufer was the lead senior researcher in the Pugh laboratory conducting research on surgical skills decay funded by a grant from the Department of Defense (DoD). The data collected from this project have been used to build his research focus and help build the foundations of his own research laboratory. When Dr. Laufer’s time with Dr. Pugh’s laboratory ended, he transitioned to Technion University in Israel, where he is an Assistant Professor of Industrial Engineering and Management. The research performed in Dr. Laufer’s laboratory focuses on the use of advanced sensor and video technologies, as well as data analysis techniques, to measure human performance and workflows. Special emphasis is given to automatic assessment and measurement of clinical and surgical proficiency.

Dr. Pugh and Dr. Laufer continued to collaborate after his departure from the Pugh laboratory in Wisconsin. For the 2019 Surgical Metrics Project, Dr. Laufer attended the ACS data collection and developed a device that was used to test bowel leaks by pumping fluid into the intestines. Dr. Laufer also took interest in helping to gather other data streams and attends regular lab meetings with the T.E.C.I. Center regarding the data analyses of the Surgical Metrics Project Data.

Nature of the T.E.C.I.-Technion Collaboration

The Technion team, led by Dr. Laufer, has an ongoing research interest in the data collected from the Surgical Metrics Project. The Technion team had prior experience working with multimodal data, and Dr. Laufer was very familiar with the methodologies being used. Dr. Laufer was originally part of the development team that created the T.E.C.I. Center’s multimodal data capture approach for the DoD grant, giving him strong insight into the strengths and weaknesses of the data collection technology and approach, and into the natural limits of the analysis strategies. Having a collaborator who was once part of the T.E.C.I. Center made for a smooth approach to conversations and workflow, leading to efficient implementation of research team strategies and updates to the analysis protocol. The relationship between Dr. Pugh and Dr. Laufer created a strong sense of collaboration due to the harmonious back and forth of ideas between the 2 researchers. Dr. Laufer is an expert in the field of computer science and engineering, but his experience working with Dr. Pugh gave him insight into the clinical application that he might not have been able to incorporate sufficiently without the collaboration. A prior collaborative relationship and the ability to see from multiple, previously established viewpoints gave this collaboration an exceptional opportunity to work efficiently, refine the new project goals, and develop impactful results.

Major Hurdles

Dr. Laufer’s Technion team is in Haifa, Israel, while the T.E.C.I. Center is located in California, in the United States. The 10-hour time difference brought hurdles when scheduling meetings. Most meetings would be set late at night for Technion, making it difficult for their team members to attend. Because of this, Dr. Laufer was usually the only member able to attend, and he became the only point of contact between Technion and the T.E.C.I. Center. This caused a lag in communication between the collaborators and slowed their members’ progress. Fortunately, Dr. Laufer was able to utilize his position as the point of contact and facilitated effective communication from T.E.C.I. Center members to members of his team.

Managing Expected Deliverables

The collaboration between Technion and the T.E.C.I. Center relied on the outcomes of the data collection for the new Surgical Metrics Project and how well it was, and is, managed, coded, and stored. Technion provided major support to the data collection effort by joining planning meetings about the data collection and volunteering to design a system to consistently pump fluids into the porcine intestines for conducting leak tests. Most of the work done by the T.E.C.I. Center would be managing the data collection, following data management protocols, and providing administrative support. Technion would help by advising on possible routes to successfully transfer and manage data, but it would fall on the T.E.C.I. Center to provide the hands-on work in accomplishing the data transfer and management tasks. Other deliverables expected of the T.E.C.I. Center were setting up the data sharing protocols between international collaborators and managing data security and de-identification.

Technion Analysis

To perform an analysis that identifies surgical maneuvers, such as suturing, thread cut, and knot tying, it is important to identify the tools being used by the surgeon and surgical team. Therefore, the Technion team focused on the object detection aspects of this problem.

Data

The videos from the Surgical Metrics Project data collection were processed, and approximately 5000 frames were selected and annotated by placing bounding boxes around the hands and tools; an example is seen in Figure 25.

Figure 25.

(A) Result of the object detection algorithm developed by the Technion team for recognizing hands and tools in the bowel repair simulation; hands and tools were all detected individually. (B) The algorithm can detect which tool is in use with which hand.

Algorithm

Subsequently, the Technion team designed a DL algorithm based on the Yolo object detection approach. They then improved the algorithm by using an RNN model to incorporate temporal information about the identified objects across frames.
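For readers who wish to experiment, the sketch below illustrates the general idea of frame-by-frame object detection with a pretrained Yolo model. It is a minimal sketch and not the Technion team's implementation; the video file name and the use of the publicly available YOLOv5 model are our own illustrative assumptions.

```python
# Illustrative only: frame-by-frame detection with a publicly available
# pretrained YOLOv5 model. This is NOT the Technion team's pipeline; the
# video file name is hypothetical.
import cv2
import torch

# Download a small pretrained YOLOv5 model via torch.hub (internet required).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

cap = cv2.VideoCapture("bowel_repair_trial.mp4")  # hypothetical video file
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = frame[..., ::-1]  # OpenCV reads BGR; the model expects RGB
    results = model(rgb)
    # One row per detected object: bounding box, confidence, class name.
    print(results.pandas().xyxy[0][["name", "confidence"]])
cap.release()
```

A domain-specific detector, such as the Technion team's, would instead be fine-tuned on annotated surgical frames like those described above.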

Results and Next Steps

In addition to detecting tools and hands in the video, the model can also associate tools with hands. For example, it can distinguish between a needle driver held in the right hand versus the left hand, as shown in Figure 25. In the future, the object detection information provided by the Technion team’s algorithm will be incorporated into the core AI algorithm.

Summary

Each of our collaborators has played a critical role in helping us to develop a strategy for analyzing the Surgical Metrics Project data. Figure 22 shows the different approaches and contributions of each of our collaborators. Eventually, we will combine the different algorithms and approaches with the goal of codifying an algorithm protocol that can be used for other datasets investigating surgical mastery.

THE ROLE OF ARTIFICIAL INTELLIGENCE IN SURGERY

Disruptive Trends in Healthcare: Surgical Intelligence

AI stands to disrupt the surgical field by serving to augment a surgeon’s decision making, address variability in surgical outcomes, and reduce surgical errors. The academic leaders in the surgical AI space have determined that surgeons are well-positioned to help integrate AI into modern practice.24 The Information Age offers a platform for transdisciplinary collaboration across engineers, data scientists, and surgeons to revolutionize the ways in which surgery is taught and practiced.24 Such opportunities may include an improved knowledge base (via enhanced information access and exchange), enhanced operative decision making, or using AI for the evaluation of data to improve patient selection, patient safety, and the patient experience.

Improving Clinical Decision Support

Since beating human champions on Jeopardy! in 2011, IBM's Watson has been further developed to use AI for clinical decision support. The Watson for Oncology (WFO) tool evaluates patients’ symptoms and offers support based on knowledge from more than 300 medical journals and textbooks. The recommended treatment plans can be divided into 3 categories: “recommended,” where there is strong evidence for the therapy to be considered; “for consideration,” where these therapies are suitable alternatives that can be considered based on the physician’s judgment; and “not recommended,” indicating therapies that have strong evidence to not be used.22

Surgical Augmentation

Another application of AI in surgery is the use of AI-enabled augmentation for surgical procedures. Navarrete-Welton and Hashimoto conducted a literature review on the current state of research on AI-augmented intraoperative decision support. In the review, they looked at 3 main categories for developing decision support: 1) increasing the information available to surgeons; 2) accelerating intraoperative pathology; and 3) recommending specific surgical procedure steps. They identified a wide variety of journal articles for each of these 3 categories and organized the articles according to their clinical applications. These applications ranged from using AI to control the amount of ultrasonic energy released during phacoemulsification surgery to using AI for localizing glioblastoma tumors after initial resection. From their findings, they concluded that research on AI-augmented intraoperative decision support has been successful in determining possible points of intervention in the surgical workflow, but the field is still very much in its infancy and will need continued research before it can provide decision support at the level surgeons expect.25

An example, seen in the report by Madani and colleagues,26 shows a model that was trained to identify anatomy in laparoscopic video to potentially indicate “go” and “no go” zones in the liver, gallbladder, and other areas. Such models may be used in the future to provide real-time visual demarcation that allows surgeons to identify and avoid high-risk zones to mitigate adverse events such as common bile duct injury. An additional example is the Monarch platform (Auris, Redwood City, CA), a robotic vision-enabled peripheral bronchoscope that has allowed physicians to find peripheral pulmonary nodules and achieve precision in diagnosing lung cancer.27

The use of digital pathology in clinical research has led to the development of machine learning and AI methods that can perform tissue interrogation and potentially improve our understanding of disease mechanisms.28 These methods are already in use to understand kidney28 and liver diseases. In a donor liver study, the authors used a CNN to estimate the likelihood of graft macrosteatosis (MS) during liver procurement; the model had higher agreement with expert reviewers than the on-service pathologists at the time of initial evaluation.29 These methods have also been helpful in identifying cancer in biopsy tissue.30-32 In one example, the authors created a CNN to develop features from hematoxylin and eosin-stained breast biopsy images that can be used to distinguish between carcinoma and non-carcinoma.33 Their model achieved 83% classification accuracy and 95% sensitivity. Additionally, Bulten and colleagues developed an automated DL system that detected the Gleason scores of random prostate biopsies with performance similar to that of pathologists, which could help with diagnosing prostate cancer. The use of AI technology for pathologic diagnosis is a small step toward enabling patients to receive care closer to their homes.

Analysis of Large Data Sets

Another way AI may disrupt surgery is in analyzing large volumes of data for patient safety purposes. Dr. Teodor Grantcharov, a surgeon-scientist and entrepreneur, has adapted the black box paradigm of aviation safety to the surgical suite. The operating room (OR) black box process involves the use of sensorized technologies in the hospital environment that can measure and record a plurality of parameters during the procedure. Evaluating these data with clinically trained DL systems to perform advanced data analytics can help identify patient safety issues as well as technical performance, and then correlate these data with patient outcomes.34 Other researchers have used AI to improve hospital efficiencies. Current OR scheduling systems can be problematic if they schedule too much or too little time for surgery. Many data points affect the scheduling, including the patient’s body-mass index (BMI), the typical length of the surgery in question, and the time taken to clean and set up the OR between cases (ie, OR turnover time). With the value of the OR estimated to be $62 per minute,35 underutilization is a situation hospitals try to avoid. Similarly, for patient satisfaction, appropriate timing is key. Recent work has shown that there is potential for a more efficient way of scheduling using AI: relevant patient information is used to build a model to predict how much time is needed for a certain procedure.36 These estimates can then be used to schedule more realistic lengths of time for surgeries.
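As a minimal illustration of this idea, the following sketch trains a simple regression model to predict case duration from a few features. The features, values, and model choice are hypothetical stand-ins for the much richer data a real scheduling system would use.

```python
# Minimal sketch of an AI-based OR scheduling aid: predicting case duration
# from simple features. Feature names and data are hypothetical; a production
# system would draw on far richer hospital records.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical historical cases: [patient BMI, procedure code, surgeon id]
X = np.array([[24.0, 1, 0], [31.5, 1, 1], [28.2, 2, 0], [35.0, 2, 1]])
y = np.array([95, 120, 150, 185])  # observed case durations in minutes

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Predict how long to book the OR for an upcoming case.
upcoming = np.array([[30.0, 2, 0]])
print(f"Predicted duration: {model.predict(upcoming)[0]:.0f} minutes")
```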

Healthcare data have a wide variety of sources, such as insurance claims, physician notes, diagnostic data such as images from scans, and information from wearable devices. There is a current need to integrate these data to facilitate a better understanding of patient safety variables, and for these diverse data streams to be easily accessible and referenceable for large-scale analysis. For example, graph analysis techniques are often used to understand relationships between factors in large amounts of data by visualizing graphs consisting of nodes and edges. Such techniques have been used by some hospitals to understand the relationships among complex variables such as laboratory results, patient history, and nurse notes in order to predict adverse outcomes in patients.37
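The sketch below gives a minimal sense of such a graph analysis; the node names and relationships are entirely hypothetical.

```python
# Minimal sketch of graph-based analysis of heterogeneous patient data.
# Node and edge names are hypothetical; real systems link thousands of
# variables drawn from labs, notes, and device streams.
import networkx as nx

G = nx.Graph()
# Nodes represent patients and clinical factors; edges link related items.
G.add_edge("patient_001", "elevated_lactate")
G.add_edge("patient_001", "nurse_note_flag")
G.add_edge("elevated_lactate", "adverse_outcome")
G.add_edge("nurse_note_flag", "adverse_outcome")

# A simple query: which factors connect this patient to the outcome?
for path in nx.all_simple_paths(G, "patient_001", "adverse_outcome", cutoff=2):
    print(" -> ".join(path))
```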

National Organizations

As AI continues to surge in the healthcare sector, the organizations and communities that guide surgical excellence have increased their involvement in exploring the use of AI-based technologies to improve patient outcomes. As such, many have developed pilot programs using AI in a variety of ways.

American Board of Surgery

The American Board of Surgery (ABS) has launched a pilot program to explore the scalability of video-based assessment platforms by partnering with 3 industry groups.38 The video-based assessment platforms leverage AI and computer vision to help surgeons benchmark their performance by pinpointing areas that need improvement or redirection and, ultimately, improve the quality of patient care. The ABS will recruit at least 150 board-certified surgeons to participate in the pilot involving the 3 video-based assessment platforms. Surgeons will use a standardized process to upload videos of their operations and will also be asked to review videos of their peers. Surgeons will submit videos from a predetermined list of procedures, and videos will be de-identified for patient and surgeon anonymity. Pilot participants will provide feedback on their experience with the video-based assessment platforms to the ABS and will also have the opportunity to receive quantitative and qualitative performance feedback. The ABS will be exploring surgeons’ ratings, engagement, performance data, and other key indicators with respect to the 3 video-based assessment platforms. This pilot seeks to assess the feasibility of AI-powered video-based assessment as a component of the ABS Continuous Certification Program and will open conversations regarding full implementation in the future. ABS leadership will use the insights from this pilot to make recommendations that may help guide the development of systems that objectively measure performance guided by patient outcomes.

Society of American Gastrointestinal and Endoscopic Surgeons

The Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) has formed an AI Task Force to improve patient care by augmenting surgeons’ cognitive capabilities and operative experience through the dissemination and democratization of knowledge leveraged by the use of AI.39 As presented by Sol Oh and colleagues at the 2017 SAGES conference, AI can be used to assess the critical view of safety (CVS) in laparoscopic cholecystectomy, where 0.3% of patients are at risk of bile duct injury.40 Artificial intelligence algorithms were developed using videos of laparoscopic cholecystectomy that were manually rated on a previously validated CVS scale. The algorithm may serve as a decision support mechanism to assist surgeons in the safe dissection and recognition of the CVS in laparoscopic cholecystectomy. Furthermore, SAGES has launched a Video-Based Competency Assessment Program and is working with an AI-based surgical intelligence platform to conduct a pilot for laparoscopic Nissen fundoplication. The pilot program will leverage the platform’s toolbox of smart annotation and video analytic technology to evaluate key operative skills that correlate with patient outcomes.41

American College of Surgeons Academy of Master Surgeon Educators

The ACS Academy of Master Surgeon Educators is making its way into the AI space by organizing platforms to discuss the future of AI-based surgical intelligence and by offering programming led by pioneers in the field.42In 2019, the ACS Academy of Master Surgeon Educators hosted a special symposium on “Emerging Technologies and Artificial Intelligence in Surgical Care and Education” to highlight key advances in this rapidly evolving domain; define the application of these advances to surgical care and surgical education; underscore the tremendous potential ahead; and create a specific roadmap for the Academy of Master Surgeon Educators to pursue. Following their inaugural AI session, the ACS partnered with the Department of Surgery at Duke University to produce a web-based educational tutorial on AI and how it can be used to make advances in surgical care.

The International Surgical Data Science Group

The Surgical Data Science Initiative serves to improve the quality and value of interventional healthcare through capturing, organizing, analyzing, and modeling data. It was founded to advance the emerging field of surgical data science by offering a platform for international collaboration. The committee has described a vision for the future OR in which the surgeon will benefit from a plurality of connected devices and access to relevant patient data across all clinical disciplines. These features can be used to improve surgeon training, aid in decision making, facilitate scheduling, and anticipate the need for resources. The initiative also aims to shape the methods by which the next generation of medical students and trainees learn from complex data without being restricted to a specific book or instructor.43

LIMITATIONS AND FUTURE DIRECTIONS

As we explore the future possibilities of surgical intelligence augmented by AI, we must also consider the many limitations of AI applications in surgery. In this section, we describe 4 major limitations and discuss some promising solutions for each: 1) lack of labeled data; 2) research efficiency; 3) bias for underrepresented demographics; and 4) lack of infrastructure and standards for data collection.

Lack of Labeled Data

One serious limitation in applying AI to the surgical domain is the lack of labeled or annotated data. For an AI algorithm to perform a task with high accuracy (eg, identifying surgical maneuvers such as suture throws in a video), it first must be trained using a large volume of data specific to that class. The task of identifying suture throws in a video would involve training the algorithm on video clips containing only suture throws. Depending on the amount of data available, this may involve a significant number of person-hours watching surgery videos and precisely noting the time periods that contain only suture throws. Achieving high accuracy for AI algorithms is directly linked to the volume of labeled data, which can pose a problem and cause delays in analysis. For example, in the case study of porcine bowel repair for the Surgical Metrics Project described above, labeling each video (average duration 15-20 minutes) took 2 to 4 hours. For a large number of videos, the time taken to finish annotating the videos for model training can be a significant bottleneck to the progress of AI-based projects.

Solution

One common practice to address the lack of sufficient data is to use a technique that involves initializing the weights of a NN before the training process. As illustrated in Figure 2, the weights of a NN get adjusted during the learning process. To train a NN from scratch, weights are initialized with random numbers (eg, a random number between 0 and 1) and then adjusted during the learning process. However, weights can be initialized more intelligently. For example, Jin and colleagues in 2018 used a DL ResNet-50 algorithm in their video classification model.18 Instead of training ResNet-50 from scratch (initializing its weights with random numbers), they initialized the weights using those of another ResNet-50 that had already been trained on the ImageNet dataset.16 The rationale behind this approach is that surgical images and natural images share many features (especially low-level features) (Figure 7). Therefore, weights trained on the natural images of the ImageNet dataset have already learned some of the features required to recognize surgical images. This process of initializing the weights of a NN using weights already trained on another dataset is called “transfer learning,” and it is a very common practice in DL.
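A minimal sketch of this transfer learning workflow, assuming the PyTorch/torchvision libraries and a hypothetical 5-class surgical tool task, might look as follows; it illustrates the concept rather than reproducing the cited studies' code.

```python
# Minimal sketch of transfer learning: initializing a ResNet-50 with
# ImageNet-trained weights rather than random numbers, then fine-tuning on
# a new (here hypothetical) surgical image task.
import torch.nn as nn
from torchvision import models

# Weights come pre-trained on ImageNet instead of random initialization.
model = models.resnet50(weights="IMAGENET1K_V1")

# Optionally freeze the earlier layers, which encode the low-level features
# shared between natural and surgical images.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer to predict our own classes.
num_classes = 5  # hypothetical number of surgical tool classes
model.fc = nn.Linear(model.fc.in_features, num_classes)
# Only the new model.fc layer is trained from scratch; everything else
# starts from the ImageNet-learned weights.
```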

Another approach to handling the lack of data is to add artificially generated data, which is called “data augmentation.” There are simple data augmentation techniques, such as mirroring an image. For instance, to train an image classifier to detect pneumonia in x-rays, we can mirror each input image horizontally and then use the mirrored image as an additional input to the NN, doubling the number of sample points. There are also more advanced methods for data augmentation, such as a type of DNN called “generative adversarial networks” (GANs), which can generate artificial images that are very similar to real images. A striking example of the capability of these networks is the generation of artificial human faces that are not easily distinguishable from real ones.44,45 Lee and colleagues in 2019 used a type of GAN and showed that their algorithm could generate images of surgical tools in a realistic surgical environment. These artificially generated tool images could potentially address the lack of data on less-used tools during surgery.46 Development of GANs is a hot topic of research, and their applications in surgery are expected to increase.
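The mirroring technique can be expressed in a few lines. The sketch below, with a hypothetical x-ray file name, illustrates both a one-off horizontal flip and on-the-fly random augmentation using torchvision.

```python
# Minimal sketch of simple data augmentation. The file name is hypothetical;
# mirroring each training image effectively doubles the number of samples.
from PIL import Image
from torchvision import transforms
from torchvision.transforms import functional as TF

image = Image.open("chest_xray_001.png")  # hypothetical training image

# Horizontal mirroring: the mirrored copy is added to the training set.
mirrored = TF.hflip(image)

# Random augmentations can also be applied on the fly during training.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=5),
])
augmented = augment(image)
```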

In Figure 3, an example of labeled data is shown, where for each input data sample there is a corresponding output in the dataset. The other type of dataset is called “unlabeled data,” where there is no information about the output labels. ML algorithms that use unlabeled datasets are called “unsupervised learning” algorithms. Unsupervised learning algorithms try to extract underlying patterns in data; for example, they can be used for data clustering.47
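As a minimal illustration of unsupervised learning, the following sketch clusters a small set of unlabeled (and hypothetical) data points with k-means; no output labels are provided, yet the algorithm discovers the underlying groups.

```python
# Minimal sketch of unsupervised learning on unlabeled data: k-means
# clustering groups samples by similarity without any output labels.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled samples: each row is a data point with 2 features.
X = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # eg, [0 0 1 1]: two underlying groups discovered
```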

Research Efficiency

The pace of conducting AI research on surgical datasets depends on many factors. To present an overview of research efficiency, we consider the following steps: data collection, preparing datasets, AI development and implementation, computational cost, and reporting results.

Data Collection

Operating room video data collection.

In collecting OR videos, influential factors for success include: 1) the infrastructure for data collection and recording; 2) privacy approval processes; 3) sterilization concerns (especially in the case of open surgery); and 4) the time period for collecting data samples, which can be heavily dependent on the number of surgeries conducted at a medical center.

Simulator video data collection.

In collecting videos from simulators, important factors include: 1) simulator development and pilot testing to ensure the preferred observations (ie, cognitive vs technical skill); 2) recruitment of participants to spend operative time on the simulator; and 3) privacy approval processes (usually more relaxed than for OR data collection).

Preparing Datasets

Data labeling/annotation.

One of the most laborious parts of analysis is data labeling. This process requires: 1) deciding on the data labeling protocol as a research team (eg, “the start of a suturing activity is the exact moment that a tool touches the tissue”); 2) watching videos (by data annotators); and 3) labeling the data. Software tools (eg, the Computer Vision Annotation Tool [CVAT]48) can greatly facilitate data labeling. It is recommended that the senior researchers of a project first label a portion of the data themselves to better understand the existing variations in the data, and then develop a protocol for dealing with those variations. Once the primary rules and approach are identified and verified, the senior researchers can delegate labeling of the entire dataset to data annotators. In many cases, data annotators will need to come back to the senior researchers for help in deciding special cases that the initial labeling protocol did not anticipate.

Data organization.

Before feeding a dataset to an AI algorithm, some data organization is generally required. For instance, in surgical tool video classification, it may be beneficial to break the videos into short clips (1-3 seconds) or individual frames and save them in different folders for different tools, as in the sketch below. This process requires some computer programming by the involved researchers.
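A minimal sketch of this frame extraction and foldering step, assuming OpenCV and hypothetical file paths and class names, might look like this:

```python
# Minimal sketch of the data organization step: breaking a video into
# frames and saving them to a per-class folder. Paths and the class name
# are hypothetical.
import os
import cv2

video_path = "trial_needle_driver.mp4"  # hypothetical clip of one tool
out_dir = os.path.join("frames", "needle_driver")  # one folder per tool
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(video_path)
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imwrite(os.path.join(out_dir, f"frame_{idx:05d}.png"), frame)
    idx += 1
cap.release()
```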

AI development and Implementation

Depending on the subject of research, AI researchers may need to develop new AI algorithms. There are 2 main software libraries that facilitate the development of new AI algorithms: Tensorflow49 and PyTorch.50 These libraries also provide well-known, already developed DL algorithms. In most research cases, the best option is to start with existing algorithms and fine-tune them on the new dataset. The libraries also provide the learned weights of well-known algorithms trained on large-scale natural image datasets, which allows researchers to easily perform transfer learning.

Both newly developed AI algorithms and previously existing algorithms require a process called “hyperparameter tuning” to be trained well on a new dataset. For instance, there are some “under the hood” parameters in the backpropagation process (Figure 2), such as the learning rate, that influence how well the weights can be optimized during training. There are multiple other hyperparameters to consider, which are beyond the scope of this monograph.
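As a minimal illustration, the sketch below tunes a single hyperparameter (the learning rate) by brute force; the build_model, train, and validate helpers are hypothetical placeholders for a project's own training code.

```python
# Minimal sketch of hyperparameter tuning: training the same model several
# times with different learning rates and keeping the best. build_model,
# train, and validate are hypothetical placeholders.
def tune_learning_rate(build_model, train, validate):
    best_lr, best_score = None, float("-inf")
    for lr in [1e-2, 1e-3, 1e-4]:  # candidate learning rates
        model = build_model()
        train(model, learning_rate=lr)  # a full training run per candidate
        score = validate(model)         # accuracy on a held-out validation set
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr, best_score
```

Because each candidate requires a full training run, this loop is a large part of the training computational cost discussed next.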

Computational Cost

The period of time that it takes to execute an algorithm is called “computational cost.”

Training computational cost.

A significant portion of computational cost is related to training an algorithm, as training requires optimization of the weights. Each training run may take a few hours to a few days, depending on many factors: the complexity of the algorithm (how many layers, how many neurons, and how they are connected), the amount of data, whether transfer learning is used, and hardware capabilities. In addition, researchers usually must train their algorithms multiple times to perform hyperparameter tuning, so this process can range from a couple of weeks to months.

Testing computational cost.

Running a trained algorithm on a “testing” dataset is usually not time-consuming; expect a range between a few minutes and a few hours, depending on the complexity of the network, the amount of data, and the hardware used. Running an algorithm in real time requires the algorithm to make predictions at a rate of 30 samples per second. The Yolo algorithm is a good example of an algorithm that runs in real time.
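A minimal sketch of measuring this cost is shown below, with a trivial placeholder standing in for a trained model's forward pass; sustaining 30 or more predictions per second meets the real-time threshold described above.

```python
# Minimal sketch of measuring testing computational cost: timing how many
# predictions per second a model sustains. predict() is a placeholder for
# a trained model's forward pass.
import time
import numpy as np

def predict(sample):
    # Placeholder standing in for a real model's inference step.
    return sample.mean()

samples = [np.random.rand(224, 224, 3) for _ in range(100)]
start = time.perf_counter()
for s in samples:
    predict(s)
elapsed = time.perf_counter() - start
print(f"{len(samples) / elapsed:.1f} samples per second")
```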

Reporting Results

Finally, reporting results to the community through presentations and manuscripts adds a further period to the overall research timeline.

Bias for Underrepresented Demographics

Another potential limitation of AI in surgery is bias against underrepresented demographics. Inadequate representation of gender, racial, ethnic, and other groups can create disparities by producing results and recommendations that are less effective or generalizable for these groups, yet are recommended for use in all populations.51 This can further exacerbate healthcare inequities for minoritized populations. For example, one study evaluating structured assessments of male residents compared to female residents found that comments about males were more positive and contained a higher number of standout remarks. More positive evaluations for male residents may impede the progress of female residents if the latter are seen as inferior in their ability to perform well.52 This contrast suggests that implicit bias may be infused into both the qualitative and quantitative processes of evaluating surgical residents. Moreover, if solutions are geared toward “fixing” the female residents instead of the evaluators, additional problems may arise by continuing the training bias and erroneously including the biased ratings in AI algorithms developed for resident assessment.

Solution

To circumvent issues of data bias, efforts should be made to include representative samples in training sets to avoid perpetuating implicit and explicit biases, as well as sample and diagnostic biases.53 This can be accomplished by curating data to develop large datasets for analysis and by specifically exploring physicians’ biases regarding diagnosis and treatment strategies and potential patient healthcare inequities. Additionally, data that are disregarded by an AI algorithm should be analyzed to appreciate trends and develop an understanding of why such data were excluded. Training all stakeholders to recognize bias and find solutions to overcome it may aid greatly in training unbiased algorithms.

Lack of Infrastructure and Standards for Data Collection

Although capturing videos of surgical procedures is common practice in hospital settings, a well-established, standard approach is critically needed to collect, store, and label data in collaboration with AI researchers. Anecdotal reports from individual surgeons note workflow distractions, competing responsibilities, and variation in infrastructure support when trying to record and store operative video.

Solution

One potential solution is to implement a structured, team-based approach to data collection and storage. Involving other OR team members in the process is greatly facilitated by promoting the team-level benefits of the data. In one hospital, for example, the OR circulator asks whether the team wants to record, begins the recording, and then saves it to the patient record via a picture archiving and communication system (PACS). To complement this process, the surgeon and other OR team members partner with hospital data managers and AI research teams to prioritize annotation and labeling strategies, and reports are generated and shared. The focus on team-level benefits sustains this process.

Another potential solution involves the procurement and use of platform technologies that focus on healthcare data storage. An example is the Philips HealthSuite Digital Platform (HSDP), a cloud-based solution that allows different data sources (eg, the electronic medical record [EMR], x-ray images, magnetic resonance imaging [MRI] images) to be integrated into a single location where they can be accessed by designated clinical teams and quality managers. The system provides support for analysis of the data and facilitates the design of patient-specific interventions.54

FINAL REMARKS

AI has been called the new electricity, and it is expected to significantly influence surgery in the upcoming decades. Incorporating this technology requires collaboration among surgeon-scientists, clinical teams and leaders, and AI researchers. Contributions from an integrated, interdisciplinary team will yield the best implementation outcomes and the broadest opportunities for significant, positive effects on surgical outcomes in the future.

GLOSSARY

Artificial intelligence (AI)

The theory and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages.

Accuracy

A performance metric defined as the ratio of “total correct predictions” to “total predictions.”

Activity

An action performed by a surgeon or a group of surgeons over a period of time.

Average precision (AP)

Object detection metric defined as the area under the precision-recall curve, plotted across different confidence thresholds for detecting an object's class.

Backpropagation

The process of optimizing neural network weights to minimize prediction error of a neural network.

Computational cost

The period of time that it takes to execute an algorithm.

Computer vision (CV)

The field of study that deals with gaining high-level understanding from digital images and videos.

Convolution (concept)

The process of sweeping a weight matrix over an image.

Convolutional neural networks (CNNs)

Neural networks that use convolution technique in their structure.

Data augmentation

Adding artificially generated data to a training set to address lack of data.

Deep learning (DL)

The field of developing algorithms of deep neural networks.

Deep neural networks (DNN)

Neural networks with a high number of layers; “deep” refers to the many layers.

F1-Score

A performance metric calculated as a function (harmonic mean) of the metrics, recall and precision.

Fully connected network

A neural network in which each neuron in the first layer is connected to all inputs, and each neuron in every subsequent layer is connected to all neurons in the previous layer.

Generative adversarial networks (GANs) (concept)

A type of deep neural network that can generate artificial images that are very similar to real images.

Ground truth

Expected output of an algorithm that is used for evaluation of testing data.

Hyperparameter tuning

The process of adjusting the settings (hyperparameters) of an algorithm so that it can be trained well.

Image classification

A process in which the inputs into an algorithm are images and the goal is to categorize the images based on their contents.

Intersection over union (IOU)

In object detection, a metric that quantifies the overlap between the ground truth and network output bounding boxes.

Labeled data

A type of dataset in which, for each input data sample, there is a corresponding output sample.

Learning

The process of adjusting weights of a neural network according to observed data points.

Long-short term memory (LSTM)

A type of recurrent neural network.

Machine learning (ML)

The field of study concerned with developing algorithms that learn by adjusting weights.

Mean average precision (mAP)

Object detection performance metric defined as the average of the average precision (AP) across different classes.

Neural network (NN)

A structure consisting of a large number of interconnected artificial neurons.

Object detection

Algorithms that are developed to: 1) spatially localize objects in an image and 2) categorize the identified objects into predefined classes.

Real-time

Providing predictions faster than 30 frames per second, so that the human eye cannot notice any glitch in a streaming video.

Recall

A performance metric defined as the ratio of “true positives” to “total ground truth positives.”

Recurrent neural networks (RNNs) (concept)

Algorithms developed to tackle temporal problems by learning weights that share information across time steps.

Supervised learning

Machine learning algorithms that use labeled datasets.

Testing set

The dataset that a neural network will be tested on.

Training set

The dataset that will be used to adjust the weights of the network in the learning process.

Transfer learning

The process of using information learned by a network to initialize learning of another network.

Unlabeled data

A type of dataset for which there is no known corresponding output for any input data sample.

Unsupervised learning

Machine learning algorithms that use unlabeled datasets.

Video classification

A process where the inputs into an algorithm are video clips and the goal is to categorize the video clips based on their contents.

Weights

Adjustable variables of an artificial neural network that get optimized during the learning process of a neural network.

You only look once (Yolo)

A common type of object detection algorithm.

Biography

Hossein Mohamadipanah, PhD, serves as a Senior Research Engineer in the Technology Enabled Clinical Improvement (T.E.C.I.) Center at Stanford University School of Medicine. Dr. Mohamadipanah completed his PhD in medical robotics and machine learning at Oklahoma State University. He is interested in developing artificial intelligence (AI) algorithms to address surgical challenges. His research focuses on developing video classification algorithms and addressing unbalanced data challenges using generative adversarial networks (GANs).

Calvin Perumalla, PhD, joined the T.E.C.I. Center team as a Postdoctoral Researcher. He received his masters and doctorate degrees in electrical engineering from the University of South Florida. His graduate research work involved building a novel cardiac rhythm monitor with enhanced diagnostic capabilities. He is passionate about using AI to improve the quality of human life and his current research interests include computer vision, image segmentation, and surgical data science.

Su Yang, BS, graduated from the University of Wisconsin-Madison, receiving his BS in Electrical Engineering with a subfocus in Computer Science. He joined the T.E.C.I. Center after graduation and spearheaded many of the projects in surgical simulation by applying his knowledge of electrical circuitry, database management, and statistical methodologies. His current research focus is on the development and analysis of multimodal datasets to explore paths to quantifying surgical expertise.

Brett Wise, BS, graduated from the University of Wisconsin-Madison, receiving his BS in neurobiology. His responsibilities in the T.E.C.I. Center include developing and fabricating simulators such as a bleeding pelvic tumor model, tourniquet training system, and breast simulators, as well as assisting with data analysis and data collections. His research interests lie in the development of wearable fabric sensors and working to understand how AI and virtual reality can impact the medical field.

LaDonna Kearse, MD, graduated from the University of Maryland-College Park with a BA in biological sciences and concentration in neurobiology & physiology, before earning her medical degree from Howard University College of Medicine. She has completed two years of general surgery residency training at Howard University Hospital. Her current interests include the integration of technology and simulation to enhance surgical training, developing individualized learning pathways, and improving patient outcomes.

Cassidi Goll, BA, is the Marketing and Administrative Coordinator in the T.E.C.I. Center at Stanford University School of Medicine. She graduated from the University of Wisconsin-Madison, receiving her BA from the School of Journalism and Mass Communications in public relations and strategic communications, with a certificate in entrepreneurship. As a health care communicator, Cassidi is interested in exploring optimal multi-channel methods for reaching health care professionals, medical students, and patients in a feverish digital environment.

Anna Witt, BS, is the Lab Manager of the T.E.C.I. Center, where she leverages her leadership and management interests to create a productive and collaborative work environment for students, postdoctoral fellows, and research staff. She graduated from the University of Wisconsin with a double major in biology and environmental studies and a certificate in mathematics. Her research interests lie in developing innovative research protocols that enable large-scale data collection of surgical performance data.

James R. Korndorffer, Jr. MD, MHPE, is Associate Professor and Vice-Chair of Education at Stanford University School of Medicine. He received his undergraduate degree in biomedical engineering from Tulane University, his MD from the University of South Florida College of Medicine, and his MHPE from the University of Illinois-Chicago. His clinical interests include minimally invasive surgery for gastrointestinal disorders and hernias. His research interests include surgical education, surgical simulation, patient safety, and patient care quality.

Carla Pugh, MD, PhD, is the Thomas Krummel Professor of Surgery at Stanford University School of Medicine. Her clinical area of expertise is acute care surgery. Dr. Pugh obtained her medical degree at Howard University School of Medicine and, upon completion of her surgical training at Howard University Hospital, went to Stanford University and obtained a PhD in Education. Her research interests relate to the use of advanced engineering technology and computer science approaches to advance data sharing in medical and surgical education, change measurement culture in healthcare, and improve patient outcomes.


Contributor Information

Hossein Mohamadipanah, Stanford University, Stanford, California.

Calvin Perumalla, Stanford University, Stanford, California.

Su Yang, Stanford University, Stanford, California.

Brett Wise, Stanford University, Stanford, California.

LaDonna Kearse, Stanford University, Stanford, California.

Cassidi Goll, Stanford University, Stanford, California.

Anna Witt, Stanford University, Stanford, California.

James R. Korndorffer, Jr., Stanford University, Stanford, California.

Carla Pugh, Stanford University, Stanford, California.

References
