Abstract
Recent advances in neural networks have given rise to generative artificial intelligence: systems able to produce fluent responses to natural questions, or attractive and even photorealistic images from text prompts. These systems were developed through new network architectures that permit massive computational resources to be applied efficiently to enormous datasets. Firstly, we examine the autoencoder architecture and its derivatives, the variational autoencoder and the U-Net, in annotating and manipulating images and extracting salience. This architecture will be important for applications such as automated x-ray interpretation or real-time highlighting of anatomy in ultrasound images. Secondly, we examine the transformer architecture in the interpretation and generation of natural language, as will be useful in producing automated summarization of medical records or performing initial patient screening. We also apply the GPT-3.5 algorithm to example questions from the ABA Basic Examination and find that, surprisingly, it correctly answers more than half of the questions under reasonable testing conditions.
Summary Statement
We review new architectures for machine learning that have greatly improved performance for generative artificial intelligence. What anesthesiology applications are clinically plausible? Testing the GPT-3.5 algorithm against the ABA Basic Examination provides a practical challenge.
Introduction
The field of artificial intelligence (AI) and machine learning has, since its inception, been marked by surprising advances interspersed with long periods of re-entrenchment colloquially known as the AI winters. Where we presently stand in this cycle is a matter of frequent speculation in both technical journals and the lay press. Nevertheless, it seems undeniable that there have been further astonishing gains in the performance of neural networks on natural language and image tasks even since the publication of two recent review articles in Anesthesiology. The first was an integrative review of the mathematics of decision-making and its application to the practice of anesthesiology.1 The second was a systematic review of published manuscripts employing artificial intelligence and machine learning within the scope of anesthesiology.2 This article will discuss the advances in the large-scale architecture of neural networks that have made recent gains possible, building upon conference proceedings, pre-press archives, and medical journals.3 Historically, AI has dealt with the problem of making decisions or predicting outcomes from only a limited repertoire of choices, such as selecting the best move in chess from the legal moves available. However, this review will highlight advances in the field of generative AI. Generative AI produces natural, open-ended output, typically in the form of text or images, in response to user prompts or queries, guided thematically by patterns or examples that it has discerned in a large underlying knowledge database.
While it can certainly be helpful to have an understanding of what occurs at the small scale of a neural network and how that mathematical behavior leads to machine learning and decision-making, the latest advances in generative AI have come about not through improvements in the theory of neural networks at the small scale, but rather through the development of new large-scale network architectures that have allowed the efficient application of massive computational power to extraordinarily large training datasets. This, in turn, has led to the ability to train larger neural networks with orders-of-magnitude increases in the number of network connections. Coupled with the ability to harness vast quantities of publicly available unlabeled data for training through mimicry, these advances have produced obvious improvements even in tasks for which these systems were not explicitly trained (known as zero-shot learning). [Palatucci M, Pomerleau D, Hinton GE, Mitchell TM: Zero-shot learning with semantic output codes. Advances in Neural Information Processing Systems, 2009.]
There are two main generative AI architectures that are the most important to understand. The first is the autoencoder and its derivatives, the variational autoencoder and the U-Net, which are useful for annotating and manipulating images, and for extracting salience. Salience, for our purposes, means recognizing the most important clinical findings from amongst a potentially vastly larger pool of unremarkable information. The second is the transformer architecture, which underpins large language models such as ChatGPT (https://openai.com/chatgpt) and which has enabled systems capable of producing fluent conversational prose through efficient training against terabytes of example text. Even though the details of these networks and algorithms are very complex, it is fortunately possible to come to a qualitative and intuitive understanding of their behavior without having to delve too deeply into the underlying mathematics.
Generative AI seems to bear most closely on interpretative medical specialties, such as diagnostic radiology or anatomic pathology, in which the overall cognitive act is to take some set of one or more images or image sequences obtained from a patient, reconcile these against the patient’s medical history as abstracted from a written medical record, and produce a textual interpretation. This professional act maps well to the idea of a function that operates on image and text inputs and produces a text output. Such an act is virtual, in the sense that direct interaction with the real-time physical world of objects and people is not required. Each cognitive act can be isolated, and the interpretation of a single case is separable from any other. Large libraries of representative digitized images and their associated interpretations can be assembled for training, and the outcome of an interpretative decision-making algorithm can be assessed unambiguously against the professional interpretation that was rendered at the time. Consequently, it is easier to imagine a future in which some form of machine-learning based system becomes able to produce an output that is no longer distinguishable from the work of a human specialist. The Turing test, a test of a machine’s ability to exhibit behavior indistinguishable from intelligent human behavior, was originally named the Imitation Game by English mathematician Alan Turing in 1950.4 Even today, advanced machine learning and artificial intelligence algorithms applied in radiology can pass the Turing test on certain constrained tasks.5
Autoencoders and Image Mimicry
It is probable that we will soon have real-time AI guidance for imaging tasks such as ultrasound guided nerve blocks or line placement. Autoencoders, in the form of U-Nets, will underlie these guidance systems because they provide an efficient technique for learning to identify and highlight important structures in an image. An autoencoder is a network architecture that allows a machine learning algorithm to determine, through mimicry, what structures or parts of an image are most salient.
As shown in Figure 1A, the network architecture is constructed such that the input image is progressively compressed into a smaller and smaller amount of data, then re-expanded and compared to the original image. The bottleneck, containing the smallest amount of data, is sometimes referred to as the latent space. The compressed data encoded here are often referred to as latents. Because this bottleneck is tight, the autoencoder network is forced to learn an efficient means of representing the input images so that the most salient information can be communicated from the encoder side to the decoder side. If the autoencoder has learned to work well, then the input image and the output image will be very similar. If the size of the bottleneck is well chosen, neither too big nor too small, then every item of latent data will be necessary and valuable. An autoencoder network is therefore a means by which a neural network can learn to infer what is most pertinent about a class of objects by attempting to mimic members of that class of objects under constrained conditions.
Figure 1:
Practical variations on the autoencoder neural network architecture, which can be used to learn to mimic images or to extract salience from them.
(A) The standard autoencoder, in which the network must produce an output image x̂ that accurately mimics the input image x. However, only a limited amount of data is allowed to pass through a bottleneck, so an efficient representation of the data must be learned at the bottleneck point z (known as the latent space).
(B) The variational autoencoder, in which the form of z is constrained to be representative of an underlying, interpolatable Gaussian probability distribution. The constraint that z must be represented in the form of a smooth interpolatable distribution implies that random latents in z that lie between true examples will still tend to decode to plausible examples. This architecture allows the generation of completely new output images ŷ that are representative of the training data by separating away the decoding portion and inserting random values at the input to z.
(C) The U-Net architecture: instead of recreating the input image (x̂), the network learns to mimic the markup that an expert would have applied to the image (m̂). Adding long skip connections above z, as shown, can improve the quality of the output image, but this means that the network is no longer separable at z and cannot generate new images in the way that the variational autoencoder can. However, this restriction is no sacrifice because it is not meaningful to generate markup images for input images that do not exist.
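To make this concrete, the following is a minimal sketch of the standard autoencoder of Figure 1A, written in Python with the PyTorch library. The layer sizes, the flattened 28 × 28 grayscale image, and the single training step shown are illustrative assumptions, not a prescription for a working system.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Standard autoencoder (Figure 1A): compress x to latent z, reconstruct x̂."""
    def __init__(self, n_pixels=28 * 28, n_latents=16):
        super().__init__()
        # Encoder: progressively compress the image down into the latent space z.
        self.encoder = nn.Sequential(
            nn.Linear(n_pixels, 256), nn.ReLU(),
            nn.Linear(256, n_latents),          # the bottleneck: z
        )
        # Decoder: re-expand z back into an image of the original size.
        self.decoder = nn.Sequential(
            nn.Linear(n_latents, 256), nn.ReLU(),
            nn.Linear(256, n_pixels), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

# One training step: compare the output image x̂ to the input image x (mimicry).
model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 28 * 28)                 # stand-in for a batch of images
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), x)  # how faithfully was x mimicked?
loss.backward()
optimizer.step()
```

Note that the only training signal is the fidelity of the reconstruction; no labels are required, which is why autoencoders can be trained by mimicry on unlabeled data.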
Autoencoders can also be used to learn how to generate markup or identify structures in an image. Suppose we wish to develop a neural network to identify the brachial plexus in ultrasound images of the neck in real-time. We might begin by assembling a dataset that contains both raw image examples of sonoanatomy of the neck and corresponding expert annotations that mark up the true location of the brachial plexus in each image. Previously, the goal in training the autoencoder was to produce good approximations of the original image, but now the training goal is to produce good approximations of the expert markup when given the corresponding raw ultrasound image. The bottleneck learns to distill the salient aspects of the input image in order to generate the markup. This form of autoencoder is known as a U-Net, as shown in Figure 1C, and has achieved great popularity in application to image interpretation in the specialties of radiology and pathology.6 This technology is also suitable for real-time application, as would be required for its use in anesthesiology for procedures such as line or block placement. Although training a network may require large datasets and prolonged computation, applying an existing network to a raw image to produce a markup output is a computationally straightforward and brisk task. This would appear live to the user, since an efficient implementation of a U-Net can be evaluated in less time than it would take to display one video frame of ultrasound onscreen. It is relatively easy to visualize how a markup image might be transparently layered over a raw ultrasound image to provide procedural targeting guidance, such as highlighting the brachial plexus, in a similar manner to how color Doppler can be overlaid in real-time over an ultrasound image to show flow.
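A minimal U-Net-style sketch appears below. The two-level depth and channel counts are assumptions chosen for brevity; a clinical system would be considerably deeper. The defining feature is the long skip connection, implemented here as a concatenation of encoder features into the decoder.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Two-level U-Net sketch (Figure 1C): ultrasound frame in, markup out."""
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        # 32 upsampled channels + 16 skip channels are concatenated below:
        self.up = nn.Sequential(nn.Conv2d(32 + 16, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, 1, 1)   # per-pixel markup (e.g., a plexus mask)

    def forward(self, x):
        e = self.down(x)                        # encoder features, full resolution
        b = self.bottleneck(self.pool(e))       # compressed representation at z
        d = self.upsample(b)
        d = self.up(torch.cat([d, e], dim=1))   # the long skip connection
        return torch.sigmoid(self.head(d))

# Training compares the predicted markup to the expert markup m, not to x itself:
model = TinyUNet()
x = torch.rand(8, 1, 64, 64)   # stand-in batch of raw ultrasound frames
m = torch.rand(8, 1, 64, 64)   # stand-in expert annotations of the brachial plexus
loss = nn.functional.binary_cross_entropy(model(x), m)
```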
Autoencoders can also be used for image generation and manipulation, and this technique could be extended into the medical realm to produce images that represent particular degrees of pathology. We might speculate that we could generate new examples of images by keeping only the decoder portion of the autoencoder network, supplying random latents and decoding them to new images. In practice, this approach does not immediately work because we have not yet guaranteed that random latents that lie between true latents still represent viable examples. However, if we add specific mathematical constraints to the latents (i.e., they must be represented in the form of a smooth underlying probability distribution), then the random latents will tend to decode to plausible examples. [Razavi A, Van den Oord A, Vinyals O: Generating diverse high-fidelity images with VQ-VAE-2, Advances in Neural Information Processing Systems. Vancouver, CA, 2019.] This architectural innovation is called a variational autoencoder, as shown in Figure 1B. [Kingma DP, Welling M: Auto-encoding Variational Bayes. arXiv preprint arXiv:1312.6114, 2013.] Suppose that we train a variational autoencoder network on images of human faces, subdividing our training data into human faces that are smiling versus unsmiling. By looking at the average latent representation of these two cohorts, the variational autoencoder can learn to extract a trend between unsmiling and smiling latents. We might next take an image of an unsmiling person, encode the image to its latent at the bottleneck, and then modify the latent to push it more in the direction of smiling. Decoding this latent should produce a new modified image in which the original person now appears to be smiling. While this is obviously useful, and available to modern consumers for the manipulation of photographic images, on a deeper level we have also learned an algorithmic representation of the semantic concept of smiling. [White T: Sampling generative networks. arXiv preprint arXiv:1609.04468, 2016.] If we were to apply this technique to a dataset of medical images of varying degrees of pathology, we might be able to train a variational autoencoder to learn what trends in appearance correspond to worsening or healing of those lesions. Then, given a new real patient image, we might be able to use the trained variational autoencoder to generate a sequence of illustrations or an animation to show how that particular patient's lesion might be expected to worsen or resolve over time.
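Once such a network is trained, the latent manipulation described above reduces to simple vector arithmetic, as in this sketch. The encoder and decoder here are untrained stand-ins and the cohort tensors are random placeholders; in practice they would be the trained halves of a variational autoencoder and real cohorts of images.

```python
import torch
import torch.nn as nn

# Untrained stand-ins for the trained halves of a variational autoencoder:
n_pixels, n_latents = 28 * 28, 16
encoder = nn.Linear(n_pixels, n_latents)
decoder = nn.Linear(n_latents, n_pixels)

smiling = torch.rand(100, n_pixels)      # stand-in cohort of smiling faces
unsmiling = torch.rand(100, n_pixels)    # stand-in cohort of unsmiling faces
new_face = torch.rand(1, n_pixels)       # a new, unsmiling input image

with torch.no_grad():
    # The learned trend is simply the difference of the cohorts' mean latents.
    direction = encoder(smiling).mean(0) - encoder(unsmiling).mean(0)

    # Encode the new face, push its latent along the trend, and decode a
    # sequence of frames in which the face appears progressively "more smiling".
    z = encoder(new_face)
    frames = [decoder(z + strength * direction)
              for strength in (0.0, 0.5, 1.0, 1.5)]
```

The same arithmetic applies unchanged if the attribute is a degree of pathology rather than smiling, which is what makes the speculated prognosis animations plausible.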
Transformer Networks and Language Mimicry
Natural language tasks such as speech recognition, translation, and text generation are operations on sequences of data. A simple example is the autocomplete system on most smartphone keyboards which, based upon the preceding text, offers options for the word (or words) most likely to come next in sequence. One can simply tap on the autocomplete button again and again to emit a stream of natural language, although the text produced is typically meandering and nonsensical. Because the next word is chosen based only on the one or two preceding words, the algorithm has almost no train-of-thought to speak of.
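This limitation is easy to demonstrate. The toy autocomplete below conditions on only the single preceding word, using nothing more than a table of observed word pairs; the training text is an invented placeholder.

```python
import random
from collections import defaultdict

corpus = ("the patient was stable the patient was intubated "
          "the airway was secured and the patient was extubated").split()

# Build a bigram table: which words have been observed to follow each word?
following = defaultdict(list)
for word, nxt in zip(corpus, corpus[1:] + corpus[:1]):   # wrap to avoid dead ends
    following[word].append(nxt)

# "Tap autocomplete" repeatedly: each choice sees only one word of context,
# so the emitted stream meanders with no train-of-thought.
word, text = "the", ["the"]
for _ in range(12):
    word = random.choice(following[word])
    text.append(word)
print(" ".join(text))
```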
A recurrent network is one in which some of the outputs or hidden internal states of the network are recycled back to the input stage of the network. These pathways allow the output of the network to depend not just on its current inputs but also on the previous behavior of the network, which affords the network a property of train-of-thought and hence the ability to observe and respond to evolving trends. In a previous review,1 the concept of recurrency was introduced and illustrated by a simple example of a network that might decide whether to transfuse blood based not only upon the immediate systolic blood pressure and heart rate but also upon the previous calculations that it had made. Although recurrent neural networks have been developed for translation or speech recognition with some success, they do have significant limitations. The recurrent design treats all past information essentially equally, but it is clear that some past information is likely to be very much more important than the rest, and so critical contextual information may well have been lost by the time it is needed. Recurrent networks can be slow to train because it is not possible to train for a future time step until all the antecedent steps are completed. Recurrent networks therefore cannot easily benefit from the construction of massively parallel datacenters for machine learning training. Recurrent networks are also relatively difficult to train: when a prediction error occurs, the appropriate error signal for training may have to arrive from several cycles or steps before. The pathway from the point at which the error occurred to the point at which its effect became apparent can become too long for reliable machine learning.
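In code, recurrency amounts to feeding the network's previous hidden state back in alongside the current input, as in this minimal sketch echoing the transfusion example above (the variable names and sizes are illustrative):

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=2, hidden_size=8)   # inputs: e.g., SBP and heart rate
readout = nn.Linear(8, 1)                        # output: e.g., transfuse or not
h = torch.zeros(1, 8)                            # hidden state: the train-of-thought

vitals = torch.rand(20, 1, 2)   # stand-in sequence of 20 timesteps of observations
for x_t in vitals:
    h = cell(x_t, h)            # output depends on current input AND prior state
    decision = torch.sigmoid(readout(h))

# Note the serial dependence: step t cannot be computed until step t-1 is done,
# which is exactly what limits the parallel training of recurrent networks.
```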
Many of these problems are solved by the design of the transformer network architecture. [Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I: Attention is all you need, 2017 Conference on Neural Information Processing Systems. Edited by Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R. Long Beach, CA, 2017.] Rather than having any defined recurrent paths or stream processing, the transformer network operates over the entire input sequence at once, in what is known as an “unrolled input space.” At each stage, the network is able to access the inputs from all other stages. This architecture creates a direct path from any previous state, rather than requiring that information to persist within the network through multiple cycles. The unrolled nature of the network means that it is much more capable of being parallelized, and so massive computational resources can be brought to bear on its training. Any difficulties with the ordering of inputs (for example, differences in grammatical word-ordering when translating between languages) are greatly reduced, as the machine learning algorithm can simply direct its attention to any part of the unrolled input space as needed. However, the downside is that the network becomes very large, and the number of individual connections that it contains can become extraordinary. Transformer networks composed of more than a trillion parameters are now feasible,7 but require access to enormous quantities of training data to avoid becoming overtrained. A fascinating question is whether such networks learn only patterns of relationships between words, or whether factual knowledge and fluency can be encoded distinctly. This is important for medical applications, in which there is an especially high bar for factual correctness.8
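The central operation of the transformer is scaled dot-product attention, in which every position of the unrolled input computes a weighted view over every other position. A minimal sketch, with illustrative sizes:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Every position attends directly to every other position: no recurrence."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))   # (n, n) affinities
    weights = torch.softmax(scores, dim=-1)   # where each position directs attention
    return weights @ V                        # weighted blend over all positions

n_tokens, d_model = 6, 32                 # illustrative sizes
x = torch.rand(n_tokens, d_model)         # embeddings of the unrolled input
Wq, Wk, Wv = (torch.rand(d_model, d_model) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)   # torch.Size([6, 32]): one updated representation per token
```

Because every pairwise comparison in the (n, n) score matrix is independent, the whole computation parallelizes across a datacenter in a way that a serial recurrent loop cannot.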
A current example of the natural language transformer architecture is the GPT-3.5 network, [Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A: Language models are few-shot learners, 2020 Conference on Neural Information Processing Systems.] a large language model with 175 billion internal network connections and the ability to manipulate an input length of up to 2048 tokens, roughly corresponding to words. The input length is a design parameter of the system, and implicitly defines the maximum possible length of the system’s train-of-thought. The input word sequence is first converted into tokens, and the network must then disambiguate multiple meanings of the same word. For example, possible meanings of the noun “club” include an item of sporting equipment, a crude weapon, a discotheque, a group of like-minded individuals, and a type of sandwich. [Yarowsky D: Unsupervised word sense disambiguation rivaling supervised methods, 33rd Annual Meeting of the Association for Computational Linguistics, 1995.] In anesthetic practice, similar semantic ambiguities can arise. For example, “MAC” may refer to a unit of volatile anesthetic concentration, a mode of anesthetic practice, a type of laryngoscope blade, or a type of central venous catheter. A “PE” may conceivably refer to a pulmonary embolism, a pleural effusion, a physical examination, or a physician extender. Each possible meaning must be resolved from the surrounding context.
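Tokenization itself is easy to inspect. The sketch below assumes the openly available tiktoken library (any byte-pair tokenizer would serve equally) and shows how a phrase is converted into integer tokens before the network ever sees it.

```python
import tiktoken   # assumed available: pip install tiktoken

# cl100k_base is the byte-pair encoding used by several OpenAI chat models.
enc = tiktoken.get_encoding("cl100k_base")

phrase = "MAC anesthesia with a MAC blade"
tokens = enc.encode(phrase)                 # a short list of integers
print(tokens)
print([enc.decode([t]) for t in tokens])    # the text fragment each token denotes
```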
GPT-3.5 was trained on a natural language dataset that incorporates:
- the English-language section of Wikipedia;
- two large collections of books, titled simply books1 and books2. Controversially, the actual contents of these corpora have not been disclosed, which has led to legal complaints that the dataset may contain currently copyrighted works whose use has not been approved by the original authors; and
- the Common Crawl dataset, which comprises more than a trillion words of text acquired from the internet.9 Common Crawl amounts to around 45 terabytes of compressed plaintext, which was filtered and deduplicated to 570 gigabytes for training.
For comparison, the Complete Works of William Shakespeare comprise a mere 2.1 megabytes of compressed plaintext. [Shakespeare W: The Complete Works of William Shakespeare. Project Gutenberg, Salt Lake City, UT, 1994. (https://www.gutenberg.org/ebooks/100)] A 570-gigabyte corpus of compressed plaintext is therefore more than a quarter of a million times larger than Shakespeare’s whole oeuvre. In contrast to the stilted performance of recurrent text generators, the performance of large language models such as GPT-3.5 is uncannily appropriate and fluent in responding to user prompts.10 This improvement in fluency is due to the large size of the dataset against which the model is trained, the architectural advantages of the transformer neural network that permit such large datasets to be employed, and the use of an unrolled input space that allows important information to be easily recalled into the conversation. However, the neural network does not continually train on new data and so, for example, it cannot incorporate new knowledge (e.g., current affairs, new content) that became available after its training was completed. Post-hoc training and moderation is also necessary so that the language model avoids responding to prompts with support for antisocial sentiments or actions.
Associating Images with Natural Language: Describing and Creating
Could it be possible to combine these preceding ideas to design a network architecture that learns to produce textual descriptions of images by mimicry? In the discussion of autoencoders, we described how a neural network could learn to generate image markup of salient features. However, such markup is only equivalent to a radiologist putting an arrow on a suspicious finding in an image with no further explanation. Ideally, we would also want to receive a text description of what was found, why that finding is important, and what the subsequent clinical implications might be. Clearly, it would be useful to generate text outputs from image prompts in pursuit of the automation of medical image interpretation. In anesthetic practice, this might be applicable to providing immediate interpretation of transesophageal echocardiogram images taken during cardiac surgery, or to an interpretation of findings from a bronchoscopy on an intubated ICU patient in respiratory failure.
One approach to generating textual descriptions of images is the Contrastive Language-Image Pre-training (CLIP) network architecture. [Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J: Learning transferable visual models from natural language supervision, International Conference on Machine Learning 2021.] CLIP learns how to describe images by training against a dataset of images paired with plaintext descriptions, i.e., each element in the dataset is structured as [an image, its caption]. Its training dataset contains 400 million such pairs. Each image is passed through an encoder (as in the first part of the autoencoder) to produce its equivalent image latent. A similar operation is performed on the text description to produce a text latent using an earlier pre-trained version of a GPT natural language transformer network (a large language model with 63 million internal network weights). The CLIP architecture then serves as a form of switchboard, in which it attempts to learn functions that optimally connect each image latent to its single corresponding text latent while disfavoring mismatches between non-paired latents. This training process is described as self-supervised representation learning. For example, the architecture learns to associate image latents of cats with the text latent of the word “cat” and image latents of dogs with the text latent of the word “dog”, while reducing confusion between the two concepts. Once the algorithm has been trained, an image can be presented as its latent, and the CLIP algorithm selects the text latent that most corresponds to this image latent based upon the data that it previously saw. This most closely matching text latent is then decoded to produce the output description of the supplied image. If the image has not previously been seen, then this test is an example of zero-shot learning because the algorithm is being asked to perform a task that it has not previously encountered. If this algorithm can learn to describe arbitrary photographs from the internet, could a similar technique be applied to paired datasets of radiological images and their professional interpretations with the intention of producing a system capable of interpreting such images automatically? A recent publication on automating the interpretation of chest X-rays suggests that this is a viable pathway, with the resultant algorithm able to attain performance comparable to the level of trained radiologists in the identification of the five tested pathologies of atelectasis, cardiomegaly, consolidation, edema and pleural effusion.11
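The contrastive “switchboard” objective at the heart of CLIP can be sketched in a few lines. The latents below are random placeholders standing in for the outputs of the image and text encoders, and the temperature value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_latents, text_latents, temperature=0.07):
    """Contrastive switchboard: the i-th image should match the i-th caption
    (the diagonal of the similarity matrix) and mismatch every other caption."""
    img = F.normalize(image_latents, dim=-1)
    txt = F.normalize(text_latents, dim=-1)
    logits = img @ txt.T / temperature              # (n, n) pairwise similarities
    targets = torch.arange(len(img))                # correct pairs on the diagonal
    loss_i = F.cross_entropy(logits, targets)       # match image -> its caption
    loss_t = F.cross_entropy(logits.T, targets)     # match caption -> its image
    return (loss_i + loss_t) / 2

# Random placeholders standing in for the image- and text-encoder outputs:
image_latents = torch.rand(32, 512)
text_latents = torch.rand(32, 512)
print(clip_style_loss(image_latents, text_latents))
```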
However, a recent study in dermatology provides a telling counter-example. The role of the machine learning process is to produce a network that is as accurate as possible in its predictions on the training data. Thus, it is to be expected that this learning process will seize upon any unfortunate “tells” that lurk in the training data. Given images of benign and malignant skin lesions, a network designed to discriminate their appearance instead primarily learned to associate the concept of malignancy not with the appearance of the lesion itself but with whether a ruler was present in the image or not.12 One can imagine similar problems that could arise with the interpretation of chest x-rays in anesthetic practice: a neural network might associate the presence of an endotracheal tube with a diagnosis of respiratory failure, or a chest tube with pneumothorax, rather than evaluating the appearance and pathology of the lungs themselves.
It is also possible to produce neural networks that work in the opposite direction, generating image outputs from text prompts. The results are often whimsical, startling and even strikingly beautiful. These networks seem less immediately relevant to current clinical practice, but nevertheless one can still imagine useful applications such as producing personalized medical illustrations on demand to help communicate with patients, or the possibility of generating a fund of plausible but artificial patient imagery for the purposes of didactics or examinations without the concerns that attend to the handling of protected health information.
Generative AI for images proceeds by starting from a random, noisy image and repeatedly cleaning and improving it so that its description approaches the prompt provided by the user. It is straightforward to create a large training dataset for denoising: one can simply add random noise to existing images. The neural network is then trained to remove noise. [Song Y, Sohl-Dickstein J, Kingma DP, Kumar A, Ermon S, Poole B: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.] [Ho J, Jain A, Abbeel P: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 2020.] The latent of the image is then guided towards the latent of the user’s prompt. By repeated application of this guided denoising process, an output image gradually emerges that usually does indeed represent the text prompt. [Ramesh A, Dhariwal P, Nichol A, Chu C, Chen M: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.] The generation of the output image from an initial block of raw noise is evocative of a quotation attributed to the sculptor Michelangelo: “The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material.” The random image noise at the start is progressively shaped away to produce an output image that is understood to be similar in meaning to the text prompt.
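A single training step of such a denoiser can be sketched as follows. The network here is a deliberately simple stand-in; in practice it would be a U-Net, as described earlier, conditioned on the noise level and on the text latent of the caption.

```python
import torch
import torch.nn as nn

denoiser = nn.Sequential(              # deliberately simple stand-in for a U-Net
    nn.Linear(28 * 28, 256), nn.ReLU(), nn.Linear(256, 28 * 28))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

clean = torch.rand(64, 28 * 28)        # a batch of existing training images
noise = torch.randn_like(clean)        # the training data is free to create:
noisy = clean + 0.5 * noise            # just add random noise at a chosen strength

optimizer.zero_grad()
predicted = denoiser(noisy)            # the network learns to predict the noise
loss = nn.functional.mse_loss(predicted, noise)
loss.backward()
optimizer.step()

# Generation runs in reverse: start from pure noise and repeatedly subtract the
# predicted noise, nudging the image's latent toward the text prompt's latent.
```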
The best-known generative AI systems for artwork at present are DALL·E 2 (https://labs.openai.com), Stable Diffusion (https://www.stability.ai), Midjourney (https://www.midjourney.com) and Firefly (https://firefly.adobe.com). As an example, Figure 2 shows two images generated in response to the prompt “Anesthesiologists drinking coffee in their break-room in the style of The Night Watch by Rembrandt.”
Figure 2:
Example output from the natural language guided image generation AI, DALL·E 2.
(A,B) Two images produced by the OpenAI system DALL·E 2 in response to the natural language prompt “Anesthesiologists drinking coffee in their break-room in the style of The Night Watch by Rembrandt.”
(C) For comparison, a section of Rembrandt’s The Night Watch (1642); digital image placed in the public domain by the Rijksmuseum, Amsterdam, NL.
Generative AI and the Cognitive Tasks of Anesthesiologists
Which cognitive tasks within the current human practice of anesthesiology might be amenable to being improved, or even surpassed, by machine learning? Here are three further suggestions.
Firstly, one task might be automating the process of an interactive, patient-centered pre-operative assessment. This is distinct from classic risk prediction, which is fundamentally a closed-end supervised learning activity. Might it be possible, given 30-day mortality as an outcome variable and access to the full text of medical records, for a generative AI algorithm to learn how to outperform current preoperative evaluation and assessment techniques, perhaps even by recognizing harbingers of undiagnosed co-morbidities in the patient’s longitudinal record? This is an area that is ripe for exploration, based on both the large input datasets available for harvesting from electronic medical records and the unambiguous nature of mortality as an outcome endpoint.13 Rather than focusing on risk prediction, a patient-centered approach might use a back-and-forth interaction between a language model and a patient to identify clinical characteristics associated with elevated 30-day mortality via simulation of a probing conversation about the patient’s medical history. This would allow a much more nuanced and sophisticated approach than is permitted by a simple checkbox-style review-of-systems. Secondarily, this type of automated, interactive preoperative assessment would likely also be a better risk stratification system than the current ASA physical status classification,14 which is a poor predictor of the outcome for individual patients. The ASA physical status classification system demonstrates a poor positive predictive value of only about 3% for Status IV and V combined for 30-day mortality.15 Better-performing models have been constructed, at least based on retrospective data.16
Secondly, the ability to evaluate the current intraoperative state of a patient might be improved. In chess, the value of a position describes whether it is winning or losing, and possible moves can be judged by the degree to which they improve or worsen the current value. Is it possible to value the current status of a patient, and detect whether a risk of catastrophic deterioration is at hand? This is not a new idea; the literature contains many examples of attempts to quantify patient status from a limited number of physiologic variables.17–20 Access to a large database of intraoperative anesthesia records with multimodal data might allow us to learn better warning systems for incipient catastrophic deterioration of a patient’s condition. As an initial step, it may be useful to produce a warning system that recognizes significant deviations away from the normal progression of a particular surgery, thus alerting clinicians to the need for expert intervention. This might be achieved by categorization based on the records of the large number of normal surgeries that exist: the single output would be “this case is proceeding normally” versus “this case has become abnormal.” The ability to simply recognize the presence of significant clinical deviations is one of the important early skills acquired during residency.21 By learning to recognize “normal” as a first step, one might potentially avoid the need to learn to classify every possible intraoperative pathology upfront. Later systems might learn to advise interventions that might be taken in mitigation.
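As a purely speculative sketch of this “normal versus abnormal” recognizer, one might train an autoencoder only on records of normal cases and flag a live case whose reconstruction error drifts outside the range seen during training. Every name, size, and threshold below is an assumption for illustration.

```python
import torch
import torch.nn as nn

# Autoencoder over short windows of intraoperative vitals (illustrative: 5
# channels, e.g., HR, SBP, DBP, SpO2, EtCO2, over 60 timesteps, flattened).
n_features = 5 * 60
model = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(),
    nn.Linear(64, 8),                          # bottleneck
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, n_features))

normal_cases = torch.rand(1000, n_features)    # stand-in: normal surgeries only
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):                           # abbreviated training loop
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(normal_cases), normal_cases)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    errors = ((model(normal_cases) - normal_cases) ** 2).mean(dim=1)
    threshold = errors.mean() + 3 * errors.std()   # an assumed alarm criterion

    live_window = torch.rand(1, n_features)        # a window from an ongoing case
    err = ((model(live_window) - live_window) ** 2).mean()
    print("this case has become abnormal" if err > threshold
          else "this case is proceeding normally")
```

A case the network has never learned to mimic reconstructs poorly, which is precisely the single-output deviation alarm described above.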
Thirdly, what might be the impact of artificial intelligence on the process of professional licensure? To what extent can a machine learning algorithm demonstrate apparent understanding of the basic knowledge of anesthesiology? In fact, a preliminary assessment can already be made. The American Board of Anesthesiology (ABA) has publicly released a sample set of sixty questions from its Basic Examination. [The American Board of Anesthesiology: BASIC Examination Questions with Answer Key. Raleigh, NC, 2020. (https://theaba.org/pdfs/BASIC_Questions.pdf)]
In the context of this manuscript, the author sought to evaluate the potential of modern AI tools. Each multiple-choice question from the ABA Basic Examination sample set, together with its possible answers, was submitted in turn to the OpenAI ChatGPT service, an openly accessible implementation of the GPT family of large language models. ChatGPT’s response was recorded unmodified, and a determination was made as to whether the question was answered correctly. Three of the sample ABA Basic questions involved the interpretation of a figure, which is outside the scope of the service, leaving 57 questions to be assessed. The supplemental appendix contains these responses, as obtained using the version of the ChatGPT service active on December 9th, 2022. ChatGPT encountered no difficulties with following the form of an examination in multiple-choice format. Although many responses were factually incorrect, the algorithm always produced fluently argued replies that would likely be sufficient to convince a layperson. Its explanations did not wander into irrelevance, and it did not contradict itself. Troublingly, even when incorrect, the algorithm never indicated that it entertained any degree of uncertainty about its answer. The algorithm struggled to express a clearly preferred response in only one instance (question 19); otherwise, its responses were unambiguous. This is an interesting test of zero-shot learning, since ChatGPT has presumably not encountered these questions before and is not in any sense optimized to process questions about anesthesiology. Given 57 multiple-choice questions with four possible responses, we would expect a system operating merely by chance to obtain around 14 correct answers; this is the null hypothesis H0. To accept a hypothesis H1 that the system possesses at least some of the basic knowledge of anesthesiology, we would wish to reject H0 with P < 0.05. Using the cumulative binomial distribution, this level is reached at 20 or more correct answers out of 57 (P = 0.032). ChatGPT performed substantially better than that, responding correctly to 27 questions out of 57 (P = 0.000073). Although the ABA releases passing rates for its examinations (i.e., the percentage of applicants taking the test who pass), the passing grade on any particular examination (i.e., the percentage of questions answered correctly that constitutes a pass) is never made publicly available. It would be very interesting to know the historical performance of applicants on the questions in this example set.
There are additional remarkable aspects to this performance. The algorithm made good efforts even at questions that are not well constructed. The questions in the example set are, presumably, questions retired from the ABA question bank. Some questions have become inaccurate with the passage of time and advances in anesthesia technology (e.g., pressure support ventilation used to be triggered by a decrease in airway pressure, but modern ventilators now more commonly feature inspiratory flow-rate triggering). Some of the questions in the example set are written in a style that does not comply with the current guidelines for ABA exam question construction. Questions are required to have one unambiguously correct answer that a skilled applicant could potentially supply even without seeing the available responses. This guideline excludes open-ended prompts and negative-sense questions. If six non-compliant questions (questions 15, 19, 39, 43, 46 and 58) are excluded from the assessment of ChatGPT’s performance, then 27 out of 51 questions were correctly answered. This is more than half right (52.9%), and corresponds to P = 0.0000049. It is an impressive example of zero-shot performance against previously unseen data. One wonders, comparatively, how a man-in-the-street or even a fourth-year medical student might have fared on this professional licensing examination. Indeed, ChatGPT-3.5’s performance here is only a lower bound. How much improvement might be obtained if the algorithm were given explicit refinement training against anesthesia reference materials or practice tests, rather than being made to sit this examination essentially cold? This process of taking a generic pre-trained language model and then repurposing it to a specific task by further training on a small body of specialized, high-quality data is known as transfer learning.22
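The significance calculations above are straightforward to reproduce from the cumulative binomial distribution, as in this short check (scipy is an assumed dependency):

```python
from scipy.stats import binom

# Probability of at least k correct out of n four-option questions by chance:
def p_at_least(k, n, p=0.25):
    return binom.sf(k - 1, n, p)   # survival function: P(X > k-1) = P(X >= k)

print(p_at_least(20, 57))   # ≈ 0.032: threshold for rejecting H0 at P < 0.05
print(p_at_least(27, 57))   # ≈ 0.000073: ChatGPT's score on all 57 questions
print(p_at_least(27, 51))   # ≈ 0.0000049: after excluding six flawed questions
```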
The ABA Examinations – including the Basic, Advanced, and Applied examinations – are intended to test judgment and interpretative ability, not just simple recall. Thus, the apparent performance of natural language algorithms like GPT-3.5 is unsettling. Perhaps this is the end of the era in which multiple-choice examinations can be used as discriminants of professional aptitude. Testing efforts may need to be refocused on the ability of the applicant to evaluate the status of a patient correctly and to perform critical decision-making and physical actions under abruptly evolving conditions, to test understanding and ability more than fluency. New assessment tools may be needed to draw that line fairly and appropriately.
In summary, we have reviewed the architectures of modern machine learning networks, how these architectures give rise to generative AI, defined as the ability to interpret or synthesize image or text outputs in response to image or text prompt inputs, and how those abilities might be harnessed as interpretative skills useful to clinical practice. However, the specialty of anesthesiology appears less like a series of isolatable acts of interpretation and more like a closely linked interactive sequence of decisions and practical physical actions that must be undertaken under real-time pressure. It can be difficult to determine whether a particular decision in anesthetic management was correct or not: how would we estimate how a particular case might have ultimately evolved if a different decision had been made? Even the most sophisticated self-learning algorithms, as tested against chess, Go and Atari video games, still require a virtual world in which interactive actions and strategies can be tested and replayed and trialed over millions of attempts.23 Thus, the dependence of anesthetic practice on physical actions and real-time assessment makes it harder to imagine a near future in which a machine learning system becomes able to impersonate the actions and decision-making of an anesthesiologist: a Turing test in anesthesiology will remain hard to develop and even harder to pass.
Supplementary Material
Acknowledgements
The author thanks:
Dr. Aurora Burds of the Massachusetts Institute of Technology for providing feedback on this article during its writing.
Dr. James Rathmell of Brigham and Women’s Hospital for comments and discussion on professional evaluation and examination.
Funding
NIH R01 GM121457
NIH R35 GM145319
Departmental
Footnotes
Conflicts
Dr. Connor has consulted for Teleflex, LLC on issues regarding airway management and device design and for General Biophysics, LLC on issues regarding pharmacokinetics. These activities are unrelated to the material in this manuscript. The author also holds the following patents related to aspects of the use of computers in the practice of anesthesiology. The material in this article does not pertain to these inventions.
US Patent 8460215 Systems and methods for predicting potentially difficult intubation of a subject
US Patent 9113776 Systems and methods for secure portable patient monitoring
US Patent 9549283 Systems and methods for determining the presence of a person
Prior Presentations
Presented in part by the author at the 2022 ASA Annual Meeting in New Orleans as part of a 60-minute Fundamentals of Anesthesiology lecture entitled “Artificial Intelligence and Machine Learning for Anesthesiologists”
References
1. Connor CW: Artificial Intelligence and Machine Learning in Anesthesiology. Anesthesiology 2019; 131: 1346–1359
2. Hashimoto DA, Witkowski E, Gao L, Meireles O, Rosman G: Artificial Intelligence in Anesthesiology: Current Techniques, Clinical Applications, and Limitations. Anesthesiology 2020; 132: 379–394
3. Kharasch ED: Non-Peer-reviewed Preprint Articles as References in Anesthesiology: Reply. Anesthesiology 2021; 134: 821
4. Rider RE: A mathematician: Alan Turing. Science 1984; 223: 807
5. Ouyang H, Meng F, Liu J, Song X, Li Y, Yuan Y, Wang C, Lang N, Tian S, Yao M, Liu X, Yuan H, Jiang S, Jiang L: Evaluation of Deep Learning-Based Automated Detection of Primary Spine Tumors on MRI Using the Turing Test. Front Oncol 2022; 12: 814667
6. Shaukat Z, Farooq QUA, Tu S, Xiao C, Ali S: A state-of-the-art technique to perform cloud-based semantic segmentation using deep learning 3D U-Net architecture. BMC Bioinformatics 2022; 23: 251
7. Fedus W, Zoph B, Shazeer N: Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 2022; 23: 1–39
8. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, Payne P, Seneviratne M, Gamble P, Kelly C, Babiker A, Scharli N, Chowdhery A, Mansfield P, Demner-Fushman D, Aguera YAB, Webster D, Corrado GS, Matias Y, Chou K, Gottweis J, Tomasev N, Liu Y, Rajkomar A, Barral J, Semturs C, Karthikesalingam A, Natarajan V: Large language models encode clinical knowledge. Nature 2023; 620: 172–180
9. Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu PJ: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 2020; 21: 1–67
10. Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, Faix DJ, Goodman AM, Longhurst CA, Hogarth M, Smith DM: Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern Med 2023; 183: 589–596
11. Tiu E, Talius E, Patel P, Langlotz CP, Ng AY, Rajpurkar P: Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat Biomed Eng 2022; 6: 1399–1406
12. Narla A, Kuprel B, Sarin K, Novoa R, Ko J: Automated Classification of Skin Lesions: From Pixels to Practice. J Invest Dermatol 2018; 138: 2108–2110
13. Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, Liu PJ, Liu X, Marcus J, Sun M, Sundberg P, Yee H, Zhang K, Zhang Y, Flores G, Duggan GE, Irvine J, Le Q, Litsch K, Mossin A, Tansuwan J, Wang D, Wexler J, Wilson J, Ludwig D, Volchenboum SL, Chou K, Pearson M, Madabushi S, Shah NH, Butte AJ, Howell MD, Cui C, Corrado GS, Dean J: Scalable and accurate deep learning with electronic health records. NPJ Digit Med 2018; 1: 18
14. Dripps RD, Lamont A, Eckenhoff JE: The role of anesthesia in surgical mortality. JAMA 1961; 178: 261–6
15. Horvath B, Kloesel B, Todd MM, Cole DJ, Prielipp RC: The Evolution, Current Value, and Future of the American Society of Anesthesiologists Physical Status Classification System. Anesthesiology 2021; 135: 904–919
16. Li G, Walco JP, Mueller DA, Wanderer JP, Freundlich RE: Reliability of the ASA Physical Status Classification System in Predicting Surgical Morbidity: a Retrospective Analysis. J Med Syst 2021; 45: 83
17. Hope CE, Lewis CD, Perry IR, Gamble A: Computed trend analysis in automated patient monitoring systems. Br J Anaesth 1973; 45: 440–9
18. Harrison MJ, Connor CW: Statistics-based alarms from sequential physiological measurements. Anaesthesia 2007; 62: 1015–23
19. Hatib F, Jian Z, Buddi S, Lee C, Settels J, Sibert K, Rinehart J, Cannesson M: Machine-learning Algorithm to Predict Hypotension Based on High-fidelity Arterial Pressure Waveform Analysis. Anesthesiology 2018; 129: 663–674
20. Sessler DI, Turan A, Stapelfeldt WH, Mascha EJ, Yang D, Farag E, Cywinski J, Vlah C, Kopyeva T, Keebler AL, Perilla M, Ramachandran M, Drahuschak S, Kaple K, Kurz A: Triple-low Alerts Do Not Reduce Mortality: A Real-time Randomized Trial. Anesthesiology 2019; 130: 72–82
21. Huecker M: The Deliberate Practice of Medicine. J Grad Med Educ 2018; 10: 599–600
22. Liu N, Luo K, Yuan Z, Chen Y: A Transfer Learning Method for Detecting Alzheimer’s Disease Based on Speech and Natural Language Processing. Front Public Health 2022; 10: 772592
23. Schrittwieser J, Antonoglou I, Hubert T, Simonyan K, Sifre L, Schmitt S, Guez A, Lockhart E, Hassabis D, Graepel T, Lillicrap T, Silver D: Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 2020; 588: 604–609