Annals of Surgery Open. 2021 Sep 7;2(3):e091. doi: 10.1097/AS9.0000000000000091

A Surgeon’s Guide to Machine Learning

Daniel T Lammers, Carly M Eckert, Muhammad A Ahmad, Jason R Bingham, Matthew J Eckert
PMCID: PMC10455424  PMID: 37635814

Supplemental Digital Content is available in the text.

Keywords: artificial intelligence, machine learning, surgery

Abstract

Machine learning (ML) represents a collection of advanced data modeling techniques beyond the traditional statistical models and tests with which most clinicians are familiar. While a subset of artificial intelligence, ML is far from the science fiction impression frequently associated with AI. At its most basic, ML is about pattern finding, sometimes with complex algorithms. The advanced mathematical modeling of ML is seeing expanding use throughout healthcare and increasingly in the day-to-day practice of surgeons. As with any new technique or technology, a basic understanding of principles, applications, and limitations is essential for appropriate implementation. This primer is intended to provide the surgical reader with an accelerated introduction to applied ML and considerations for potential research applications or for reviewing publications that include ML techniques.

INTRODUCTION

Surgeons are faced with the challenge of prognostication every day. Whether assessing a patient’s perioperative risk in clinic or detecting critical illness in a postoperative patient, surgeons must accurately and quickly assess risk while making diagnostic and therapeutic decisions in complex situations. Traditionally, physicians have been limited to their bedside gestalt and rule-based scoring systems to risk-stratify patients based on a limited number of input variables. While clinical gestalt represents a unique tool physicians draw upon, its nonuniform nature highlights the need for more data-driven approaches. Current prognostic and risk-stratification tools, most of which are rule-based, are not designed to compute complex, nonlinear relationships, leading to potentially inaccurate models that cannot capture physiologic interactions. These considerations, along with recent technological advancements such as the widespread adoption of electronic health records (EHRs) that collect vast amounts of health data, support the incorporation of advanced analytics in healthcare through ML. As clinicians strive to provide the best care possible for their patients, they may be overwhelmed by the sheer volume and complexity of medical information and available data, further demonstrating the unique role of ML in clinical practice.1

WHAT IS MACHINE LEARNING?

ML represents a subset of artificial intelligence that uses algorithms to develop models based on a wide array of data inputs, known as covariates. ML models are flexible and can account for complex, nonlinear relationships, unlike traditional statistical models, which require the data to be fit to a predefined model structure.2 ML-derived prognostic models can navigate and analyze large quantities of data that would otherwise be too complex and multidimensional for standard statistical analysis.3 In doing so, ML approaches have demonstrated the ability to identify new relationships and associations within large datasets thought to be undetectable by human observation, while offering the ability to create complex, nonlinear models that can encompass hundreds to thousands of covariates.4 These models can iteratively learn and improve their performance as new data are added, enabling a dynamic system to augment clinical care. Not only can these techniques be used to assess traditional data such as vital signs and laboratory values, but they also allow for the analysis of diverse data types, including text, image, video, and sound files.5 These data types historically presented difficulties for scientific evaluation; their use now allows new opportunities for data analysis in healthcare. ML models offer the additional advantage of further flexibility when defining relationships, as they can account for a multitude of varying patient cohorts within a larger dataset.

While ML provides a robust set of data analysis tools, the associated complexities present a dilemma for clinical use. Crossover between clinicians and data scientists remains a rarity, as each profession requires a deep understanding of its respective field. However, without collaboration between both parties, there remains the potential for incorrect usage and misinterpretation.6 For these reasons, a basic understanding of both the ML process and the clinical problem being evaluated is needed to ensure the data, techniques, and interpretations are aligned.7 These concerns are heightened as many statistical packages begin to incorporate basic ML techniques via user-friendly toolkits, making them accessible to novice researchers.8 This ready availability poses potential threats to the validity of model results, interpretations, and applications.9 Although several articles have been published that review ML models in healthcare and aim to distill ML concepts for the clinical reader, there remains a gap in educating physicians, particularly surgeons, on reading and interrogating ML research. This paper aims to address this gap by providing the reader with an introductory guide to understanding the foundational techniques, interpretations, and implementations of ML in healthcare.

DEFINE THE USE CASE

One of the most important questions to ask when considering the use of technology is, “Do we need it?” ML algorithms can be applied to most data in medicine. However, simply because a tool is available does not mandate its use. The clinical scenario in question and the related prognostic models currently in use should be critically assessed before ML-based model development. ML algorithms may produce a model that is overly complicated for the clinical situation, creating a scenario better suited to more traditional methods. This scrutiny is also needed in cases with well-established, high-functioning models already in use. One such example is a simple bedside scoring system with high sensitivity, specificity, and predictive accuracy for massive transfusion protocol activation in trauma.10 Would a complex ML model offering 90% predictive accuracy offer a relevant advantage over the simplified model’s 85% predictive accuracy? In some use cases, this may be relevant, but for many clinical scenarios, it would be unnecessary.

Beyond model performance, what other factors (such as model transparency, parsimony, and clinician understanding) may also be affected by substituting this model with a more complex one? An honest assessment of the potential benefits ML provides, along with any accompanying trade-offs, should be made to ensure implementation yields a clinically relevant gain. When assessing ML-based literature, it remains vital that the authors address the need or the reasons for applying ML to the problem. There should be some mention of a current baseline or process to demonstrate the potential added value of ML, rather than merely applying ML techniques for a new publication.

An example surgical use case for possible ML applicability follows:

How is early postoperative sepsis currently identified? While multiple models exist for identifying septic patients during the initial presentation, few account for the potential clinical variables found in early post-surgical patients. Indeed, no unified scoring system is universally used, and diagnosis is often reliant upon the interpretation by numerous healthcare providers with disparate perspectives. While sepsis scoring systems exist and have been implemented into multiple EHRs, their frequent misclassification can lead to alarm fatigue and missed diagnoses.

This example points to the potential opportunity to explore an ML-based model, particularly when the performance metrics of previously accepted prognostic models are insufficient to discriminate between septic and nonseptic patients. The reader should ascertain the benefit of the newly defined ML model via direct comparison to prior metrics. This comparison should be scrutinized, as differences in the patient populations used to create each model may lead to unfair or inaccurate comparisons.3 For example, a postoperative sepsis model trained and tested on emergency general surgery patients may be very different from one based on a cohort of elective surgical or multisystem trauma patients.

As previously mentioned, data scientists and physicians historically have not collaborated frequently across their respective fields. As such, ML-based studies remain susceptible to gaps in the translation of clinical problems to the data or vice versa. An initial critical question that must be posed remains, “Does this approach make clinical sense?” For instance, variables such as hair length or clothing color are unlikely to contribute to the early identification of septic patients. However, there may be spurious correlations in the data that lead an ML model to conclude that such covariates are predictive of the outcome of interest. While large-scale datasets allow these factors to be included as inputs, their clinical plausibility must be established before implementation, even if they appear to yield meaningful results during evaluation. There are many examples of spurious covariate inclusion, such as racialized algorithms and their long-term, disparate effects.11 It should also be noted that, at their most basic, ML models find associations between covariates and outcomes. Clinicians need to understand that, unless specifically defined, ML models are not determining causal relationships.

DATA MANIPULATION AND PREPARATION

Once the initial questions regarding the appropriate use and implementation of ML have been addressed, evaluating the data remains an essential and time-consuming activity for data scientists and research teams. Publication authors should present an accurate and digestible description of the methods used. This description should include explicit references to the data sources and methods used to extract the data (eg, data extraction via chart review versus utilization of previously defined databases). As opposed to traditional statistical analyses in which models may contain a limited set of predictors, ML-based models may evaluate hundreds to thousands of potential covariates. As a supplement to the article, the authors should make available a list of these covariates. A discussion regarding inclusion and exclusion criteria should be available for the reader to assess the model’s clinical relevance. Outcomes should be clearly defined, and their classification specified (eg, postoperative sepsis defined by the following ICD-10 codes occurring during the same inpatient encounter as, and following, procedures occurring with this inclusive set of CPT codes). This level of detail is necessary due to the complexity of these studies and the need to ensure appropriate analysis and clinical relevance. As ML is still relatively new to healthcare research, experienced and novice consumers should be afforded full transparency into model development. ML models are often termed “black boxes” for their opacity to those not involved in their development. Methods to increase transparency, from development to implementation, are an emerging field in machine learning research.

As many of the targeted outcomes within these studies are represented by rare events, it remains common for datasets to be imbalanced. For instance, in a subset of 1,000 surgical patients, perhaps 20–30 patients may develop signs of potential postoperative sepsis, representing 2–3% of the population. Specifying the frequency of the rare outcome, or positive class, is critical for the reader to understand the degree of class imbalance within the study cohort.12 Although class imbalance can pose a unique challenge, modeling techniques that artificially create a more balanced dataset (ie, positive class roughly equal to negative class) using oversampling and under-sampling have been described.13 Frequently, the minority class, or the class with the lowest percentage, represents the outcome of interest. The algorithms must have adequate opportunities to assess and learn from data points associated with both the positive and negative outcomes. Strategies exist to address severe class imbalance, which occurs when the number of instances of the minority class is too low for effective training. Oversampling, which artificially increases the minority class, is frequently used. With one such technique, the synthetic minority oversampling technique (SMOTE), the minority class is expanded without altering the size of the majority class; in this way, no information is lost.13 Under-sampling, on the other hand, removes a portion of the majority class to improve the balance between classes. Because only a portion of the majority class is retained, the researchers risk omitting critical information from their modeling. Under-sampling techniques do have a role, such as when data points of the majority class are strongly clustered or when combined with oversampling of the minority class. Although the details of the various oversampling and under-sampling techniques are outside the scope of this paper, clinicians should be aware of these broad approaches when reading ML literature and when working with data scientists to develop their models.
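To make these balancing steps concrete, the following is a minimal sketch, assuming Python with scikit-learn and the open-source imbalanced-learn package, of combining SMOTE oversampling with random under-sampling on a simulated rare-outcome cohort. The data, variable names, and sampling ratios are illustrative assumptions, not the workflow of any particular study.

```python
# Sketch: rebalancing a rare-outcome dataset (roughly 3% positive class)
# with SMOTE oversampling followed by random under-sampling.
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Simulated cohort: 1,000 encounters, ~3% positive (septic) class
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.97], random_state=42)
print("Original class counts:", np.bincount(y))

# Oversample the minority class to half the size of the majority class...
X_over, y_over = SMOTE(sampling_strategy=0.5, random_state=42).fit_resample(X, y)
# ...then trim the majority class so the minority:majority ratio is 0.8
X_res, y_res = RandomUnderSampler(sampling_strategy=0.8, random_state=42).fit_resample(X_over, y_over)
print("Resampled class counts:", np.bincount(y_res))
```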

Cost-sensitive learning is another approach to mitigating the effect of an imbalanced data set. With cost-sensitive learning, the model is penalized more for certain errors than for others (eg, the model is penalized more for a false negative than for other classification errors). By learning to minimize the misclassification cost, the algorithm moderates the effect of class imbalance.
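As a minimal sketch of cost-sensitive learning, the example below, assuming Python and scikit-learn, weights errors on the rare positive class more heavily during training; the 10:1 penalty is an arbitrary illustrative choice, not a validated clinical weighting.

```python
# Sketch: cost-sensitive learning via class weights in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Simulated imbalanced cohort (~3% positive class)
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.97], random_state=0)

# Penalize errors on the rare positive (septic) class 10x more heavily;
# class_weight="balanced" would instead weight inversely to class frequency.
model = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000)
model.fit(X, y)
```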

When obtained from large databases and chart reviews, real-life medical data are complex, noisy, and incomplete. The authors should document the degree of missing data in the original dataset, which may hold important clinical implications. The patterns of missing data should also be explored because of their implications for imputation. For example, data missingness should be evaluated as Missing Completely at Random, Missing at Random, or Missing Not at Random. One of the most straightforward approaches to addressing missing data is complete case analysis, in which any instance with a missing data element is removed from the dataset to be analyzed. Of course, this can often lead to the removal of a large portion of rows in a clinical dataset. Unless these data elements are missing completely at random, sampling bias within the complete cases will be high and may alter the analysis. Instead of removing incomplete data, many techniques exist to enable the analysis of datasets with missing data points, including various forms of imputation. Single imputation methods include carry forward methods, in which the last or most recent value for a particular variable is imputed for the missing instance. Another approach is mean or median imputation, in which the statistic is calculated across the known instances and used for the missing value. While this approach leverages some information about the distribution of the value across known instances, it reduces the inherent variability in the dataset by assigning a single value to all missing instances. K-nearest neighbors (KNN) imputation can also be used. While still a single imputation method, this approach allows the imputed value to vary across instances. For KNN imputation, when an instance (in this case, a patient encounter) has a missing value for a specific variable, the algorithm identifies the k nearest instances that do have a value for the variable of interest, based on how close, via Euclidean distance, those instances are to the one of interest.14 The values of these k nearest neighbors for the parameter of interest are then averaged and assigned to the instance with the missing value. Multiple imputation by chained equations (MICE) is a more sophisticated approach that has gained traction over recent years.15 This approach generates multiple complete datasets (usually 5 to 10) from the incomplete dataset by developing regression models for each parameter with missingness. Each of the imputed datasets is then used as the substrate for the predictive model, and the results are pooled across all complete datasets to determine the overall results.
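The following sketch, assuming Python with pandas and scikit-learn, shows how mean imputation, KNN imputation, and a MICE-style chained-equation imputation might each be applied to a small, entirely hypothetical table of vital signs and laboratory values.

```python
# Sketch: three imputation strategies on a hypothetical vitals/labs table.
# Column names and values are invented for illustration only.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer  # MICE-inspired chained equations

df = pd.DataFrame({
    "heart_rate": [88, 112, np.nan, 95, 101],
    "lactate":    [1.2, np.nan, 3.4, np.nan, 2.1],
    "wbc":        [9.5, 14.2, 16.8, np.nan, 11.0],
})

mean_imputed = SimpleImputer(strategy="mean").fit_transform(df)   # single value per column
knn_imputed  = KNNImputer(n_neighbors=2).fit_transform(df)        # average of the k most similar rows
mice_imputed = IterativeImputer(random_state=0).fit_transform(df) # regression model per column;
                                                                  # note this returns one imputed dataset,
                                                                  # whereas full MICE pools several
```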

An additional and beneficial method for imputation draws on the concept of informed missingness using binary indicators. This method can be used when we assume that data are missing not at random, as is often the case with clinical data. Consider when laboratory results are missing for a particular patient’s encounter. If a 34-year-old woman being evaluated for abdominal pain does not have a lactate result in the EHR, we would not say that the result is “missing.” It was simply not ordered because it was not clinically indicated. Instead of using the previously mentioned strategies to impute a value here, it can be more useful to transform this feature into a binary variable indicating whether the patient had a lactate drawn.
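A minimal sketch of this indicator approach, assuming Python with pandas and a hypothetical lactate column, follows.

```python
# Sketch: informed missingness as a binary indicator. Rather than imputing
# a lactate value that was never ordered, record whether it was ordered.
# The DataFrame and column name are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({"lactate": [1.2, np.nan, 3.4, np.nan, 2.1]})
df["lactate_ordered"] = df["lactate"].notna().astype(int)  # 1 if a result exists, 0 otherwise
```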

In summary, approaches to missing data range from “carry forward” methods and mean- or median-value imputation to feature similarity via K-nearest neighbor algorithms and more complex techniques such as MICE and various deep learning approaches.14 Many advanced methods for handling missing data exist, and authors should understand the pros and cons of each before implementation.

Given the large datasets to which ML methods are usually applied, and the robust nature of these algorithms compared with traditional statistical models, more features can be explored. While these algorithms can manage a large number of predictors, the authors may also reference feature selection methods in the manuscript. Reducing the number of predictors in a model may be preferred from the clinical perspective, as providers may perceive parsimonious models as more interpretable.16 Feature selection methods are numerous and may include filter or wrapper methods, information gain, or backward selection.17 In contrast, neural networks can use the complete set of raw data, omitting the need for manual feature selection. The deep learning approach can be useful when clinical expertise is unavailable or when the purpose of the modeling is to discover previously unknown relationships between data elements and the outcome of interest.
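As one illustration, the sketch below, assuming Python and scikit-learn, applies a filter-style selection step that ranks candidate covariates by mutual information (an information-gain criterion) and retains the top 10; the simulated data and the choice of 10 features are assumptions for demonstration only.

```python
# Sketch: filter-based feature selection keeping the 10 most informative
# of 100 candidate covariates, scored by mutual information.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=2000, n_features=100, n_informative=10, random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print("Selected feature indices:", selector.get_support(indices=True))
```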

Model Selection and Development

While an in-depth understanding of the various ML techniques provides an advantage for clinicians interpreting ML-based manuscripts, this is neither necessary nor realistic. A basic understanding of ML concepts should be sufficient to guide clinicians through most “Methods” sections they encounter. Numerous resources are available should one be inclined to further their knowledge, and additional foundational ML education is readily available (Supplemental Data 1, http://links.lww.com/AOSO/A56). Most ML methods fall into one of four groups: supervised learning, unsupervised learning, semisupervised learning, and reinforcement learning.3 While the focus of this primer is on supervised learning, we will briefly introduce each of these groups.

Supervised learning algorithms represent the most common ML techniques for predictive models in healthcare. When using these methods, the outcome is identified a priori by the research team and is available as a labeled data field within the dataset.3 Using many covariates, or predictor variables, the supervised ML algorithm navigates the supplied examples of data to identify an optimal map from these covariates to the defined outcome. In the postoperative sepsis example, we identified inpatient encounters during which surgical procedures occurred. Encounters in which the patient develops postoperative sepsis define our positive class. Predictor variables would be drawn from patient demographics, comorbidities, operation type, vital signs measurements, and laboratory results to create a model to predict if a patient will become septic or not. Additional patient encounters provide the model more examples to learn from, generally resulting in improved model performance. This improvement continues until a plateau is reached, which is an area of active research in ML. The exception to this performance plateau is deep learning models whose performance can continue to improve with additional data.3 Supervised ML approaches can predict categorical outcomes (eg, binary “septic”/“not septic”) and continuous outcomes via regression algorithms (eg, length of hospital stay among septic patients). The most common supervised ML approaches are outlined in Table 1.

TABLE 1.

Common Supervised Machine Learning Algorithms

Algorithm Types Brief Description
Decision tree An ML model in the form of a tree with a sequence of nodes, each representing either a decision based on particular values of a variable or a terminal value of the dependent variable
Logistic regression A statistical model that uses a logistic function to model a binary dependent variable
Naive Bayes An ML model based on applying Bayes’ theorem with the naive assumption of conditional independence between all pairs of independent variables given the dependent variable
Support vector machines An ML model that classifies instances by creating an optimal boundary between variables after mapping them to a higher dimensional space
Ensemble methods A set of methods that use multiple learning models to obtain better predictive performance than could be obtained from any of the constituent models alone
Random forest An ensemble learning method that employs multiple, even hundreds or thousands, of decision trees as its constituent models
Neural networks Also known as deep learning; inspired by biological neural networks in human and animal brains. These models can consider the sequence of clinical events in predictive modeling
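For orientation, the following is a minimal sketch, assuming Python and scikit-learn, of fitting two of the supervised algorithms from Table 1 (logistic regression and a random forest) to simulated, labeled tabular data standing in for the postoperative sepsis example; it is illustrative only, not a validated clinical model.

```python
# Sketch: two supervised algorithms from Table 1 fit to simulated, labeled
# data standing in for postoperative encounters.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# ~5% positive (septic) class across 5,000 simulated encounters
X, y = make_classification(n_samples=5000, n_features=30, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200, random_state=0)):
    model.fit(X_train, y_train)               # learn the covariate-to-outcome map
    prob = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class
    print(type(model).__name__, "AUROC:", round(roc_auc_score(y_test, prob), 3))
```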

With unsupervised learning, the dataset is unlabeled; that is, there is no outcome variable identified in the data. With this approach, the algorithm identifies commonalities between different instances and attributes in the data; for example, it may determine associations in the data or group instances into clusters based on like characteristics. In clinical use cases, unsupervised learning can be useful for identifying otherwise unknown or unsuspected patterns among patient encounters that may highlight variation in care delivery or patient outcomes.
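A minimal sketch of unsupervised clustering, assuming Python and scikit-learn with simulated data and an arbitrary choice of three clusters, follows.

```python
# Sketch: unsupervised learning with k-means clustering. No outcome label
# is used; instances are grouped purely by similarity of their features.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=3, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)  # scale features before distance-based clustering

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print("Instances per cluster:", [int((clusters == c).sum()) for c in range(3)])
```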

Semisupervised learning is a type of machine learning that can be considered halfway between supervised and unsupervised learning. It is employed in data settings that include both labeled and unlabeled data. In most cases of semisupervised learning, the labeled data constitutes a minuscule percentage of all data instances. With this approach, a combination of supervised and unsupervised methods are used to uncover patterns and develop prediction models according to the needs of the use case.

Reinforcement learning is a type of machine learning in which the algorithm learns from the surrounding environment. Instead of making a single decision or prediction, it involves a sequence of decisions. The algorithm gets feedback based on its actions—in the form of rewards or penalties. This feedback enables the algorithm to update its response or strategy in the environment. Reinforcement learning can be useful in contexts where significant heterogeneity in patient characteristics is observed.

Deep learning is an approach to machine learning that uses algorithms called artificial neural networks. Deep-learning models can use supervised, unsupervised, semisupervised, or reinforcement learning. A significant difference between traditional ML methods and deep learning is that the former often requires labor-intensive preprocessing of data to prepare domain-relevant variables in a format usable by the algorithm, whereas deep learning models can learn useful representations directly from raw data. Deep learning models can be far more complex than traditional ML models, often containing thousands or even millions of parameters. Given this complexity, humans cannot readily understand how these models arrive at their predictions. The use of deep learning models in healthcare must therefore balance their potentially high accuracy against the relative opacity of their mechanisms.
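As a minimal illustration of the idea, the sketch below, assuming Python and scikit-learn, fits a small feed-forward neural network (multilayer perceptron) to simulated tabular data; clinical deep learning work typically relies on dedicated frameworks and far larger architectures.

```python
# Sketch: a small feed-forward neural network on tabular data.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

net = make_pipeline(
    StandardScaler(),  # neural networks are sensitive to feature scale
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
)
net.fit(X, y)
```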

Once an algorithm has been selected, further data evaluation is required to ensure the proposed ML algorithm is appropriate for use. Two of the most frequently described methods to accomplish this are hold-out testing and cross-validation. Simply put, hold-out testing requires a user-defined split of the dataset into training and testing cohorts. Frequently, the training cohort comprises 70% to 80% of the original dataset, while the test cohort comprises the remaining cases. The authors should specify how this division occurred, as the technique should be deliberately chosen based on the study at hand. While some approaches aim to maintain equal ratios of “positive” and “negative” outcomes between the training and testing cohorts, others create these subsets based on artificial divisions such as temporal indicators or random selection. When temporal divisions are used, readers should understand that changes in practice patterns, clinical definitions, or administrative coding over the dataset timeline may influence data classification (eg, the transition from ICD-9 to ICD-10 coding or the adoption of Sepsis-3 criteria). An equal ratio of positive class prevalence between training and testing cohorts is generally encouraged for adequate representation.3 Depending on the use case, the research team may also consider the distribution of certain demographic features (eg, race, ethnicity, or insurance status) between the training and testing sets to ensure broad representation. If patients are represented multiple times in the dataset, all data related to a specific patient should be confined to either the training set or the testing set; otherwise, target leakage may occur. Following this process, the model is built on the training set and subsequently assessed on the testing set.
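A minimal sketch of a hold-out split, assuming Python and scikit-learn, is shown below; it stratifies by outcome to preserve positive-class prevalence and, when patients contribute multiple encounters, keeps each (hypothetical) patient entirely within the training or testing set to avoid target leakage.

```python
# Sketch: an 80/20 hold-out split with stratification, plus a grouped split
# that keeps all encounters from one patient on the same side of the split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GroupShuffleSplit, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.95], random_state=0)

# Simple stratified split (one row per patient)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Grouped split when patients contribute multiple encounters
patient_ids = np.random.default_rng(0).integers(0, 300, size=len(y))  # hypothetical patient IDs
train_idx, test_idx = next(
    GroupShuffleSplit(test_size=0.2, random_state=0).split(X, y, groups=patient_ids)
)
```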

Cross-validation, on the other hand, represents a technique in which the entirety of the dataset is split into k groups, with k representing a user-defined number of divisions, typically between 5 and 10.3 For example, if k = 10, the dataset would be divided into ten equal groups. The chosen algorithm is then trained using k-1 groups and tested on the remaining group. This process is repeated k times until each group has served as the test cohort. Metrics across the tested rounds are then averaged to obtain overall model performance metrics. Both techniques offer their own advantages, and each has been well described within the literature. Cross-validation is frequently regarded as providing a better indication of model performance because it analyzes numerous train-test groupings, whereas hold-out testing is traditionally used on larger datasets because it is more computationally efficient. Furthermore, hold-out testing offers the theoretical advantage of providing results from truly unseen data, as the testing cohort never crosses over into the training cohort, whereas in cross-validation the final output is an aggregate of results over multiple train-test groupings.
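The following is a minimal sketch, assuming Python and scikit-learn, of stratified 10-fold cross-validation with fold-level metrics averaged into an overall estimate; the simulated data and choice of model are illustrative.

```python
# Sketch: stratified k-fold cross-validation (k = 10). Each fold serves once
# as the test set, and fold-level metrics are averaged.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, n_features=25, weights=[0.9], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv, scoring="roc_auc")
print("Mean AUROC across folds:", round(scores.mean(), 3))
```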

Evaluating Performance

When evaluating ML-based research, the reader should consider multiple performance metrics:

  • How well does the model discern the true positives from the false positives (precision)?

  • How well does it discern the true positives from the false negatives (recall)?

  • How well does it risk stratify the scored cohort (calibration)?

  • Is the model explainable?

  • Is it fair and free of bias?

Given the pitfalls of reporting model accuracy discussed below, no single metric should be reported in isolation. Unless the paper is validating an existing model or reporting results from a prospectively scored system, the initial results assess the model’s performance on the testing cohort. The performance on the testing cohort describes how well the model predicts the outcome among novel examples of data. Despite the complexities of the ML models described here, there are strategies to explain ML model outputs to clinical users. At their simplest, these strategies provide clinicians with a list of the covariates determined to be “most significant” to the model in generating its predictions. These explanations can be provided for the model across a population, a patient cohort, or an individual patient.16

Class imbalance, explained earlier in the manuscript, also affects performance metrics and the choice of metrics.12 This point can be illustrated by the above example assessing postoperative sepsis. If the prevalence of postoperative sepsis is 3%, an ML model that classifies every patient as nonseptic would display 97% accuracy. However, this same model has 0% precision (ie, positive predictive value) and 0% recall (ie, sensitivity). The reader should critically appraise the data and the chosen performance metrics for their statistical salience and clinical implications. Model performance can be optimized for specific metrics if clinically indicated (eg, recall can be prioritized if false negatives must be minimized). In addition to the discriminative capability of the model, the calibration of the model should also be described. Calibration refers to how closely a model’s predicted risk of an outcome matches the actual observed outcome among those being scored. For example, if a model predicts that a group of patients has a 10% risk of developing sepsis, calibration asks how closely that predicted risk matches the observed outcome in the group (ie, did 10% of that group develop sepsis?). This metric can be particularly important in clinical prediction models where patient-level risk estimates may influence clinical decisions.
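To ground these metrics, the sketch below, assuming Python and scikit-learn with simulated data, shows how a model that labels every patient nonseptic can appear highly accurate while having no recall, and how precision, recall, and a simple calibration check can be computed.

```python
# Sketch: why accuracy alone misleads with a ~3% positive class, and how
# precision, recall, and calibration can be inspected.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.97], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3, random_state=0)

# A "model" that calls every patient nonseptic is ~97% accurate but useless.
naive = np.zeros_like(y_te)
print("Naive accuracy:", round(accuracy_score(y_te, naive), 3),
      "recall:", recall_score(y_te, naive, zero_division=0))

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)
prob = model.predict_proba(X_te)[:, 1]
print("Precision:", round(precision_score(y_te, pred, zero_division=0), 3),
      "Recall:", round(recall_score(y_te, pred, zero_division=0), 3))

# Calibration: do predicted risks match observed event rates per risk band?
frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=5)
print("Mean predicted risk per bin:", np.round(mean_pred, 2))
print("Observed event rate per bin:", np.round(frac_pos, 2))
```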

Model performance can be further evaluated by testing on a validation dataset. While validation data may take many forms, they are usually collected prospectively from the same setting as the training and testing sets, or they may represent data from a different setting entirely. While performance metrics similar to those seen in model testing strengthen the model’s case, poor performance suggests that the devised model may not generalize to external populations.3 Model performance generally degrades once a model is placed into a live clinical environment due to various factors (eg, changes in the characteristics of the underlying population, clinical definitions, or provider practice patterns). In addition to general model metrics, there may be clinical relevance to model performance across cohorts. Models may perform differently across patient groups defined by age, sex, race, and so on. This differential performance may require additional processing to ensure fairness. The result of these factors is that ML models in a live clinical setting require ongoing maintenance to ensure consistent results.

Due to the ability to continuously input new data, ML algorithms offer the unique opportunity to evolve with new cases. This ability to learn from newly input training data frequently improves overall model performance; however, it is not without potential risks. Newly input data should be scrutinized to ensure the model is still being trained toward the original target. For instance, suppose an ML model was implemented within a large hospital to predict patients at risk for postoperative sepsis. Following initial model training, high precision and recall were noted. This subsequently allowed providers to intervene on modifiable risk factors earlier in the patient’s hospital course, resulting in decreased rates of postoperative sepsis. While this model appears to have succeeded in its goal, adding new data collected after implementation to the training set would create a problematic scenario: because the model’s own predictions now influence the outcomes it is learning from, retraining on these data can erode the very associations that made the model useful.18 For these reasons, the assessment of ML-based algorithms remains complicated but achievable with appropriate forethought and insight. Commitment to future model maintenance is essential when ML is applied in dynamic systems, such as those potentially affecting patient care.

CONCLUSIONS

ML offers the ability to expand upon our current knowledge and improve healthcare-related research. Although these techniques may appear cumbersome at first, many open-source resources and introductory materials are readily available. Furthermore, building partnerships and collaborations with members of the data science community will help improve healthcare researchers’ understanding of these topics and drive innovation for improving healthcare outcomes. As such, when feasible and appropriate, incorporating these techniques into the surgical research armamentarium should be encouraged to explore and advance this promising, yet relatively unexplored, area of study.

Supplementary Material

as9-2-e091-s001.pdf (9.9KB, pdf)

Footnotes

C.M.E. and M.A.A. are equity owners of KenSci Inc., Seattle, WA. No proprietary or product applications are discussed in this article, designed to be a purely educational publication. M.J.E. is spouse of C.M.E. The views expressed are those of the authors and do not reflect the official policy or position of the Department of the Army, the Department of Defense, or the US government.

D.T.L. and C.M.E. did research design and writing of paper. J.R.B. and M.A.A. did writing of paper. M.J.E. did research design and writing of paper.

Supplemental digital content is available for this article. Direct URL citations appear in the printed text and are provided in the HTML and PDF versions of this article on the journal’s Web site (www.annalsofsurgery.com).

REFERENCES

  • 1.Obermeyer Z, Lee TH. Lost in thought—The limits of the human mind and the future of medicine. NEJM. 2017;377:1209–1211. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Breiman L. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat Sci. 2001;16:199–231. [Google Scholar]
  • 3.Pang-Ning T, Steinbach M, Kumar V. Introduction to Data Mining. Pearson Education India; 2016. [Google Scholar]
  • 4.Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in big data analytics. J Big Data. 2015;2:1–21. [Google Scholar]
  • 5.Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med. 2019;380:1347–1358. [DOI] [PubMed] [Google Scholar]
  • 6.Cao L. Data science: nature and pitfalls. IEEE Intelligent Systems. 2016;31:66–75. [Google Scholar]
  • 7.American Medical Association. AMA passes first policy recommendations on augmented intelligence. 2018. Available at: https://www.ama-assn.org/ama-passes-first-policy-recommendationsaugmented-intelligence. Accessed December 19, 2020.
  • 8.Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in python. JMLR. 2011;12:2825–30. [Google Scholar]
  • 9.Cabitza F, Ciucci D, Rasoini R. A giant with feet of clay: On the validity of the data that feed machine learning in medicine. In: Cabitza F, Batini C, Magni M, eds. Organizing for the Digital World Lecture Notes in Information Systems and Organisation. 2018:121–136. [Google Scholar]
  • 10.Nunez TC, Voskresensky IV, Dossett LA, et al. Early prediction of massive transfusion in trauma: simple as ABC (assessment of blood consumption)? J Trauma. 2009;66:346–352. [DOI] [PubMed] [Google Scholar]
  • 11.Vyas DA, Eisenstein LG, Jones DS. Hidden in plain sight—reconsidering the use of race correction in clinical algorithms. NEJM. 2020;383:874–882. [DOI] [PubMed] [Google Scholar]
  • 12.Guo X, Yin Y, Dong C, Yang G, Zhou G. On the class imbalance problem. 2008 Fourth International Conference on Natural Computation. 2008. doi: 10.1109/icnc.2008.871. [Google Scholar]
  • 13.Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. JAIR. 2002;16:321–357. [Google Scholar]
  • 14.Jerez JM, Molina I, García-Laencina PJ, et al. Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artif Intell Med. 2010;50:105–115. [DOI] [PubMed] [Google Scholar]
  • 15.Rubin DB. Inference and missing data. Biometrika. 1976;63:581–92. [Google Scholar]
  • 16.Ahmad MA, Eckert C, Teredesai A. Interpretable machine learning in healthcare. Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. 2018. doi: 10.1145/3233547.3233667. [Google Scholar]
  • 17.Kuhn M, Johnson K. Applied Predictive Modeling. Springer; 2013. [Google Scholar]
  • 18.Lenert MC, Matheny ME, Walsh CG. Prognostic models will be victims of their own success, unless…. J Am Med Inform Assoc. 2019;26:1645–1650. [DOI] [PMC free article] [PubMed] [Google Scholar]
