Abstract
Electronic health records have facilitated the extraction and analysis of a vast amount of data with many variables for clinical care and research. Conventional regression-based statistical methods may not capture all the complexities in high-dimensional data analysis. Therefore, researchers are increasingly using machine learning (ML)-based methods to better handle these more challenging datasets for the discovery of hidden patterns in patients’ data and for classification and predictive purposes. This article describes commonly used ML methods in structured data analysis with examples in orthopaedic surgery. We present practical considerations in starting an ML project and appraising published studies in this field.
Keywords: Machine Learning, Artificial Intelligence, Tabular Data, Electronic Health Records
Introduction
Electronic health records (EHR) are now commonplace in health systems around the world, recording a tremendous amount of information for every patient encounter. This data is mainly stored in tabular form, similar to an Excel spreadsheet. This structured data can help physicians make more informed decisions based on their patient’s specific needs and conditions(1). These “big data” repositories can also be used for research to better evaluate complex relationships between clinical variables and patient outcomes. With more variables to analyze (i.e., high-dimensional data), one can move to machine learning (ML) methods that can capture complex, previously unknown relationships between many variables(2).
Traditionally, researchers have used regression-based models to find relationships between clinical variables and patient outcomes. For example, the Mayo Clinic score for periprosthetic joint infection (PJI) was created using logistic regression models from six demographic and surgical variables to predict the risk of postoperative infection(3); other models predict postoperative complication risk using ML(4–6). The Mayo score, like many risk assessment scores used in orthopaedic surgery, leverages traditional data analysis methods. These methods yield familiar, interpretable measures such as odds ratios, hazard ratios, and relative risks, which is one reason for their popularity.
Limitations of Conventional Statistics
Although these conventional statistical methods are very powerful, they become less practical when dealing with more than a dozen variables. As a rule of thumb, one should have at least ten occurrences of the outcome event for every variable fed into a regression model(7). Moreover, traditional statistical models such as linear regression may not be powerful enough to capture complex, nonlinear relationships between variables and their “interactions”. Interaction is a statistical term describing a situation in which the effect of one variable on the outcome depends on the level of another variable; regression models require additional statistical methods to detect such interactions. Furthermore, conventional regression-based models typically assume a linear relationship between the variables and the outcome, which is not a valid presumption in many clinical studies(8). For example, body mass index (BMI) has a “J”-shaped effect on the incidence of infection after hip fracture surgery: patients who are underweight or overweight have a higher risk of infection than patients with normal BMI(9). This highlights the need to check for nonlinear associations of continuous variables before relying on conventional regression-based statistical models. ML methods provide more flexibility than conventional statistical models to account for known and unknown interactions and nonlinear relationships between variables.
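To make the nonlinearity point concrete, the following sketch (scikit-learn on synthetic data; the J-shaped risk curve is invented for illustration, not taken from reference 9) shows how a logistic model that enters BMI linearly misses a J-shaped risk pattern, while manually adding a quadratic term recovers it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
bmi = rng.uniform(16, 45, 2000)
# Invented J-shaped risk: lowest near BMI 30, rising toward both extremes.
true_p = 1 / (1 + np.exp(-(0.03 * (bmi - 30) ** 2 - 3)))
infection = rng.random(2000) < true_p

X_lin = bmi.reshape(-1, 1)                        # BMI entered linearly
X_quad = np.column_stack([bmi, (bmi - 30) ** 2])  # plus a quadratic term

auc_lin = roc_auc_score(
    infection,
    LogisticRegression(max_iter=1000).fit(X_lin, infection).predict_proba(X_lin)[:, 1])
auc_quad = roc_auc_score(
    infection,
    LogisticRegression(max_iter=1000).fit(X_quad, infection).predict_proba(X_quad)[:, 1])
print(f"linear-only AUC: {auc_lin:.2f}; with quadratic term: {auc_quad:.2f}")
```

The quadratic term must be specified by the analyst in advance, which is exactly the kind of manual step ML methods can sidestep.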
Statistics and Machine Learning
Artificial Intelligence (AI) is a general term that encompasses several fields focused on making computer systems perform tasks that traditionally have required human intelligence (Figure 1). ML is a branch of AI with many subtypes, such as deep learning (DL), which frequently uses multi-layer neural networks to fit a general nonlinear continuous function in high-dimensional space. Compared to traditional statistical methods, ML methods are computationally more intensive, apply data-driven approaches to develop model architecture, and require fewer prior assumptions by the researchers.
Figure 1.
The relationship between artificial intelligence, machine learning, and deep learning
ML methods lie on a continuum with statistical methods. A statistical model typically starts from a simpler set of variables and a defined functional form to estimate model parameters and test specific hypotheses on variables of interest. The chosen model may be one of many models that could fit the data and, therefore, may not be the optimal model for the data(10). This pursuit of an optimal model is where machine learning can provide dividends. ML frameworks provide mechanisms to optimize the functional forms and variables included in the model at a scale not generally considered feasible for models defined explicitly by the analyst in traditional statistics. Machine learning-based models try to learn a generalizable pattern in the data that can be applied to unobserved samples. The model determines an optimal combination of variables to maximize model performance with little human input. When training ML models, it is not specified in advance whether the model should find a linear or logistic relationship between the variables and the outcome; the learning process consists of finding this problem-specific solution.
It is important to emphasize that the generalizability of any statistical model depends on how well the training sample represents the overall population. For an ML model specifically, generalizability depends not only on the training sample but also on the features the model learns. ML models with near-perfect performance on their training set have often learned features that let them memorize the label for each sample in the training set. These features are not necessarily present in unseen samples; hence the model may not be generalizable. By analogy, a person who has memorized the answers to a practice exam may not perform well on the real exam.
Machine Learning Applications: Supervised vs. Unsupervised
The tasks that can be performed by ML algorithms fall into two general categories: supervised and unsupervised learning. Supervised learning relies on a ground truth (a label), and the model learns to approximate it. For example, with a dataset recording every instance of implant failure after total hip arthroplasty in a certain population, a supervised learning model could be used to find relationships in the dataset that predict implant failure. Supervised learning is used for classification and regression tasks. A common example of a classification task is outcome prediction, as in the implant failure example above. Classification has many variations. Multi-class classification is choosing one category from more than two available options. Multi-label classification is where each subject may belong to multiple classes (e.g., a patient might have both joint infection and dislocation after a hip arthroplasty). Regression tasks are those whose goal is to output a continuous number, like predicting patient bone age from qualitative characteristics.
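A minimal sketch of the two supervised task types, using scikit-learn on synthetic data (the clinical interpretations in the comments are illustrative assumptions, not real datasets):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: predict a discrete label (e.g., implant failure yes/no).
X_c, y_c = make_classification(n_samples=200, n_features=5, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_c, y_c)
print("predicted class:", clf.predict(X_c[:1])[0])

# Regression: predict a continuous number (e.g., bone age).
X_r, y_r = make_regression(n_samples=200, n_features=5, random_state=0)
reg = RandomForestRegressor(random_state=0).fit(X_r, y_r)
print("predicted value:", round(float(reg.predict(X_r[:1])[0]), 1))
```

The same model family often offers both a classifier and a regressor variant; the choice follows directly from whether the outcome is discrete or continuous.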
Unsupervised learning does not have a ground truth for the problem; instead, it tries to find relationships between the data points to organize the information. Clustering tasks fall under unsupervised learning. For example, suppose we want a model to categorize patients with prosthetic joint infection into meaningful groups based on severity. In this case, we use the model to find features that can help categorize these patients in real practice.
ML Models for Tabular Data Analysis
ML models can be a useful tool for analyzing tabular data, which is typically encoded in a spreadsheet-like structure. For the spreadsheet to be useful for ML modeling, it must have appropriate data in its columns, be clean, and be tidy. The columns should provide data elements (independent variables, or “features”) and at least one outcome of interest (the dependent variable for supervised tasks, or “label”). Tabular data typically has defined variables encoded with a discrete meaning (e.g., age in years). Data cleaning is the process of encoding a discrete meaning for all the variables; for example, all yes/no questions should be converted to ones and zeros for the model to understand them. ML spreadsheets must also be “tidy,” meaning all variables must be contained in one row per patient or per clinical visit.
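As a small sketch of cleaning and tidying with pandas, on a hypothetical EHR extract (all column names and values here are invented):

```python
import pandas as pd

raw = pd.DataFrame({
    "patient_id": [1, 2, 3],            # hypothetical identifiers
    "age_years": [67, 54, 71],          # already discrete and numeric
    "diabetes": ["yes", "no", "yes"],   # yes/no answers stored as text
    "bmi": [31.2, 24.8, None],          # a missing value
})

clean = raw.copy()
clean["diabetes"] = clean["diabetes"].map({"yes": 1, "no": 0})  # encode as 1/0

# Tidy check: exactly one row per patient.
assert clean["patient_id"].is_unique
print(clean)
```

Real extracts need many more such steps (unit harmonization, deduplication, date handling), but the pattern of mapping every column to a single, discrete meaning is the same.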
There are many families of supervised and unsupervised ML models which are applicable for the analysis of tabular data. In this section, we briefly summarize the basic concepts behind different ML methods to help clinical experts better evaluate the potential use of ML for their projects.
Tree-based models
Decision trees are predictive ML models that predict outcome(s) (termed “leaves”) based on observations (termed “branches”). Familiar examples of decision trees are diagnostic algorithms, in which certain workups are ordered based on specific signs and symptoms. In an ML context, decision trees are “data-driven,” meaning that they are learned empirically from a dataset. Trees simplify a dataset into a set of yes/no decisions based on how the branches of the tree interact. Because tree-based models have dichotomous branches, they can easily conform to nonlinearities in variable relationships. Moreover, these models can show interactions between variables. For example, in Figure 2a, the branch pertaining to women has an extra node that categorizes patients based on their age, but there is no such node for men. This demonstrates an interaction between age and sex that may not have been evident in regression-based models.
Figure 2.
(a) A hypothetical decision tree for classifying patients for low and high risk of post-operative fracture, showing the interaction between sex and age. (b) A support vector machine (SVM) drawing a decision boundary between two groups of patients.
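An interaction like the one in Figure 2a can be reproduced in code. In this sketch, a scikit-learn decision tree is fit to synthetic data in which age matters only among women (a rule invented for illustration); the printed tree shows the asymmetric branching:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 1000
female = rng.integers(0, 2, n)   # 1 = female, 0 = male
age = rng.uniform(40, 90, n)
# Invented rule: high risk only among women older than 65 -> an interaction.
high_risk = (female == 1) & (age > 65)

X = np.column_stack([female, age])
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, high_risk)
print(export_text(tree, feature_names=["female", "age"]))
```

No interaction term was specified anywhere; the tree discovers the age-within-sex split on its own from the data.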
Several recent advances built on decision trees make predictions much more accurate. One method, the “random forest,” fits dozens or hundreds of decision trees on different subsets of the training data and uses voting to combine them into a single model(11). Another method, “boosted decision trees,” iteratively improves the model by increasing the penalty associated with previously misclassified examples(12). One example of these boosted algorithms is extreme gradient boosting (XGBoost), which has become a popular tool for applying ML to tabular data.
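A sketch of the accuracy gains these ensembles can bring, on synthetic data (scikit-learn’s GradientBoostingClassifier stands in here for XGBoost, which is a separate library):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           random_state=0)
scores = {}
for name, model in [
    ("single tree", DecisionTreeClassifier(random_state=0)),
    ("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("boosted trees", GradientBoostingClassifier(random_state=0)),
]:
    # Cross-validated accuracy, so the comparison is on held-out folds.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: accuracy {scores[name]:.2f}")
```

On most tabular problems of this shape, both ensembles comfortably outperform the single tree they are built from.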
Tree-based methods are useful for medical data because they handle missing data well. With electronic health record-based research, the pattern of missing data is often related to a patient’s underlying condition (e.g., sicker patients get more laboratory assessments). Therefore, oftentimes the pattern of the missing data is predictive. Filling in missing values may not be ideal as this association could be lost. Tree-based methods such as gradient boosting machines (GBM) allow missing data to be a decision point in determining if a node can be created. For example, a node may have a rule like “age < 65 years, yes or no”. With a GBM model, the node decisions can also include “is bone density assessment available, yes or no.” This latter assessment may capture some higher-level clinical gestalt that encompasses constructs of age, prior fall risk, and prior history of fracture, etc.
Tree-based models are popular because they are explainable. For example, in a study by Kotti et al., various kinetic parameters were fed into a random forest classifier to diagnose osteoarthritis of the knee and grade its severity(13). Their efforts resulted in a visualized algorithm that can easily be utilized for patient care.
Support Vector Machines
Support vector machines (SVMs) are supervised classification algorithms that partition the dataset into two groups by drawing the decision line that maximizes the separation between the groups(14). In the simplest case, a linear SVM in a dataset with two variables draws a straight decision boundary between the two groups (Figure 2b). The same principle extends to multidimensional space: in a tabular dataset with N variables, an SVM constructs an (N-1)-dimensional “hyperplane” separating the groups formed by all variables(15).
Due to their ability to separate different groups, SVMs are very popular for choosing useful features from a pool of many available features, a process called “feature selection”. As an example, Yu et al. extracted quantitative features from spine ultrasounds and used an SVM to select features that could help identify different lumbar vertebrae(16). They used their model to detect the correct point of insertion of a lumbar puncture needle in 45 out of 46 unseen cases.
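A minimal linear SVM sketch with scikit-learn on synthetic data; the fitted coef_ attribute holds one weight per variable, defining the orientation of the separating hyperplane:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic, well-separated "patient groups" described by 4 variables.
X, y = make_blobs(n_samples=100, centers=2, n_features=4, random_state=0)
svm = SVC(kernel="linear").fit(X, y)

# coef_ is the hyperplane's normal vector: one weight per variable.
print("hyperplane weight shape:", svm.coef_.shape)  # (1, 4)
print("training accuracy:", svm.score(X, y))
```

The magnitudes of those weights are one simple basis for the kind of feature selection Yu et al. describe, though their exact pipeline is more involved.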
Neural Networks
Neural networks, sometimes termed “deep” neural networks, are the family of ML models underlying DL, and they perform well on many ML tasks. Although made popular in recent years by advances in computer vision, neural networks have been applied to tabular datasets for decades. However, their performance on tabular data may be surpassed by newer models such as boosted trees, depending on the specific problem and solution space(17).
Moreover, due to the complexity of neural networks, deep learning approaches are generally less “explainable” than other ML models such as Random Forests.
Unsupervised learning techniques
Unsupervised learning is used to discover naturally occurring patterns within a dataset(18). Unsupervised algorithms are useful for a number of research problems, especially in the early, exploratory stage of a project when the dataset is not entirely understood, where they can reveal useful associations in the data or identify outliers. Unsupervised algorithms should be considered a set of tools to aid the understanding of complex datasets and not a standalone approach. For one thing, unsupervised algorithms need much larger datasets compared to supervised algorithms. This hinders their application in small studies where supervised algorithms might yield better results. Two popular examples of unsupervised machine learning algorithms are cluster analysis, such as K-means clustering, which seeks to divide a set of data into meaningful groups, and principal component analysis (PCA), which aims to combine similar variables to reduce the total number of variables in the dataset. For example, Kruse et al. used an unsupervised clustering algorithm to group patients based on their fracture risk and then studied group characteristics to find specific risk factors from a pool of variables(16).
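A brief sketch combining both techniques on synthetic data (simulated, clearly separated groups; real patient clusters are rarely this clean):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# 300 simulated patient profiles with 10 correlated variables and 3 hidden groups.
X, _ = make_blobs(n_samples=300, centers=3, n_features=10, random_state=0)

# PCA: reduce 10 variables to 2 components that retain most of the variance.
X_reduced = PCA(n_components=2).fit_transform(X)
# K-means: assign each patient to one of 3 clusters, with no labels provided.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print("patients per cluster:", np.bincount(labels))
```

Note that no outcome was ever supplied; interpreting what the clusters mean clinically is the researcher’s job afterward, as in the Kruse et al. example.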
Points for Researchers and Readers
Sample Size
A common question when starting a new data science endeavor is, “How much data will be required?” The simple (and often unsatisfying) answer is that ML algorithms almost always do better with more data. Unfortunately, it can be difficult to predict how many samples are enough for training; there is no direct equivalent to power calculations in ML. A good rule of thumb is that ML algorithms need ten times more events per variable than conventional statistical methods to achieve stable performance(19). Techniques such as data augmentation have been developed to “artificially” boost the sample size of a dataset, but they are not a replacement for an adequate dataset size. Another remedy is aggregating datasets from multiple institutions.
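Under these rules of thumb, a back-of-envelope events-per-variable (EPV) check might look like this (the counts are hypothetical):

```python
n_events = 150        # hypothetical number of outcome events in the dataset
n_variables = 15      # hypothetical number of candidate predictor variables

epv = n_events / n_variables
print(f"events per variable: {epv:.0f}")
print("meets regression rule of thumb (EPV >= 10):", epv >= 10)
print("meets ML rule of thumb (~10x more, EPV >= 100):", epv >= 100)
```

Here a conventional regression would be defensible, but the same dataset would be considered thin for many ML approaches.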
Both ML algorithms and traditional statistical models become more generalizable with balanced datasets. A balanced dataset has roughly equal numbers of samples with and without the outcomes of interest (called “classes”). Many research databases are unbalanced, with one class having tens or even thousands of times more samples than another. Unbalanced classes can be mitigated with approaches like oversampling (duplicating minority-class samples) and undersampling (discarding majority-class samples)(20). These approaches risk introducing bias in training and might accidentally overemphasize features that do not help the training process. There are also more advanced techniques that create synthetic data with the same properties as the underrepresented category to mitigate the effect of a small sample size in a specific group(21).
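A sketch of the naive random oversampling described above, in plain NumPy; as the text cautions, duplicating minority samples this way can bias training and is shown here only to make the mechanics concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 950 + [1] * 50)        # 5% minority class
X = rng.normal(size=(1000, 3))            # placeholder features

minority_rows = np.flatnonzero(y == 1)
# Draw minority rows with replacement until the classes are balanced.
extra = rng.choice(minority_rows, size=900, replace=True)
X_balanced = np.vstack([X, X[extra]])
y_balanced = np.concatenate([y, y[extra]])
print("class counts after oversampling:", np.bincount(y_balanced))  # [950 950]
```

Crucially, any such resampling must be applied only to the training split, never to the test set, or the evaluation no longer reflects the real class distribution.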
Data Splitting for Model Optimization
ML algorithms can easily pick up noise in the data and often learn noise patterns rather than the actual associations among the researched variables. This phenomenon is called overfitting. An overfit model shows near-perfect performance on the training data, which is misleading and does not generalize to unseen data. To detect and avoid overfitting, we must test whether a model performs well on a separate set of data that it did not see during training. This sample is called the “test set.”
One of the steps a researcher must take before training an ML model is to split the data into training and testing sets(22). A straightforward approach is to randomly set aside a percentage of the data for testing. Though this might seem convenient, the researcher must pay careful attention that the sets have a similar distribution of variables; for example, around 5% of the samples have the desired outcome in both sets. Another concern during data splitting is ensuring there is no data leakage between the two sets. A common source of leakage is a single patient with multiple encounters whose records end up in both sets. For example, a patient might have several joint infections during a study developing a risk score for joint infection; the researcher should ensure that all the infection instances from this patient are either in the training set or in the test set, not both.
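One way to enforce this patient-level separation is a group-aware splitter. This sketch uses scikit-learn’s GroupShuffleSplit on synthetic encounters (three per hypothetical patient), which keeps all of a patient’s rows on the same side of the split:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
patient_id = np.repeat(np.arange(100), 3)   # 100 patients, 3 encounters each
X = rng.normal(size=(300, 4))
y = rng.integers(0, 2, 300)

# Split by patient, not by row, so no patient straddles the two sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))

overlap = set(patient_id[train_idx]) & set(patient_id[test_idx])
print("patients appearing in both sets:", len(overlap))  # 0
```

A plain row-level train_test_split on the same data would almost certainly place some patients in both sets.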
Model Tuning
Each ML model has settings, called hyperparameters, that can be tuned and optimized. After many experiments, a researcher might gain intuition on how changing a hyperparameter affects their results. Perhaps counterintuitively, one of the most effective tuning approaches is simply to try random combinations of hyperparameters and keep whichever yields the best results(23). Hyperparameters provide extra flexibility to a model, which might, unfortunately, lead it to learn ungeneralizable features. This creates the need for another set of data on which to see the results of tuning and choose the best combination: the validation set. The same rules mentioned for splitting the test set also apply to the validation set. Researchers should pay extra attention to tuning model hyperparameters because this step can markedly affect final model performance. Only after deciding on the best hyperparameters should the researcher evaluate the model’s performance on the test set. It is crucial to refrain from optimizing the model’s hyperparameters based on its performance on the test set.
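A sketch of this workflow with scikit-learn’s RandomizedSearchCV on synthetic data: random hyperparameter combinations are scored on internal validation folds drawn from the training set, and the untouched test set is scored exactly once at the end (the hyperparameter ranges are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_depth": [2, 4, 6, 8, None],
                         "min_samples_leaf": [1, 2, 5, 10, 20]},
    n_iter=10, cv=3, random_state=0)
search.fit(X_train, y_train)      # tuning sees only the training data

print("best hyperparameters:", search.best_params_)
print("final test accuracy:", round(search.score(X_test, y_test), 2))
```

Because the test set played no role in choosing the hyperparameters, its single final score is an honest estimate of generalization.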
Generalizability
One of the major challenges in applying machine learning to orthopaedic surgery research is the issue of data quality, availability, and generalizability. Machine learning algorithms rely on large amounts of high-quality data to train and validate their models. However, obtaining such data can be difficult due to issues such as missing or incomplete data, inconsistent data entry, and privacy concerns. In addition, data collected from one institution or population may not be representative of other institutions or populations, limiting the generalizability of the results.
To address these challenges, researchers must take steps to ensure that their data is of high quality and representative of the population they are studying. This may involve cleaning and preprocessing the data to remove errors and inconsistencies, as well as collecting data from multiple sources to increase its representativeness. In addition, researchers must carefully consider the limitations of their data when interpreting their results. More importantly, the results of a study should be validated on data from other institutions to ensure model generalizability.
Using Statistical Methods
All of the machine learning techniques discussed in this paper are inseparable from “classical” statistical methods, and researchers should use classical statistics to augment ML methods. Regression-based statistical models make a great baseline against which to compare ML algorithms’ performance. One advantage of starting with machine learning approaches is that they may give insight into important variables and into the highest statistical performance obtainable from the dataset. These metrics provide excellent context for overall performance and stimulate ideas on how simpler models may be considered. Thus, a more parsimonious and perhaps more explainable model can be motivated by the machine learning approaches, ensuring easier integration into practice.
Importantly, as conventional statistical models are more explainable, if they achieve similar performance, they are, in fact, superior to ML algorithms(24). Moreover, one should utilize statistical analysis to compare performance between different models (ML- or regression-based) and report the significance of their findings. These cases highlight the need for the use of statistics in ML research. Indeed, we call upon all researchers in orthopaedic surgery reporting ML analysis of tabular data to include a comparison to traditional statistics so that reviewers and readers have an informed basis to understand whether the ML approach is providing an advantage. This is particularly important during the rapid adoption of ML techniques in orthopaedic surgery research and while reviewers and readers take on the learning curve of this new field of data science.
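A sketch of such a side-by-side report on synthetic data (with real tabular clinical data, the logistic baseline is often competitive, in line with reference 24):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
results = {}
for name, model in [
    ("logistic regression (baseline)", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(random_state=0)),
]:
    # Cross-validated AUC for both models on identical folds.
    results[name] = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC {results[name]:.2f}")
```

Reporting both numbers, with an appropriate statistical comparison between them, lets readers judge whether the ML model earns its extra complexity.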
Conclusion
The emergence of big data stored in EHR unlocks the ability to use more sophisticated ML algorithms to analyze complex relationships that underlie our health. In this review, we have highlighted several ML algorithms that perform well on tabular data in medicine and orthopaedics, including support vector machines, random forests, gradient boosting algorithms, and deep neural networks. We have also discussed some of the basic principles that are important for ensuring success in an ML project, including ensuring an adequate sample size, techniques for evaluating the dataset, and the importance of including classical statistics in the research approach.
Footnotes
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
References
- 1. Murdoch TB, Detsky AS. The inevitable application of big data to health care. JAMA. 2013;309(13):1351–2. doi:10.1001/jama.2013.393
- 2. Obermeyer Z, Emanuel EJ. Predicting the Future - Big Data, Machine Learning, and Clinical Medicine. N Engl J Med. 2016;375(13):1216–9. doi:10.1056/NEJMp1606181
- 3. Berbari EF, Osmon DR, Lahr B, Eckel-Passow JE, Tsaras G, Hanssen AD, et al. The Mayo prosthetic joint infection risk score: implication for surgical site infection reporting and risk stratification. Infect Control Hosp Epidemiol. 2012;33(8):774–81. doi:10.1086/666641
- 4. Khosravi B, Rouzrokh P, Maradit Kremers H, Larson DR, Johnson QJ, Faghani S, et al. Patient-specific Hip Arthroplasty Dislocation Risk Calculator: An Explainable Multimodal Machine Learning-based Approach. Radiol Artif Intell. 2022;4(6):e220067. doi:10.1148/ryai.220067
- 5. Wyles CC, Maradit-Kremers H, Fruth KM, Larson DR, Khosravi B, Rouzrokh P, et al. Frank Stinchfield Award: Creation of a patient-specific total hip arthroplasty periprosthetic fracture risk calculator. J Arthroplasty. 2023. doi:10.1016/j.arth.2023.03.031
- 6. Wyles CC, Maradit-Kremers H, Larson DR, Lewallen DG, Taunton MJ, Trousdale RT, et al. Creation of a Total Hip Arthroplasty Patient-Specific Dislocation Risk Calculator. JBJS. 2022;104(12):1068. doi:10.2106/JBJS.21.01171
- 7. Riley RD, Snell KI, Ensor J, Burke DL, Harrell FE Jr, Moons KG, et al. Minimum sample size for developing a multivariable prediction model: PART II - binary and time-to-event outcomes. Stat Med. 2019;38(7):1276–96. doi:10.1002/sim.7992
- 8. Goldstein BA, Navar AM, Carter RE. Moving beyond regression techniques in cardiovascular risk prediction: applying machine learning to address analytic challenges. Eur Heart J. 2017;38(23):1805–14. doi:10.1093/eurheartj/ehw302
- 9. Akinleye SD, Garofolo G, Culbertson MD, Homel P, Erez O. The Role of BMI in Hip Fracture Surgery. Geriatr Orthop Surg Rehabil. 2018;9:2151458517747414. doi:10.1177/2151458517747414
- 10. Box GEP. Science and Statistics. J Am Stat Assoc. 1976;71(356):791–9. doi:10.1080/01621459.1976.10480949
- 11. Ho TK. Random decision forests. In: Proceedings of 3rd International Conference on Document Analysis and Recognition. 1995. p. 278–82. doi:10.1109/ICDAR.1995.598994
- 12. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media; 2013.
- 13. Kotti M, Duffell LD, Faisal AA, McGregor AH. Detecting knee osteoarthritis and its discriminating parameters using random forests. Med Eng Phys. 2017;43:19–29. doi:10.1016/j.medengphy.2017.02.004
- 14. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97. doi:10.1007/BF00994018
- 15. Joachims T. Text categorization with Support Vector Machines: Learning with many relevant features. In: Machine Learning: ECML-98. Springer Berlin Heidelberg; 1998. p. 137–42. doi:10.1007/BFb0026683
- 16. Yu S, Tan KK, Sng BL, Li S, Sia ATH. Lumbar Ultrasound Image Feature Extraction and Classification with Support Vector Machine. Ultrasound Med Biol. 2015;41(10):2677–89. doi:10.1016/j.ultrasmedbio.2015.05.015
- 17. Shwartz-Ziv R, Armon A. Tabular Data: Deep Learning is Not All You Need. arXiv [cs.LG]. 2021. http://arxiv.org/abs/2106.03253
- 18. Hinton G, Sejnowski TJ. Unsupervised Learning: Foundations of Neural Computation. MIT Press; 1999.
- 19. van der Ploeg T, Austin PC, Steyerberg EW. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. BMC Med Res Methodol. 2014;14:137. doi:10.1186/1471-2288-14-137
- 20. van den Goorbergh R, van Smeden M, Timmerman D, Van Calster B. The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression. J Am Med Inform Assoc. 2022. doi:10.1093/jamia/ocac093
- 21. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic Minority Over-sampling Technique. J Artif Intell Res. 2002;16:321–57.
- 22. Cawley GC, Talbot NLC. On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res. 2010;11:2079–107.
- 23. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res. 2012;13:281–305.
- 24. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12–22. doi:10.1016/j.jclinepi.2019.02.004