2021 Oct 7;1(1):e28. doi: 10.1017/ash.2021.192

Table 1.

Relevant Machine Learning Terms

| Term | Definition |
| --- | --- |
| Artificial intelligence (AI) | A computer’s ability to learn from experience |
| Machine learning (ML) | A type of artificial intelligence in which computers draw conclusions from data without being directly programmed |
| Supervised learning | Models in which the outcome is known for each observation |
| Unsupervised learning | Models in which the outcome is not known for each observation |
| Semisupervised learning | Models in which the outcome is known for some observations but not others |
| Label | The patient outcome (dependent variable) |
| Feature | An attribute/characteristic of the patient (independent variable) |
| Sensitivity | Ability of a model to correctly identify true cases |
| Specificity | Ability of a model to correctly identify negative cases |
| Accuracy | Number of correctly labeled data instances divided by the total number of instances |
| Precision | Fraction of relevant instances among the retrieved instances (ie, positive predictive value) |
| Recall | Fraction of all relevant instances that were retrieved (ie, sensitivity) |
| Training data set | Data used to develop a model |
| Validation data set | Data used to test a model’s performance while training |
| Test data set | Data used to test the model’s accuracy, precision, or recall against real-world data |
| Out-of-sample data | In a study cohort, the data not used as training data |
| Bias-variance tradeoff | In supervised learning, the balance between underfitting (high bias) and overfitting (high variance), either of which can result in loss of performance |
| Bias | Difference between the average prediction of a model and the correct value |
| Variance | Variability of a model prediction for a given data point |
| Overfitting | When the model follows noise, resulting in low bias and high variance |
| Noise | Nonpredictive features in the data set |
| Underfitting | When the model fails to capture the underlying patterns in the data, resulting in low variance and high bias |
| Decision tree | A model that separates data into smaller and smaller partitions until each observation is classified according to the outcome of interest |
| Stopping criteria | Criteria used to stop further partitioning of data in a decision tree; can prevent overfitting |
| Ensemble model | An ML technique combining multiple individual models |
| Random forest | A type of ensemble model that combines decision trees to produce a probabilistic prediction for the outcome |
| Receiver operating characteristic (ROC) curve | A graph of a model’s sensitivity against 1 minus specificity (the false-positive rate) across classification thresholds |
| Area under the curve (AUC) | A technique to compare model results (with other models or other measurement tools) by calculating the area under an ROC curve |
| Natural language processing (NLP) | A type of AI in which the algorithm learns how to ‘understand’ language, including contextual nuances |
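
The performance terms defined in Table 1 can be made concrete with a short worked example. The Python sketch below is illustrative only and is not taken from the article: the ten labels and risk scores are invented toy values, the 0.5 classification threshold is an assumption, and the function names (`confusion_counts`, `classification_metrics`, `roc_auc`) are hypothetical. It computes sensitivity (recall), specificity, accuracy, and precision from true and predicted labels, and it estimates the AUC as the probability that a randomly chosen case receives a higher score than a randomly chosen non-case, which is equivalent to the area under the ROC curve.

```python
def confusion_counts(labels, predictions):
    """Count true/false positives and negatives (1 = case, 0 = non-case)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(labels, predictions):
    """Return the Table 1 performance measures for binary predictions.

    Assumes at least one case, one non-case, and one positive prediction,
    so that no denominator is zero.
    """
    tp, tn, fp, fn = confusion_counts(labels, predictions)
    return {
        "sensitivity": tp / (tp + fn),   # same quantity as recall
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),     # positive predictive value
    }


def roc_auc(labels, scores):
    """AUC as the fraction of (case, non-case) pairs in which the case
    has the higher score; tied scores count as half a pair."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    non_cases = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if c > n else 0.5 if c == n else 0.0
        for c in cases
        for n in non_cases
    )
    return wins / (len(cases) * len(non_cases))


# Toy data (invented for illustration): 4 cases and 6 non-cases.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1, 0.2, 0.4, 0.35, 0.6, 0.65]

# Assumed threshold: a score of 0.5 or higher is predicted to be a case.
predictions = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(labels, predictions)
auc = roc_auc(labels, scores)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"AUC: {auc:.3f}")
```

With these toy values, 3 of the 4 cases and 4 of the 6 non-cases are classified correctly, giving a sensitivity of 0.75, a specificity of about 0.67, an accuracy of 0.70, and a precision of 0.60. The AUC of about 0.83 does not depend on the chosen threshold, which is why it is commonly used to compare models.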