2021 Oct 7;1(1):e28. doi: 10.1017/ash.2021.192

Table 1.

Relevant Machine Learning Terms

Term	Definition
Artificial intelligence (AI)	A computer’s ability to learn from experience
Machine learning (ML)	A type of artificial intelligence in which computers draw conclusions from data without being directly programmed
Supervised learning	Models in which the outcome is known for each observation
Unsupervised learning	Models in which the outcome is not known for each observation
Semisupervised learning	Models in which the outcome is known for some observations but not others
Label	The patient outcome (dependent variable)
Feature	An attribute/characteristic of the patient (independent variable)
Sensitivity	Ability of a model to correctly identify true cases
Specificity	Ability of a model to correctly identify negative cases
Accuracy	Proportion of correctly labeled data instances out of the total number of instances
Precision	Fraction of relevant instances among the retrieved instances (ie, positive predictive value)
Recall	Fraction of relevant instances that were retrieved correctly (ie, sensitivity)
Training data set	Data used to develop a model
Validation data set	Data used to test a model’s performance while training
Test data set	Data used to test the accuracy, precision, or recall against real-world data
Out-of-sample data	In a study cohort, the data not used as training data
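Bias-variance tradeoff	In supervised learning, the balance between underfitting (high bias) and overfitting (high variance); reducing one tends to increase the other, and either extreme can result in loss of performance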
Bias	Difference between the average prediction of a model and the correct value
Variance	Variability of a model prediction for a given data point
Overfitting	When the model follows noise, resulting in low bias and high variance
Noise	Nonpredictive features in the data set
Underfitting	When the model fails to capture the underlying patterns in the data, resulting in low variance and high bias
Decision tree	A model that separates data into smaller and smaller partitions until each observation is classified according to the outcome of interest
Stopping criteria	Criteria used to stop further partitioning of data in a decision tree. Can prevent overfitting
Ensemble model	An ML technique combining multiple individual models
Random forest	A type of ensemble model that combines decision trees to produce a probabilistic prediction for the outcome
Receiver operator characteristic (ROC) curve	A graph of a model's sensitivity against 1 − specificity (the false-positive rate) across classification thresholds
Area under the curve (AUC)	A technique to compare model results (with other models or other measurement tools) by calculating the area under an ROC curve
Natural language processing (NLP)	A type of AI in which the algorithm learns how to ‘understand’ language, including contextual nuances
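
The definitions of sensitivity, specificity, accuracy, precision, and recall in Table 1 can all be expressed in terms of the four cells of a binary confusion matrix. The following is a minimal sketch in Python, assuming outcomes are coded as 1 (case) and 0 (noncase); the function names `confusion_counts` and `classification_metrics` and the example labels are hypothetical and chosen only for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = case, 0 = noncase)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases predicted as cases
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # noncases predicted as noncases
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # noncases predicted as cases
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases predicted as noncases
    return tp, tn, fp, fn


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def classification_metrics(y_true, y_pred):
    """Compute the Table 1 performance measures from observed and predicted labels."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = _ratio(tp, tp + fn)
    return {
        "sensitivity": sensitivity,                        # true cases correctly identified
        "specificity": _ratio(tn, tn + fp),                # negative cases correctly identified
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),    # correct labels over all instances
        "precision": _ratio(tp, tp + fp),                  # positive predictive value
        "recall": sensitivity,                             # same quantity as sensitivity
    }


if __name__ == "__main__":
    # Illustrative (made-up) labels: 4 cases and 6 noncases.
    observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    for name, value in classification_metrics(observed, predicted).items():
        print(f"{name}: {value:.3f}")
```

In this made-up example, 3 of the 4 true cases are flagged, so sensitivity (and recall) is 0.75, while 1 false positive among 4 positive predictions gives a precision of 0.75 as well. Returning NaN for an empty denominator is a design choice of this sketch, not something specified in the table.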