Table 2.

A glossary of terms used in data science and machine learning for public health researchers and practitioners

Data Science Term	Related Public Health Research Term or Concept
Accuracy	Proportion of results correctly classified, i.e. (true positives + true negatives) divided by the total number of results predicted
Data mining	Exploratory analysis
Ensemble learning	A machine learning approach involving training multiple models on data subsets and combining results from these models when predicting for unobserved inputs
Features	Measurements recorded for each observation (e.g. participant age, sex, and BMI are each features)
Label	Observed or computed value of an outcome or other variable of interest
Labeling	The process of setting a label for a variable, as opposed to leaving the variable’s value unknown
Learning algorithm	The set of steps used to train a model automatically from a dataset (not to be confused with the model itself; e.g. there are many algorithms for training a neural network, each with different bounds on time, memory, and accuracy)
Natural language	Working with words as data, as in qualitative or mixed-methods research (generally, human-readable but not readily machine-readable)
Noisy labels	Measurement error
Out-of-sample	Applying a model fitted to one dataset to make predictions in another
Overfitting	Fitting a model to random noise or error instead of the actual relationship (due to having either a small number of observations or a large number of parameters relative to the number of observations)
Pipeline	(From bioinformatics) The ordered set of tools applied to a dataset to move it from its raw state to a final interpretable analytic result
Precision	Positive predictive value
Recall	Sensitivity
Semi-supervised learning	An analytic technique used to fit predictive models to datasets in which many observations are missing outcome data
Small-n, large-p	A wide but short dataset: n = number of observations, p = number of variables for each observation
Supervised learning	An analytic technique in which patterns in covariates that are correlated with observed outcomes are exploited to predict outcomes in one or more datasets in which the covariates were observed but the outcome was not. For example, linear regression and logistic regression are both supervised learning techniques.
Test dataset	A subset of a more complete dataset used to test empirical performance of an algorithm trained on a training dataset
Training	Fitting a model
Training dataset	A subset of a more complete dataset used to train a model whose empirical performance can be tested on a test dataset
Unsupervised learning	An analytic technique in which data is automatically explored to identify patterns, without reference to outcome information. Latent class analysis (when used without covariates) and k-means clustering are unsupervised learning techniques.
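
As a minimal illustration of the Accuracy, Precision, and Recall entries above, the following Python sketch computes all three from binary labels and predictions. The function names (`confusion_counts`, `classification_metrics`) and the example data are hypothetical and chosen only for illustration; the sketch assumes outcomes are coded 1 (positive) and 0 (negative).

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:  # truth == 1 and pred == 0
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision (PPV), and recall (sensitivity)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        # (true positives + true negatives) / total predictions
        "accuracy": (tp + tn) / total,
        # positive predictive value; undefined if nothing was predicted positive
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        # sensitivity; undefined if there are no true positive cases
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
    }


# Hypothetical data: 10 participants, 4 with the outcome (label 1).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this made-up example there are 2 true positives, 1 false positive, 5 true negatives, and 2 false negatives, so accuracy is 7/10, precision (positive predictive value) is 2/3, and recall (sensitivity) is 2/4. The three measures can therefore diverge on the same set of predictions.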