Author manuscript; available in PMC: 2019 Apr 1.
Published in final edited form as: Annu Rev Public Health. 2017 Dec 20;39:95–112. doi: 10.1146/annurev-publhealth-040617-014208

Table 2.

A glossary of terms used in data science and machine learning for public health researchers and practitioners

| Data Science Term | Related Public Health Research Term or Concept |
| --- | --- |
| Accuracy | Proportion of results correctly classified (i.e., true positives plus true negatives, divided by the total number of results predicted) |
| Data mining | Exploratory analysis |
| Ensemble learning | A machine learning approach involving training multiple models on data subsets and combining results from these models when predicting for unobserved inputs |
| Features | Measurements recorded for each observation (e.g., participant age, sex, and BMI are each features) |
| Label | Observed or computed value of an outcome or other variable of interest |
| Labeling | The process of setting a label for a variable, as opposed to leaving the variable’s value unknown |
| Learning algorithm | The set of steps used to train a model automatically from a dataset (not to be confused with the model itself; e.g., there are many algorithms to train a neural network, each with different bounds on time, memory, and accuracy) |
| Natural language | Working with words as data, as in qualitative or mixed-methods research (generally, human-readable but not readily machine-readable) |
| Noisy labels | Measurement error |
| Out-of-sample | Applying a model fitted to one dataset to make predictions in another |
| Overfitting | Fitting a model to random noise or error instead of the actual relationship (due to either a small number of observations or a large number of parameters relative to the number of observations) |
| Pipeline | (From bioinformatics) The ordered set of tools applied to a dataset to move it from its raw state to a final interpretable analytic result |
| Precision | Positive predictive value |
| Recall | Sensitivity |
| Semi-supervised learning | An analytic technique used to fit predictive models to data where many observations are missing outcome data |
| Small-n, large-p | A wide but short dataset: n = number of observations, p = number of variables for each observation |
| Supervised learning | An analytic technique in which patterns in covariates that are correlated with observed outcomes are exploited to predict outcomes in one or more datasets in which the covariates were observed but the outcome was not. For example, linear regression and logistic regression are both supervised learning techniques. |
| Test dataset | A subset of a more complete dataset used to test the empirical performance of an algorithm trained on a training dataset |
| Training | Fitting a model |
| Training dataset | A subset of a more complete dataset used to train a model whose empirical performance can be tested on a test dataset |
| Unsupervised learning | An analytic technique in which data are automatically explored to identify patterns, without reference to outcome information. Latent class analysis (when used without covariates) and k-means clustering are unsupervised learning techniques. |
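
The glossary entries for accuracy, precision (positive predictive value), and recall (sensitivity) can be illustrated with a short calculation. The following is a minimal sketch in Python, not part of the original table: the function names and the ten labels and predictions are hypothetical, chosen only to show how the three quantities are derived from the same counts of true and false positives and negatives.

```python
# Hypothetical illustration: binary labels are coded 1 (positive) and 0 (negative).

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for paired observed labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Proportion of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Positive predictive value; assumes at least one positive prediction."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Sensitivity; assumes at least one observed positive label."""
    return tp / (tp + fn)

# Made-up observed labels and model predictions for ten observations.
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
ppv = precision(tp, fp)
sens = recall(tp, fn)

print(f"accuracy={acc:.2f} precision={ppv:.2f} recall={sens:.2f}")
# prints: accuracy=0.80 precision=0.75 recall=0.75
```

In this made-up example, 8 of 10 predictions are correct (accuracy 0.80), 3 of the 4 positive predictions are true positives (precision 0.75), and 3 of the 4 observed positives are detected (recall 0.75).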