Accuracy |
Proportion of results correctly classified, i.e. (true positives + true negatives) divided by the total number of results predicted |
Data mining |
Exploratory analysis |
Ensemble learning |
A machine learning approach involving training multiple models on data subsets and combining results from these models when predicting for unobserved inputs |
Features |
Measurements recorded for each observation (e.g. participant age, sex, and BMI are each features) |
Label |
Observed or computed value of an outcome or other variable of interest |
Labeling |
The process of setting a label for a variable, as opposed to leaving the variable’s value unknown |
Learning algorithm |
The set of steps used to train a model automatically from a data set (not to be confused with the model itself; e.g. there are many algorithms to train a neural network, each with different bounds on time, memory, and accuracy) |
Natural language |
Working with words as data, as in qualitative or mixed-methods research (generally, human-readable but not readily machine-readable) |
Noisy labels |
Measurement error |
Out-of-sample |
Applying a model fitted to one dataset to make predictions in another |
Overfitting |
Fitting a model to random noise or error instead of the actual relationship (due either to a small number of observations or to a large number of parameters relative to the number of observations) |
Pipeline |
(From bioinformatics) The ordered set of tools applied to a dataset to move it from its raw state to a final interpretable analytic result |
Precision |
Positive predictive value |
Recall |
Sensitivity |
Semi-supervised learning |
An analytic technique used to fit predictive models to data where many observations are missing outcome data. |
Small-n, large-p |
A wide but short dataset: n = number of observations, p = number of variables for each observation |
Supervised learning |
An analytic technique in which patterns in covariates that are correlated with observed outcomes are exploited to predict outcomes in a data set or sets in which the covariates were observed but the outcome was not. For example, linear regression and logistic regression are both supervised learning techniques. |
Test dataset |
A subset of a more complete dataset used to test empirical performance of an algorithm trained on a training dataset |
Training |
Fitting a model |
Training dataset |
A subset of a more complete dataset used to train a model whose empirical performance can be tested on a test dataset |
Unsupervised learning |
An analytic technique in which data is automatically explored to identify patterns, without reference to outcome information. Latent class analysis (when used without covariates) and k-means clustering are unsupervised learning techniques. |
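
The entries for accuracy, precision (positive predictive value), and recall (sensitivity) can be illustrated with a small worked example. The following is a minimal sketch, not part of the original glossary: the function name `classification_metrics` and the counts used are hypothetical and chosen only for illustration.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts.

    tp, fp, tn, fn are the numbers of true positives, false positives,
    true negatives, and false negatives (hypothetical inputs).
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one prediction is required")

    # Accuracy: (true positives + true negatives) / all predictions.
    accuracy = (tp + tn) / total

    # Precision (positive predictive value): share of predicted positives
    # that are truly positive. Undefined if nothing was predicted positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else math.nan

    # Recall (sensitivity): share of true positives that were detected.
    # Undefined if there are no actual positives.
    recall = tp / (tp + fn) if (tp + fn) > 0 else math.nan

    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Illustrative counts only: 100 predictions in total.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy is 85/100, precision is 40/50, and recall is 40/45, which shows how the three measures can differ for the same set of predictions.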