Author manuscript; available in PMC: 2022 Jan 10.
Published in final edited form as: Biol Psychiatry Cogn Neurosci Neuroimaging. 2019 Nov 27;5(8):791–798. doi: 10.1016/j.bpsc.2019.11.007

Figure 1. Overview of machine learning components and procedures.

Different datasets serve distinct purposes in machine learning approaches. The training set is the group of individuals used to build a classifier. The testing set is a group of individuals collected at the same time as the training set, but kept completely separate from training. Reporting performance in a testing set can provide evidence of classifier generalizability. The validation set is a group of individuals collected separately from the training set (e.g., different site, different scanner, separate study). Reporting performance in a validation set provides additional evidence of classifier validity. Each of these datasets comprises features and labels. Features are the multivariate data that, in aggregate, are used to build and make predictions with a classifier. Labels are binary, categorical, or continuous characteristics of individuals that are used to train a classifier and are subsequently predicted. Machine learning procedures involve training (red box) and testing (blue box). Training identifies relationships between multivariate features (e.g., functional connections) and subject labels (e.g., patient vs. control) using a learning algorithm (e.g., support vector machines). The patterns of features that best classify individuals in the training set are then weighted and combined in a resulting classifier. Training can also involve feature selection, that is, the data- or hypothesis-driven selection of a reduced set of features. Training procedures should only be performed in the training set (and separately within each fold of cross-validation). Testing involves applying the trained classifier to new individuals never used in training. Commonly, classifiers are assessed using k-fold cross-validation. For each fold, a portion of individuals is left out of the training set (the left-out set) and a classifier is built using the remaining individuals in the training set. The trained classifier is then used to classify the left-out set of individuals and, if available, the independent testing and validation sets. Cross-validation can assess whether the performance and feature weights of a classifier depend upon which individuals are in the training set.
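
The k-fold procedure described in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular study: the data are simulated, all function names (`make_data`, `select_features`, `train_centroids`, `predict`, `k_fold_accuracy`) are hypothetical, and a simple nearest-centroid classifier stands in for a learning algorithm such as a support vector machine. The point it shows is that feature selection and training are repeated within each fold using only the individuals remaining in the training set, and that the left-out set is used only for testing.

```python
import random
from statistics import mean


def make_data(n=60, n_features=20, n_informative=3, seed=0):
    """Simulate subjects (assumption: toy data, not real imaging features).
    Labels alternate 0/1; only the first n_informative features differ by group."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(1.5 * label if j < n_informative else 0.0, 1.0)
               for j in range(n_features)]
        X.append(row)
        y.append(label)
    return X, y


def select_features(X, y, k):
    """Data-driven feature selection: keep the k features with the largest
    absolute difference in group means. Call this on training data only."""
    scores = []
    for j in range(len(X[0])):
        m1 = mean(row[j] for row, lab in zip(X, y) if lab == 1)
        m0 = mean(row[j] for row, lab in zip(X, y) if lab == 0)
        scores.append((abs(m1 - m0), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def train_centroids(X, y, feats):
    """Stand-in learning algorithm: one mean vector per label."""
    return {lab: [mean(row[j] for row, l in zip(X, y) if l == lab) for j in feats]
            for lab in set(y)}


def predict(centroids, row, feats):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((row[j] - c) ** 2 for j, c in zip(feats, centroid))
    return min(centroids, key=lambda lab: dist(centroids[lab]))


def k_fold_accuracy(X, y, n_folds=5, k_features=3, seed=0):
    """Return the accuracy on the left-out set for each fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    accuracies = []
    for left_out in folds:
        held = set(left_out)
        train = [i for i in idx if i not in held]
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        # Feature selection and training see only this fold's training individuals.
        feats = select_features(X_train, y_train, k_features)
        model = train_centroids(X_train, y_train, feats)
        # Testing: classify the left-out individuals never used in this fold's training.
        correct = sum(predict(model, X[i], feats) == y[i] for i in left_out)
        accuracies.append(correct / len(left_out))
    return accuracies


if __name__ == "__main__":
    X, y = make_data()
    per_fold = k_fold_accuracy(X, y)
    print("accuracy per fold:", per_fold)
    print("mean accuracy:", mean(per_fold))
```

Comparing the per-fold accuracies, and the features chosen in each fold, gives a rough sense of how much the classifier depends on which individuals are in the training set. An independent testing or validation set, if available, would be classified with a model trained in the same way and never used for feature selection.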