Regression analysis
Estimates the expected value of a dependent variable from one or more independent variables in a subject sample, using training data. Depending on the type of regression, the dependent variable can be dichotomous (e.g. logistic regression) or continuous (e.g. linear regression).
Every type of regression carries assumptions, for instance about the linear or nonlinear relationship between the dependent and independent variables; these assumptions must be understood before the method is applied
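A minimal sketch of fitting a linear regression for a continuous outcome follows. The table names no tools, so scikit-learn and the synthetic two-variable dataset are illustrative assumptions.

```python
# Minimal sketch: linear regression for a continuous dependent variable.
# scikit-learn and the synthetic data are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # two independent variables
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)           # assumes a linear relationship
print(model.coef_, model.intercept_)           # weights recovered ~ [3.0, -1.5]
print(model.predict(X[:5]))                    # expected values for new subjects
```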
Decision tree learning |
Uses decision trees to classify data. The algorithm determines the most informative attribute for a given set of observations and splits the dataset on that attribute (a “divide-and-conquer” strategy); the process is repeated recursively.
Overfitting is common; prevented by “pruning” algorithms |
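A minimal sketch of a pruned tree, assuming scikit-learn; the iris dataset stands in for real observations, and max_depth and ccp_alpha are illustrative pruning controls.

```python
# Minimal sketch: decision tree with pruning to curb overfitting.
# scikit-learn, the iris data, and the parameter values are assumptions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
# max_depth limits the recursion; ccp_alpha applies cost-complexity pruning.
tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01).fit(X, y)
print(export_text(tree))   # each split uses the most informative attribute
```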
Support Vector Machine (SVM) |
Based on convex optimization: subjects are classified by a separating hyperplane chosen to maximize the “distance” between the observations and the hyperplane (the hyperplane margin).
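A minimal sketch of a linear-kernel SVM, assuming scikit-learn; the two synthetic classes are illustrative. The fitted hyperplane maximizes the margin, and the support vectors are the observations that define it.

```python
# Minimal sketch: linear SVM; the separating hyperplane maximizes the margin.
# scikit-learn and the synthetic two-class data are assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear").fit(X, y)
print(clf.coef_, clf.intercept_)     # hyperplane parameters
print(clf.support_vectors_[:3])      # observations that define the margin
```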
|
k-Nearest Neighbor (kNN) |
Given an unlabeled (test) observation, kNN looks for the k most similar observations in the training cohort; both k and the definition of “similarity” are defined by the user. The output is the most represented class among those k labeled neighbors.
Relatively simple to construct |
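A minimal sketch of kNN from first principles (numpy only); Euclidean distance and k=3 are the user choices assumed here, and the toy labels are illustrative.

```python
# Minimal sketch: kNN classification; k and the similarity measure
# (Euclidean distance here) are user-defined choices.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    dists = np.linalg.norm(X_train - x_test, axis=1)       # "similarity"
    nearest = np.argsort(dists)[:k]                        # k closest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority class

X_train = np.array([[0, 0], [0, 1], [5, 5], [6, 5]])
y_train = np.array(["healthy", "healthy", "disease", "disease"])
print(knn_predict(X_train, y_train, np.array([5, 4])))     # -> "disease"
```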
k-means clustering |
Iterative process used to partition data into k clusters. Clusters are initiated by picking k centroids, or cluster focal points. Each iteration assigns every data point to the “closest” centroid (closeness is user-defined), then moves each centroid to the geometric center (mean) of its newly formed cluster.
Simple to understand and execute. Sensitive to initialization and therefore may change with every execution. The algorithm may fail if clusters aren’t distinct when the process is complete. The optimal k is often found by trial and error
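A minimal sketch of the k-means iteration in numpy: assign points to the nearest centroid, then move each centroid to the mean of its cluster. Euclidean closeness, the iteration count, and the synthetic data are assumptions; empty clusters are not handled, matching the failure mode noted above.

```python
# Minimal sketch: k-means with random initialization (empty clusters not
# handled). The data and parameter choices are illustrative assumptions.
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # random initialization
    for _ in range(n_iter):
        # Assignment step: Euclidean "closeness" (one common user choice).
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids, axis=2), axis=1)
        # Update step: each centroid moves to the geometric center of its cluster.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

X = np.vstack([np.random.default_rng(1).normal(m, 0.5, (30, 2)) for m in (0, 5)])
labels, centroids = kmeans(X, k=2)
print(centroids)   # results can vary with initialization
```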
Artificial neural networks (ANN) |
Models simulate brain organization. “Neurons” (nodes) receive weighted inputs and emit an output through a transfer (activation) function. Groups of these building blocks form a network. Training data adjust the input weights and build or destroy connections.
Exhibit complex/nonlinear behavior determined by the connection network. Can be used for supervised (involving expert-labeled data) or unsupervised (automated) learning
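A minimal sketch of a small feed-forward network, assuming scikit-learn's MLPClassifier; the layer sizes, activation, and two-moons dataset are illustrative. Each node applies a weighted sum followed by a transfer function, and training adjusts the weights.

```python
# Minimal sketch: feed-forward neural network on a nonlinear problem.
# scikit-learn, the layer sizes, and the dataset are assumptions.
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(10, 10), activation="relu",
                    max_iter=2000, random_state=0).fit(X, y)
print(net.score(X, y))   # nonlinear boundary captured by the network
```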
Ensemble learning algorithms |
Learns sets of classifiers and merges their outputs. Classifiers are trained independently on specific subsets of the training observations. In boosting, each subsequent training set emphasizes the samples that have been problematic for the models already in the ensemble. Some ensembles (e.g. Random Forest) use a bagging (bootstrap aggregation) approach: separate decision trees are learned from independent samples of the training data, and multiple random samples, of subjects or of attributes, yield the ensemble of models.
Robust; can handle a small number of samples
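A minimal sketch contrasting the two ensemble styles described above, assuming scikit-learn; the breast-cancer dataset and estimator counts are illustrative.

```python
# Minimal sketch: bagging (Random Forest) vs. boosting (AdaBoost).
# scikit-learn, the dataset, and the parameter values are assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# Bagging: each tree sees a bootstrap sample of subjects and a random
# subset of attributes.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
# Boosting: later learners emphasize samples the earlier ones got wrong.
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
for name, model in [("random forest", forest), ("adaboost", boost)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```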