Table 3.
Publicly available R packages for machine learning.
| Algorithm Name | R package/function |
|---|---|
| Decision tree^a | rpart |
| Random forest^b | randomForest |
| k-Nearest neighbour classifier | knn R function contained in the class package |
| Naive Bayes classifier | naiveBayes R function contained in the e1071 package |
| Neural network | nnet R function contained in the nnet package |
| Support vector machine^c | ksvm R function contained in the kernlab package |
| t-Distributed Stochastic Neighbour Embedding (t-SNE) | Rtsne R package, an R implementation of the t-SNE dimensionality reduction procedure |
| GLM (generalized linear model)-based Ordination Method for Microbiome Samples (GOMMS) | gomms R package, an R implementation of the GOMMS ordination method |
| Agglomerative nested (AGNES) clustering and other clustering methods | The agnes R function contained in the cluster package implements agglomerative nesting, and the cluster package provides several other popular clustering methods. Additionally, hclust, which is part of the core R stats package, implements common hierarchical clustering procedures (see the sketch after the table notes). |
^a Notable arguments for the rpart function are `method = "class"`, to build a classification model, and `parms = list(split = "information")`, to use the information gain formula when deciding between alternative splits (the alternative splitting formula is based on the Gini index of diversity).
^b The randomForest function allows the user to vary the number of decision trees grown (`ntree`), the number of variables tried at each split (`mtry`), or both. The randomForest package also contains an extractor function, `importance`, that measures variable importance with respect to a fitted RF model. The out-of-bag (OOB) error estimate can be averaged over multiple runs; it indicates the error rate expected when the resulting model is used to classify new data.
^c The kernlab package provides several kernels (e.g. radial basis 'Gaussian', polynomial, linear, hyperbolic tangent, Laplacian, Bessel, ANOVA RBF and spline) that transform the data into a high-dimensional feature space. It also offers several model types (e.g. C, nu and bound-constraint classification), which determine how the separating hyperplane is found.