2019 Nov 28;5(12):e000317. doi: 10.1099/mgen.0.000317

Table 1.

Basic description of data workflow

Whole-genome sequence quality control, mapping to reference, de novo assembly, annotation
Description	Purpose	Methods
What describes the data?	Feature extraction	SNPs, k-mers, proteins, …
What are the most important descriptors?	Feature selection	Pan and core GWAS, chi-square, recursive feature elimination algorithms, …
Can descriptors be combined/transformed?	Feature transformation	Scaling and centring, PCA, MDS, t-SNE, auto-encoders, …
Group data by underlying similarities	Unsupervised ML	Phylogeny, k-means, hierarchical clustering, …
Find the hidden patterns in a defined class; classify unknown data	Supervised ML	Random forest, neural networks, SVM, k-nearest-neighbour, …
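
The rows of Table 1 can be illustrated with a minimal sketch in plain Python that chains three of the listed methods: k-mer counting (feature extraction), a chi-square score on k-mer presence (feature selection) and k-nearest-neighbour voting (supervised ML). The sequences, class labels, query and function names are hypothetical toy values chosen for illustration, not data or code from the study, and a real workflow would more likely rely on an established library.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k=3):
    """Feature extraction: count overlapping k-mers in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def chi_square(present, labels):
    """Feature selection: chi-square statistic of a 2x2 table
    (k-mer present/absent versus binary class label)."""
    n = len(labels)
    a = sum(1 for p, y in zip(present, labels) if p and y)          # present, class 1
    b = sum(1 for p, y in zip(present, labels) if p and not y)      # present, class 0
    c = sum(1 for p, y in zip(present, labels) if not p and y)      # absent, class 1
    d = n - a - b - c                                               # absent, class 0
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def knn_predict(train_X, train_y, x, k=3):
    """Supervised ML: majority vote among the k nearest training points
    (Euclidean distance)."""
    dists = sorted(
        (sqrt(sum((u - v) ** 2 for u, v in zip(row, x))), y)
        for row, y in zip(train_X, train_y)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy sequences and binary class labels (assumptions).
sequences = ["ATGATGCC", "ATGATGAA", "CCGGCCGG", "CCGGAACC"]
labels = [1, 1, 0, 0]

# 1. Feature extraction: one k-mer count profile per sequence.
profiles = [kmer_counts(s) for s in sequences]
vocabulary = sorted(set().union(*profiles))

# 2. Feature selection: keep the two k-mers with the highest chi-square
#    score; ties are broken alphabetically so the result is deterministic.
scores = {
    kmer: chi_square([p[kmer] > 0 for p in profiles], labels)
    for kmer in vocabulary
}
selected = sorted(vocabulary, key=lambda kmer: (-scores[kmer], kmer))[:2]

# 3. Supervised ML: classify an unseen (hypothetical) sequence.
train_X = [[p[kmer] for kmer in selected] for p in profiles]
query = kmer_counts("TTATGATG")
predicted = knn_predict(train_X, labels, [query[kmer] for kmer in selected])
```

Counting k-mers with `Counter` keeps absent k-mers at zero without extra bookkeeping, which is why the same profile objects can be reused for both the presence test and the final feature vectors.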