Table 1.
Description | Purpose | Methods |
---|---|---|
Whole-genome sequences quality control, mapping to reference, de novo assembly, annotation | | |
What describes the data? | Feature extraction | SNPs, k-mers, proteins, … |
What are the most important descriptors? | Feature selection | Pan and core GWAS, chi-square, recursive feature elimination algorithms, … |
Can descriptors be combined/transformed? | Feature transformation | Scaling and centring, PCA, MDS, t-SNE, auto-encoders, … |
Group data by underlying similarities | Unsupervised ML | Phylogeny, k-means, hierarchical clustering, … |
Find the hidden patterns in a defined class; classify unknown data | Supervised ML | Random forest, neural networks, SVM, k-nearest-neighbour, … |
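
As a rough illustration of how three of the stages in Table 1 can chain together, the sketch below uses k-mer counts for feature extraction, scaling and centring for feature transformation, and a 1-nearest-neighbour classifier for supervised ML. It is a minimal sketch using only the Python standard library, not a recommended pipeline: the toy sequences, the labels, and the function names (`kmer_counts`, `fit_scaler`, `transform`, `predict_1nn`) are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from itertools import product
import math


def kmer_counts(seq, k=2):
    """Feature extraction: count overlapping k-mers over the ACGT alphabet.

    k-mers containing other characters (e.g. N) are ignored.
    """
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get(kmer, 0) for kmer in vocab]


def fit_scaler(rows):
    """Feature transformation: learn per-column mean and standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n) or 1.0
            for col, m in zip(columns, means)]
    return means, stds


def transform(row, means, stds):
    """Centre and scale one feature vector with previously fitted statistics."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


def predict_1nn(train_X, train_y, x):
    """Supervised ML: return the label of the closest training sample (Euclidean)."""
    dists = [math.dist(x, t) for t in train_X]
    return train_y[dists.index(min(dists))]


# Hypothetical toy data: two AT-rich and two GC-rich sequences.
train_seqs = ["ATATATATAT", "TATATATTAT", "GCGCGCGCGC", "CGCGCGGCGC"]
train_labels = ["AT-rich", "AT-rich", "GC-rich", "GC-rich"]

raw = [kmer_counts(s) for s in train_seqs]
means, stds = fit_scaler(raw)
train_X = [transform(r, means, stds) for r in raw]

# Classify an unseen sequence using the statistics fitted on the training set.
query = transform(kmer_counts("GCGCGGCGCG"), means, stds)
prediction = predict_1nn(train_X, train_labels, query)
print(prediction)  # GC-rich
```

The scaler is fitted on the training sequences only and then reused for the query, so the unknown sample does not influence the transformation it is evaluated with.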