2019 Nov 28;5(12):e000317. doi: 10.1099/mgen.0.000317

Table 1.

Basic description of data workflow

Whole-genome sequences quality control, mapping to reference, de-novo assembly, annotation

| Description | Purpose | Methods |
|---|---|---|
| What describes the data? | Feature extraction | SNPs, k-mers, proteins, … |
| What are the most important descriptors? | Feature selection | Pan and core GWAS, chi-square, recursive feature elimination algorithms, … |
| Can descriptors be combined/transformed? | Feature transformation | Scaling and centring, PCA, MDS, t-SNE, auto-encoders, … |
| Group data by underlying similarities | Unsupervised ML | Phylogeny, k-means, hierarchical clustering, … |
| Find the hidden patterns in a defined class; classify unknown data | Supervised ML | Random forest, neural networks, SVM, k-nearest-neighbour, … |
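
The first two steps of the workflow, feature extraction and feature selection, can be illustrated with a small sketch. It is not the authors' pipeline: the toy sequences, the labels, the function names (`kmer_counts`, `presence_matrix`, `chi_square_2x2`, `rank_features`), the choice of k = 3, and the presence/absence encoding are all assumptions made for illustration. The selection step uses the plain Pearson chi-square statistic on a 2x2 table without continuity correction, one of several ways a chi-square filter could be applied.

```python
from collections import Counter
from itertools import chain


def kmer_counts(sequence, k):
    """Feature extraction: count every overlapping k-mer in one sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def presence_matrix(sequences, k):
    """Build a sorted k-mer vocabulary and a binary presence/absence row per sequence."""
    profiles = [kmer_counts(seq, k) for seq in sequences]
    vocab = sorted(set(chain.from_iterable(profiles)))
    rows = [[1 if kmer in profile else 0 for kmer in vocab] for profile in profiles]
    return vocab, rows


def chi_square_2x2(feature, labels):
    """Pearson chi-square statistic for a binary feature against a binary label.

    Assumption: no continuity correction; returns 0.0 when a margin is empty.
    """
    n = len(labels)
    a = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 1)
    b = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 0)
    c = sum(1 for f, y in zip(feature, labels) if f == 0 and y == 1)
    d = sum(1 for f, y in zip(feature, labels) if f == 0 and y == 0)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def rank_features(vocab, rows, labels):
    """Feature selection: score each k-mer and sort by descending chi-square."""
    scores = []
    for j, kmer in enumerate(vocab):
        column = [row[j] for row in rows]
        scores.append((chi_square_2x2(column, labels), kmer))
    return sorted(scores, key=lambda item: (-item[0], item[1]))


# Hypothetical toy data: four short "genomes" and a binary phenotype label.
sequences = ["AAACCC", "AACCCG", "AAAGGG", "AAGGGT"]
labels = [1, 1, 0, 0]

vocab, rows = presence_matrix(sequences, k=3)
ranked = rank_features(vocab, rows, labels)

for score, kmer in ranked:
    print(f"{kmer}\t{score:.3f}")
```

In practice the top-ranked features would then be passed on to the transformation and learning steps listed in the table; on real genomes a k-mer counter and statistics library built for that scale would replace these helper functions.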