2021 Mar 3;7(3):mgen000531. doi: 10.1099/mgen.0.000531

Table 1.

Quality control metrics of the bioinformatics workflow

| Metric | Definition | Warning threshold | Failure threshold |
|---|---|---|---|
| Contamination | Percentage of reads classified as highest occurring in species other than E. coli | 1 % | 5 % |
| Median coverage against assembly | Median coverage based on mapping of the trimmed reads against the assembled contigs | 20 | 10 |
| % cgMLST genes identified | Percentage of cgMLST genes identified. Only perfect hits (i.e. full length and 100 % identity) are considered [85] | 95 | 90 |
| Average read quality (Q-score) | Q-score of the trimmed reads averaged over all reads and positions | 30 | 25 |
| GC-content deviation | Deviation of the average GC content of the trimmed reads from the expected value for E. coli (50.5 % [86]) | 2 % | 4 % |
| N-content | Average N-fraction per read position of the trimmed reads, expressed as a percentage | 0.5 % | 1 % |
| Per base sequence content | Difference between AT and GC frequencies averaged at every read position. Since primer artefacts can cause fluctuations at the start of reads due to the non-random nature of enzymatic tagmentation when the Nextera XT protocol is used for library preparation, the first 20 bases are not included in this test. As fluctuations can also exist at the end of reads, caused by the low abundance of very long reads after read trimming, the 0.5 % longest reads are similarly excluded | 3 % | 6 % |
| Minimum read length | Minimum read length after trimming (denoted as a percentage of untrimmed read length) that at least half of all trimmed reads must reach (e.g. when raw input reads are 300 bases long, at least half of all trimmed reads should be at least 200 bases long to avoid a warning, or at least 120 bases long to avoid a failure) | 66.67 % | 40.00 % |
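As an illustration of how the thresholds in Table 1 could be applied, the following is a minimal Python sketch, not the workflow's actual implementation. The metric keys, the `Threshold` class and the functions `classify` and `half_reads_length_pct` are hypothetical names introduced here. Two further assumptions are made: the direction of each test (whether higher or lower values are worse) is inferred from the order of the warning and failure values, and a value exactly equal to a threshold is treated as not triggering it, which Table 1 does not specify.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    failure: float
    higher_is_worse: bool  # assumption: inferred from warning vs failure order


# Threshold values copied from Table 1; the keys are hypothetical names.
THRESHOLDS = {
    "contamination_pct": Threshold(1.0, 5.0, True),
    "median_coverage": Threshold(20.0, 10.0, False),
    "cgmlst_genes_pct": Threshold(95.0, 90.0, False),
    "avg_read_quality": Threshold(30.0, 25.0, False),
    "gc_deviation_pct": Threshold(2.0, 4.0, True),
    "n_content_pct": Threshold(0.5, 1.0, True),
    "per_base_seq_content_pct": Threshold(3.0, 6.0, True),
    "min_read_length_pct": Threshold(66.67, 40.0, False),
}


def classify(metric: str, value: float) -> str:
    """Return 'PASS', 'WARN' or 'FAIL' for a single metric value."""
    t = THRESHOLDS[metric]
    # Flip the sign for "lower is worse" metrics so that one strict
    # comparison covers both directions (boundary handling is an assumption).
    sign = 1.0 if t.higher_is_worse else -1.0
    if sign * value > sign * t.failure:
        return "FAIL"
    if sign * value > sign * t.warning:
        return "WARN"
    return "PASS"


def half_reads_length_pct(trimmed_lengths: list[int], raw_length: int) -> float:
    """Largest length reached by at least half of the trimmed reads,
    as a percentage of the untrimmed read length (non-empty input required)."""
    ordered = sorted(trimmed_lengths, reverse=True)
    k = math.ceil(len(ordered) / 2)  # number of reads that make up "half"
    return 100.0 * ordered[k - 1] / raw_length


# Example with made-up values for one sample.
sample = {"contamination_pct": 0.4, "median_coverage": 15.0}
for name, value in sample.items():
    print(name, classify(name, value))
```

The result of `half_reads_length_pct` can be passed to `classify("min_read_length_pct", ...)`. Because Table 1 rounds the warning threshold to 66.67 %, a value of exactly two thirds (200 of 300 bases) would fall just below it in this sketch, so a real implementation would need to decide how to handle that rounding.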