TABLE 3.
Data set name | Species | Phenotype(s) and split | Reference | No. of samples | No. of samples for training/test | No. of genetic features |
---|---|---|---|---|---|---|
TB | Mycobacterium tuberculosis | First-line antibiotic resistance: rifampicin, 1,285:2,257; isoniazid, 1,553:2,011; pyrazinamide, 702:2,445; ethambutol, 975:2,551 | 5 | 3,566 | 2,377/1,189 | 6,400 (SNPs) |
N. gonorrhoeae | Neisseria gonorrhoeae | Antibiotic resistance MICs: azithromycin, cefixime, ciprofloxacin, penicillin, and tetracycline | 53, 61, 83, 84 | 1,595 | NU | 550,000 (unitigs) |
GAS | Streptococcus pyogenes | Virulence, 1,093:637 | 46 | 1,730 | 1,154/576 | 1.1 million (unitigs) |
SPARC | Streptococcus pneumoniae | Antibiotic resistance MICs: penicillin, erythromycin | 47, 85 | 603 | 400/203 | 90,000 (SNPs), 730,000 (unitigs), 10 million (k-mers) |
Maela | Streptococcus pneumoniae | Carriage duration; antibiotic resistance: penicillin, 1,661:1,282; erythromycin, 802:2,355; trimethoprim, 609:2,548 | 12, 44 | 3,162 (antibiotic resistance), 2,017 (carriage duration) | 1,404/703 (carriage duration) | 121,000 (SNPs), 1.6 million (unitigs) |
GPS | Streptococcus pneumoniae | Antibiotic resistance (penicillin) | 1 | 5,820 | NU | 1.7 million (unitigs) |
Netherlands | Streptococcus pneumoniae | Meningitis/carriage, 693:1,144 | 45 | 1,837 | 1,225/612 | 690,000 (unitigs) |
Each data set has a name by which it is referred to in the text. Most data sets have multiple phenotypes available, especially where resistance to several antibiotics is routinely phenotyped. Data sets without a training/test split were not evaluated for internal prediction ability: they were either used with more stringent external validation data sets or used for GWAS only, so all available samples were used to fit the model.
NU, not used.
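
The training/test sizes in the table are close to a two-thirds/one-third division of each data set. The table does not say how samples were assigned, so the following is only a minimal sketch of one way to make such a split reproducibly. It assumes a simple seeded random shuffle; the function name `split_samples`, the `test_fraction` default, and the sample identifiers are hypothetical and not taken from the source.

```python
import random


def split_samples(sample_ids, test_fraction=1 / 3, seed=0):
    """Shuffle sample IDs reproducibly and return (training, test) lists.

    Assumption: a plain random split with no stratification by phenotype.
    """
    ids = sorted(sample_ids)  # fixed starting order, so the seed alone determines the split
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_fraction)  # rounds down
    return ids[n_test:], ids[:n_test]


# Hypothetical identifiers for a data set of 1,730 samples (the size of the GAS row).
samples = [f"sample_{i}" for i in range(1730)]
train, test = split_samples(samples)
print(len(train), len(test))
```

With 1,730 identifiers this gives 1,154 training and 576 test samples, the same sizes as the GAS row. Other rows differ from a strict one-third test fraction by a few samples, so the split actually used may have followed a different rule.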