2020 Jan 21;5(1):e00774-19. doi: 10.1128/mSystems.00774-19

TABLE 1.

Application of random forest models to the differently represented E. coli and M. tuberculosis data sets^a

Species (no. of isolates) and data representation method (no. of features)	AUC, validation data	AUC, test data
E. coli (1,694)
Binary representation (1,119)	0.98 ± 0.01	0.97
Scored representation (2,167)	0.98 ± 0.01	0.97
Scored + binary representation (4,219)	0.98 ± 0.01	0.98
Amino acid representation (52,199)	0.98 ± 0.01	0.97
Nucleotide representation (14,483)	0.98 ± 0.02	0.97
M. tuberculosis (1,785)
Binary representation (6,735)	0.94 ± 0.04	0.92
Scored representation (11,120)	0.94 ± 0.04	0.92
Scored + binary representation (21,975)	0.94 ± 0.04	0.92
Amino acid representation (261,085)	0.93 ± 0.04	0.92
Nucleotide representation (87,205)	0.93 ± 0.04	0.92

^a For E. coli, the model was trained and validated with 1,422 isolates and tested with the remaining 272 isolates. For M. tuberculosis, the model was trained and validated with 992 isolates and tested with the remaining 793 isolates. All of these M. tuberculosis isolates had complete resistance profiles. AUC, area under the curve.
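
For readers less familiar with the metric, the following is a minimal Python sketch of how an AUC value and a "mean ± spread" summary of the kind shown in the validation column could be computed. It is not the authors' code. The function names `roc_auc` and `summarize`, the toy labels and scores, and the fold values are all hypothetical, and the reading of the ± term as a standard deviation across validation folds is an assumption that the table itself does not state.

```python
from statistics import mean, stdev


def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive (label 1)
    scores higher than a randomly chosen negative (label 0).
    Tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarize(fold_aucs):
    """Mean and sample standard deviation of per-fold AUC values
    (assumed meaning of the 'x ± y' entries; not confirmed by the source)."""
    return mean(fold_aucs), stdev(fold_aucs)


# Toy data, invented for illustration only (1 = resistant, 0 = susceptible).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)

# Hypothetical per-fold validation AUCs, not taken from the paper.
fold_mean, fold_sd = summarize([0.97, 0.98, 0.99])
print(f"toy AUC = {auc:.2f}; validation AUC = {fold_mean:.2f} ± {fold_sd:.2f}")
```

The pairwise comparison is quadratic in the number of isolates, which is acceptable for a sketch; a rank-based formulation gives the same value more efficiently on data sets of the size listed in the table.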