2020 Jan 21;5(1):e00774-19. doi: 10.1128/mSystems.00774-19

TABLE 1.

Application of random forest models to the differently represented E. coli and M. tuberculosis data sets^a

Species (no. of isolates) and                   AUC
data representation method (no. of features)    Validation data    Test data
E. coli (1,694)
    Binary representation (1,119)               0.98 ± 0.01        0.97
    Scored representation (2,167)               0.98 ± 0.01        0.97
    Scored + binary representation (4,219)      0.98 ± 0.01        0.98
    Amino acid representation (52,199)          0.98 ± 0.01        0.97
    Nucleotide representation (14,483)          0.98 ± 0.02        0.97
M. tuberculosis (1,785)
    Binary representation (6,735)               0.94 ± 0.04        0.92
    Scored representation (11,120)              0.94 ± 0.04        0.92
    Scored + binary representation (21,975)     0.94 ± 0.04        0.92
    Amino acid representation (261,085)         0.93 ± 0.04        0.92
    Nucleotide representation (87,205)          0.93 ± 0.04        0.92
^a For E. coli, the model was trained and validated with 1,422 isolates and tested with 272 isolates. For M. tuberculosis, the model was trained and validated with 992 isolates and tested with 793 isolates. All of these M. tuberculosis isolates had complete resistance profiles. AUC, area under the curve.
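
The footnote defines AUC only as "area under the curve". Assuming this means the area under the ROC curve, as is common when evaluating classifiers, the sketch below shows one way such a value can be computed from true labels and predicted scores. It is not the authors' code: the function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """ROC AUC by pairwise comparison (assumed metric, illustrative only).

    Returns the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; tied pairs count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = resistant, 0 = susceptible.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.2f}")
```

A value of 1.0 means every positive example is ranked above every negative one, and 0.5 corresponds to random ranking. The pairwise loop is quadratic in the number of examples, so for data sets of the size in the table a rank-based or library implementation would be the more practical choice.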