Table 2.
Dataset | SCMb | CARTb | L1-logistic* | L2-logistic* | Poly-SVM | Naive Bayes | Majority |
---|---|---|---|---|---|---|---|
A. baumannii | 0.849 (2.7) | 0.864 (3.4) | 0.880 (3980.5) | 0.885 (all) | 0.886 (all) | 0.822 (all) | 0.644 |
E. coli | 0.818 (4.6) | 0.808 (7.0) | 0.792 (3727.2) | 0.789 (all) | 0.779 (all) | 0.634 (all) | 0.697 |
E. faecium | 1.000 (1.0) | 1.000 (1.0) | 1.000 (142.0) | 1.000 (all) | 0.996 (all) | 0.808 (all) | 0.588 |
K. pneumoniae | 0.950 (3.9) | 0.949 (4.3) | 0.952 (7607.4) | 0.948 (all) | 0.943 (all) | 0.760 (all) | 0.571 |
M. tuberculosis | 0.963 (4.5) | 0.962 (4.7) | 0.962 (2242.2) | 0.941 (all) | 0.934 (all) | 0.789 (all) | 0.658 |
N. gonorrhoeae | 0.935 (3.0) | 0.936 (3.3) | 0.942 (6095.6) | 0.915 (all) | 0.906 (all) | 0.736 (all) | 0.529 |
P. aeruginosa | 0.939 (1.2) | 0.942 (1.1) | 0.937 (87.8) | 0.828 (all) | 0.773 (all) | 0.768 (all) | 0.588 |
P. difficile | 0.982 (1.0) | 0.982 (1.0) | 0.957 (121.8) | 0.936 (all) | 0.949 (all) | 0.887 (all) | 0.599 |
S. aureus | 0.987 (1.0) | 0.987 (1.0) | 0.988 (230.6) | 0.987 (all) | 0.987 (all) | 0.868 (all) | 0.544 |
S. enterica | 0.913 (1.0) | 0.913 (1.0) | 0.925 (991.2) | 0.929 (all) | 0.920 (all) | 0.759 (all) | 0.709 |
S. haemolyticus | 0.925 (1.0) | 0.925 (1.0) | 0.925 (279.1) | 0.838 (all) | 0.829 (all) | 0.758 (all) | 0.629 |
S. pneumoniae | 0.960 (1.0) | 0.960 (1.0) | 0.948 (1391.5) | 0.949 (all) | 0.946 (all) | 0.910 (all) | 0.654 |
For each dataset, the accuracy is shown along with the number of k-mers used by the model (in parentheses). Results are shown for Set Covering Machines (SCM), classification trees (CART), logistic regression with L1 and L2 regularization and χ² feature selection (L1-logistic, L2-logistic), polynomial kernel Support Vector Machines (Poly-SVM), Naive Bayes, and a baseline predictor that always predicts the most abundant class in the data (Majority). Accuracies within 1% of the maximum value are shown in bold. Results are averaged over ten repetitions of the experiment.
[*] For scalability reasons, these algorithms were trained using feature selection to select the one million k-mers that were most associated with the phenotypes; all other k-mers were discarded (see Methods).
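The feature-selection step described in the footnote can be illustrated with a minimal sketch. It assumes that each k-mer is encoded as a binary presence/absence vector, that the phenotype is binary, and that association is scored with the Pearson χ² statistic on the resulting 2×2 contingency table; the exact procedure used for the table is the one given in Methods and may differ. The function names `chi2_binary` and `select_top_kmers` and the toy data are hypothetical, and the toy example keeps two k-mers where the experiments kept one million.

```python
import heapq


def chi2_binary(feature, labels):
    """Pearson chi-squared statistic for k-mer presence vs. a binary phenotype."""
    n = len(labels)
    # 2x2 contingency counts:
    # a = present & positive, b = present & negative,
    # c = absent & positive,  d = absent & negative.
    a = sum(1 for f, y in zip(feature, labels) if f and y)
    b = sum(1 for f, y in zip(feature, labels) if f and not y)
    c = sum(1 for f, y in zip(feature, labels) if not f and y)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A constant k-mer (or single-class labels) carries no association signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_top_kmers(kmer_matrix, labels, k):
    """Return the k k-mers with the highest chi-squared score, best first."""
    scores = {name: chi2_binary(col, labels) for name, col in kmer_matrix.items()}
    return heapq.nlargest(k, scores, key=scores.get)


# Toy data (illustrative only): 6 genomes, 1 = resistant, 0 = susceptible.
labels = [1, 1, 1, 0, 0, 0]
kmer_matrix = {
    "ACGT": [1, 1, 1, 0, 0, 0],  # present exactly in the resistant genomes
    "TTGA": [1, 0, 1, 0, 1, 0],  # weakly associated with resistance
    "GGCC": [1, 1, 1, 1, 1, 1],  # present in every genome, uninformative
}

selected = select_top_kmers(kmer_matrix, labels, k=2)
print(selected)
```

In this sketch, k-mers that are present in every genome or in none score zero and are discarded first, which is why a filter of this kind can shrink the feature space before training without touching the learning algorithm itself.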