2019 Mar 11;9:4071. doi: 10.1038/s41598-019-40561-2

Table 2.

Comparison to state-of-the-art classifiers in terms of accuracy and model complexity.

Dataset	SCM_b	CART_b	L1-logistic*	L2-logistic*	Poly-SVM	Naive Bayes	Majority
A. baumannii	0.849 (2.7)	0.864 (3.4)	0.880 (3980.5)	0.885 (all)	0.886 (all)	0.822 (all)	0.644
E. coli	0.818 (4.6)	0.808 (7.0)	0.792 (3727.2)	0.789 (all)	0.779 (all)	0.634 (all)	0.697
E. faecium	1.000 (1.0)	1.000 (1.0)	1.000 (142.0)	1.000 (all)	0.996 (all)	0.808 (all)	0.588
K. pneumoniae	0.950 (3.9)	0.949 (4.3)	0.952 (7607.4)	0.948 (all)	0.943 (all)	0.760 (all)	0.571
M. tuberculosis	0.963 (4.5)	0.962 (4.7)	0.962 (2242.2)	0.941 (all)	0.934 (all)	0.789 (all)	0.658
N. gonorrhoeae	0.935 (3.0)	0.936 (3.3)	0.942 (6095.6)	0.915 (all)	0.906 (all)	0.736 (all)	0.529
P. aeruginosa	0.939 (1.2)	0.942 (1.1)	0.937 (87.8)	0.828 (all)	0.773 (all)	0.768 (all)	0.588
P. difficile	0.982 (1.0)	0.982 (1.0)	0.957 (121.8)	0.936 (all)	0.949 (all)	0.887 (all)	0.599
S. aureus	0.987 (1.0)	0.987 (1.0)	0.988 (230.6)	0.987 (all)	0.987 (all)	0.868 (all)	0.544
S. enterica	0.913 (1.0)	0.913 (1.0)	0.925 (991.2)	0.929 (all)	0.920 (all)	0.759 (all)	0.709
S. haemolyticus	0.925 (1.0)	0.925 (1.0)	0.925 (279.1)	0.838 (all)	0.829 (all)	0.758 (all)	0.629
S. pneumoniae	0.960 (1.0)	0.960 (1.0)	0.948 (1391.5)	0.949 (all)	0.946 (all)	0.910 (all)	0.654

For each dataset, the accuracy is shown along with the number of k-mers used by the model (in parentheses). Results are shown for Set Covering Machines (SCM), classification trees (CART), logistic regression with L1 and L2 regularization and χ² feature selection (L1-logistic, L2-logistic), polynomial kernel Support Vector Machines (Poly-SVM), Naive Bayes, and a baseline predictor that always predicts the most abundant class in the data (Majority). Accuracies within 1% of the maximum value are shown in bold. Results are averaged over ten repetitions of the experiment.
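The "within 1% of the maximum" rule from the caption can be applied to any row of the table to see which methods it highlights. The following is a minimal sketch, not the authors' code. It assumes that "within 1%" means an absolute difference of at most 0.01 in accuracy and that the boundary is inclusive; the function name `within_one_percent` and its `tolerance` parameter are hypothetical. The values are the P. aeruginosa row of the table.

```python
def within_one_percent(accuracies, tolerance=0.01):
    """Return the methods whose accuracy is within `tolerance` of the row maximum.

    Assumption: the difference is absolute (0.01 in accuracy) and inclusive.
    """
    # Compare in integer thousandths to avoid floating-point noise at the boundary.
    scaled = {name: round(acc * 1000) for name, acc in accuracies.items()}
    best = max(scaled.values())
    limit = round(tolerance * 1000)
    return [name for name, value in scaled.items() if best - value <= limit]


# P. aeruginosa row, copied from the table above.
p_aeruginosa = {
    "SCM": 0.939,
    "CART": 0.942,
    "L1-logistic": 0.937,
    "L2-logistic": 0.828,
    "Poly-SVM": 0.773,
    "Naive Bayes": 0.768,
    "Majority": 0.588,
}

highlighted = within_one_percent(p_aeruginosa)
print(highlighted)  # ['SCM', 'CART', 'L1-logistic']
```

Under these assumptions, the three sparse or feature-selected models are highlighted for this dataset, while the two models that use all k-mers as well as Naive Bayes fall well outside the 0.01 margin.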

[*] For scalability reasons, these algorithms were trained using feature selection to select the one million k-mers that were most associated with the phenotypes; all other k-mers were discarded (see Methods).
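The feature selection step described in this footnote can be illustrated with a small, self-contained sketch. It is not the authors' implementation, which is described in the Methods. It assumes binary k-mer presence/absence features and a binary phenotype, and scores each k-mer with the Pearson χ² statistic of its 2×2 contingency table. The names `chi2_score` and `select_top_k`, the toy k-mers, and the toy labels are all hypothetical; in the paper the number of retained k-mers is one million.

```python
def chi2_score(feature, labels):
    """Pearson chi-square statistic for a binary feature against a binary label."""
    n = len(labels)
    counts = [[0, 0], [0, 0]]  # counts[feature_value][label_value]
    for f, y in zip(feature, labels):
        counts[f][y] += 1
    row_totals = [sum(counts[0]), sum(counts[1])]
    col_totals = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    score = 0.0
    for f in (0, 1):
        for y in (0, 1):
            expected = row_totals[f] * col_totals[y] / n
            if expected > 0:  # skip empty rows/columns (constant features)
                score += (counts[f][y] - expected) ** 2 / expected
    return score


def select_top_k(kmer_presence, labels, k):
    """Keep the k k-mers with the highest chi-square score; discard the rest."""
    scores = {kmer: chi2_score(col, labels) for kmer, col in kmer_presence.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Toy data: 6 genomes, label 1 = resistant, 0 = susceptible (hypothetical).
labels = [1, 1, 1, 0, 0, 0]
kmer_presence = {
    "ACGT": [1, 1, 1, 0, 0, 0],  # perfectly associated with the phenotype
    "TTGA": [1, 0, 1, 0, 1, 0],  # weakly associated
    "GGCC": [1, 1, 1, 1, 1, 1],  # present everywhere, uninformative
}

selected = select_top_k(kmer_presence, labels, k=2)
print(selected)  # ['ACGT', 'TTGA']
```

A ranking of this kind looks at each k-mer independently, which keeps the cost linear in the number of k-mers and makes it practical as a pre-filter before fitting the logistic regression models.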