2020 Dec 8;11:6293. doi: 10.1038/s41467-020-19612-0

Fig. 3. Simple machine-learning methods can accurately attribute nation-of-origin of engineered DNA.

a Domestic vs. international binary classification accuracy (with the United States defined as domestic) for BLAST, Random Forests (RF), a baseline that predicts the most abundant class in the training set, and uniformly random guessing. Random Forests reach 84% accuracy. b Receiver operating characteristic (ROC) curve for the RF model, with the area under the curve (AUC) shown. c Top-k test-set accuracy of multiclass nation-of-origin classification (excluding the United States, which is classified in (a, b)) for RF, BLAST, a baseline that predicts the most abundant classes in the training set, and uniformly random guessing. With only 33 countries to choose between, guessing the ten most abundant classes reaches 83.8%. * indicates P < 10⁻¹⁰ by Welch’s two-sided t test on 30 × 50% bootstrap replicates compared to BLAST. d Test-set accuracy of RF on multiclass nation-of-origin classification, as in (c), colored by prediction accuracy within each nation class. Enlarged views of Europe and East Asia are shown above. e Nation-specific test-set prediction accuracy correlates with the log10 of the number of training examples. Shown in green is the mean of 30 bootstrap replicates, each subsampling 50% of the test-set examples for each class and computing the prediction accuracy within that subset. Error bars represent the standard deviation across the same replicates.
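
The top-k accuracy, most-abundant-class baseline, and 50% subsampling replicates described in the caption can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy labels, and the choice to subsample without replacement from the pooled test set (rather than per class, as in panel e) are all assumptions made for illustration.

```python
import random
import statistics
from collections import Counter


def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of examples whose true label is among the first k ranked guesses."""
    hits = sum(
        1 for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


def abundance_ranking(train_labels):
    """Rank classes from most to least frequent in the training labels."""
    return [label for label, _ in Counter(train_labels).most_common()]


def bootstrap_accuracy(ranked_predictions, true_labels, k,
                       n_replicates=30, fraction=0.5, seed=0):
    """Mean and standard deviation of top-k accuracy over random subsamples.

    Assumption: each replicate draws `fraction` of the examples without
    replacement; the caption does not specify how the subsample is drawn.
    """
    rng = random.Random(seed)
    indices = list(range(len(true_labels)))
    size = max(1, int(len(indices) * fraction))
    scores = []
    for _ in range(n_replicates):
        chosen = rng.sample(indices, size)
        scores.append(top_k_accuracy(
            [ranked_predictions[i] for i in chosen],
            [true_labels[i] for i in chosen],
            k,
        ))
    return statistics.mean(scores), statistics.stdev(scores)


# Toy, made-up labels purely for illustration (hypothetical classes A, B, C).
train = ["A", "A", "A", "B", "B", "C"]
test = ["A", "B", "C", "A"]
ranking = abundance_ranking(train)      # classes ordered by training frequency
baseline = [ranking] * len(test)        # the same ranked guess for every example

top1 = top_k_accuracy(baseline, test, k=1)
top2 = top_k_accuracy(baseline, test, k=2)
mean_acc, std_acc = bootstrap_accuracy(baseline, test, k=2)
```

With few classes, an abundance-only ranking covers most test examples once k grows, which is why the caption notes that guessing the ten most abundant of 33 countries already reaches 83.8%. A model's top-k accuracy is best read against that baseline at the same k.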