Figure 3:
(A) BiG-SLiCE analysis results for a range of threshold values, as measured by the difference in GCF counts (ΔGCF) and the level of clustering agreement (V-score; 1.0 indicates perfect agreement) relative to the MIBiG curated groups. The threshold with the lowest ΔGCF that still maintained a V-score > 0.8 (T = 1,100) was used as an example for further analysis in this figure. (B) Confusion matrix of BiG-SLiCE clusters vs. curated GCFs. To aid visualization, all singletons in the BiG-SLiCE result (58 GCFs) were collapsed into a single column (leftmost column, blue box), which groups together the BGCs that require a more lenient threshold (T > 1,100) to match the curated information. Conversely, another column, GCF-143 (red box), highlights the need for a stricter threshold (T < 1,100) to obtain a more fine-grained clustering in some parts of sequence space. (C) Distribution of BGC-to-centroid distances (i.e., radii) for within- and between-group pairs in the curated dataset. The centroid of each curated group was calculated by averaging the feature vectors of all BGCs assigned to it. (D) Feature heat map of the collapsed singleton group and GCF-143. Colored bars on the left indicate manually curated groups. In both cases, hierarchical clustering (Euclidean distance, average linkage) shows that the underlying pattern captured by BiG-SLiCE features tends to agree with the manually curated information; i.e., rows with the same color tend to be located near each other.
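
The centroid and radius calculation described for panel (C) can be illustrated with a minimal sketch. This is not the BiG-SLiCE implementation: the function names `centroids` and `radii`, the toy feature vectors, and the group labels are all hypothetical, and Euclidean distance is assumed as the metric because the caption does not state which one was used for this panel.

```python
import math
from collections import defaultdict


def centroids(features, groups):
    """Average the feature vectors of all BGCs assigned to each group."""
    members = defaultdict(list)
    for bgc, group in groups.items():
        members[group].append(features[bgc])
    # Column-wise mean of the member vectors gives the group centroid.
    return {
        group: [sum(col) / len(vecs) for col in zip(*vecs)]
        for group, vecs in members.items()
    }


def radii(features, groups):
    """Split BGC-to-centroid distances into within- and between-group lists."""
    cents = centroids(features, groups)
    within, between = [], []
    for bgc, vec in features.items():
        for group, cent in cents.items():
            # Euclidean distance is an assumption made for this sketch.
            d = math.dist(vec, cent)
            (within if groups[bgc] == group else between).append(d)
    return within, between


# Toy data (hypothetical; not taken from MIBiG).
features = {
    "bgc_a": [0.0, 0.0],
    "bgc_b": [2.0, 0.0],
    "bgc_c": [10.0, 0.0],
}
groups = {"bgc_a": "g1", "bgc_b": "g1", "bgc_c": "g2"}

within, between = radii(features, groups)
```

Plotting `within` and `between` as two histograms would give a distribution of the kind shown in the panel; a threshold that separates the two lists well corresponds to a clustering that matches the curated groups.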