Figure 2. Global prediction performance on enzymes with known function.
(A) Tuning of the FDR-corrected p-value threshold on the validation set. Performance in terms of precision, recall, and F1 score at each threshold is compared to PARSE and DeepFRI. (B) Precision, recall, and F1 score for each method on the held-out test set of rare enzyme classes. Validation set performance is shown as hatched bars for comparison. (C) Function-centric AUPRC for each method as a function of the frequency of each enzyme class (EC number) in Swiss-Prot. Each dot represents the AUPRC for a single class, and error bars represent standard error. Enzyme classes with more than 100 examples are in the validation set, shown again using hatched bars. (D) Analysis of which enzyme classes each method can predict. On the left, the upset plot shows all intersections between the unique EC numbers predicted correctly (AUPRC > 0) by each method. The marginal size of each set is shown by the histograms on each axis. (E) EC numbers predicted correctly by PARSE only, including the products, reactants, and cofactors involved in the reaction (orange in (E)). (F) Error analysis for four sampled functions that were predicted correctly by CLEAN but not PARSE (green in (E)). Products, reactants, and cofactors that are shared between ground truth and prediction are highlighted in yellow and orange.