Table 3.
Performance of the TransClust algorithm in super-family and family clustering for all protein similarity computation methods, reported as Precision, Recall and F-measure.
Method | Dataset | Super-family No. Clusters | Super-family Precision | Super-family Recall | Super-family F-measure | Family No. Clusters | Family Precision | Family Recall | Family F-measure |
---|---|---|---|---|---|---|---|---|---|
Protein sequence based | |||||||||
Blast (e-value) | A-50 | 1596 | 0.997 | 0.261 | 0.350 | 1636 | 1.0 | 0.399 | 0.500 |
Blast (identity) | A-50 | 606 | 0.861 | 0.550 | 0.595 | 660 | 0.781 | 0.668 | 0.631 |
Protein Word frequency | A-50 | 708 | 0.952 | 0.621 | **0.686** | 688 | 0.844 | 0.777 | **0.744** |
ProtVec Avg (word) | A-50 | 655 | 0.927 | 0.620 | 0.681 | 704 | 0.845 | 0.757 | 0.739 |
ProtVec Avg (char) | A-50 | 707 | 0.940 | 0.603 | 0.674 | 707 | 0.842 | 0.746 | 0.729 |
ProtVec MinMax (word) | A-50 | 586 | 0.891 | 0.623 | 0.667 | 704 | 0.829 | 0.741 | 0.718 |
Ligand based | |||||||||
SMILES Word frequency | A-50 | 801 | 0.951 | 0.548 | 0.624 | 957 | 0.934 | 0.658 | 0.704 |
SMILESVec (word, chembl) | A-50 | 621 | 0.921 | 0.621 | 0.677 | 730 | 0.855 | 0.744 | 0.735 |
SMILESVec (word, pubchem) | A-50 | 573 | 0.888 | 0.627 | 0.668 | 692 | 0.839 | 0.751 | 0.730 |
SMILESVec (word, combined) | A-50 | 617 | 0.923 | 0.627 | 0.675 | 764 | 0.873 | 0.732 | 0.735 |
SMILESVec (char, chembl) | A-50 | 636 | 0.920 | 0.621 | 0.678 | 710 | 0.844 | 0.743 | 0.729 |
SMILESVec (char, pubchem) | A-50 | 714 | 0.941 | 0.600 | 0.671 | 715 | 0.845 | 0.744 | 0.729 |
SMILESVec (char, combined) | A-50 | 712 | 0.949 | 0.602 | 0.675 | 712 | 0.850 | 0.749 | **0.739** |
MACCS | A-50 | 589 | 0.909 | 0.629 | **0.679** | 683 | 0.839 | 0.757 | 0.736 |
ECFP6 | A-50 | 611 | 0.917 | 0.627 | **0.679** | 725 | 0.860 | 0.746 | 0.733 |
Note: The best F-measure values for the Protein sequence- and Ligand-based methods are shown in bold.
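
As a quick check on which entries the note refers to, the following minimal sketch finds the highest F-measure in each method group and at each clustering level. The F-measure values are copied from Table 3; the dictionary names and the `best_methods` helper are hypothetical and introduced only for illustration. Ties are reported together, since the table does not indicate how they were handled.

```python
# F-measure values copied from Table 3: method -> (super-family, family).
PROTEIN_BASED = {
    "Blast (e-value)": (0.350, 0.500),
    "Blast (identity)": (0.595, 0.631),
    "Protein Word frequency": (0.686, 0.744),
    "ProtVec Avg (word)": (0.681, 0.739),
    "ProtVec Avg (char)": (0.674, 0.729),
    "ProtVec MinMax (word)": (0.667, 0.718),
}
LIGAND_BASED = {
    "SMILES Word frequency": (0.624, 0.704),
    "SMILESVec (word, chembl)": (0.677, 0.735),
    "SMILESVec (word, pubchem)": (0.668, 0.730),
    "SMILESVec (word, combined)": (0.675, 0.735),
    "SMILESVec (char, chembl)": (0.678, 0.729),
    "SMILESVec (char, pubchem)": (0.671, 0.729),
    "SMILESVec (char, combined)": (0.675, 0.739),
    "MACCS": (0.679, 0.736),
    "ECFP6": (0.679, 0.733),
}


def best_methods(scores, level):
    """Return (best F-measure, sorted names of the methods that reach it).

    level 0 selects the super-family column, level 1 the family column.
    """
    best = max(values[level] for values in scores.values())
    winners = sorted(name for name, values in scores.items() if values[level] == best)
    return best, winners


for group_name, scores in (
    ("Protein sequence based", PROTEIN_BASED),
    ("Ligand based", LIGAND_BASED),
):
    for level, level_name in enumerate(("Super-family", "Family")):
        best, winners = best_methods(scores, level)
        print(f"{group_name} / {level_name}: {best:.3f} ({', '.join(winners)})")
```

With the values above, Protein Word frequency leads the protein sequence based group at both levels, while in the ligand based group MACCS and ECFP6 tie at the super-family level and SMILESVec (char, combined) leads at the family level.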